Presented: 22 June 2020
Presenter: Liz Stokes
Webinar 7 of 8 in the FAIR Data 101 training webinar series.
Hi, everyone. Welcome to the fourth and final module of The FAIR Data 101 course. My name is Liz Stokes. I’m from the Australian Research Data Commons, and I would like to acknowledge the traditional owners of the land on which we are meeting today. For me, based in Sydney, that is the Gadigal people of the Eora Nation. I’d like to pay my respects to Elders past and present and acknowledge that this land has not been ceded. I would also like to extend a warm welcome to any First Nations people who are joining us today for this final module, Reusable. Okay. So today, let’s get into this.
So just front matter, business first, please use the question or chat component for any questions or if you’ve got any tech issues at all today, or if suddenly you can’t hear me. The NBN technician visited on the weekend, I’m pleased to announce, and repaired not one but three broken cables between us and the node. So hopefully today works. I do encourage you to also use the Slack channels after this webinar if you have further questions. You can also tweet using FAIR101 or ARDCtraining on Twitter. There is also a link to our Code of Conduct that we have for this course to ensure that it remains a friendly and accessible opportunity to get into the FAIR Data Principles.
So today I’m going to start with an extended permaculture metaphor. Excellent things grow in compost, like mushrooms and other fungi. So for my first visual metaphor, let’s ground reusability in a permaculture lens which concentrates on the health of the compost and soil for excellent things to grow. Okay. So this module on reusability is going to concentrate on practical applications, okay? The ultimate goal of the FAIR Data Principles is to optimize reusability. So the umbrella principle, that data and metadata are richly described with a plurality of accurate and relevant attributes, is further defined by three sub-principles, or we could call them vice principles if you wanted, which highlight the importance of clear and accessible usage rights, data provenance and domain-relevant community standards in supporting reusability.
So how are we going to get into these? Here are some concepts that I’d like to step through in the next 40 minutes or so, and then we can have questions and answers. I will probably keep these things at a fairly high level, and I know that’s probably a little bit of an oxymoron to go high-level and practical, but I don’t know, that’s probably the tension that we live every day. However, Matthias, on Wednesday, is going to expand on some of these in greater depth, shall we say, and look at FAIR beyond data, into the associated outputs in the glorious ecosystem that is research.
So in practical terms, how do we talk about reusability and what aids data reuse? In one sense, it’s all about the metadata, keeping our eyes on the prize and looking at what the metadata is exposing and facilitating. So data that is available for reuse is accessible. And these are just sharing with you some thoughts that occur to me off the top of my head when I think about what that might mean. So I translate that into, at the click of a button. I don’t have to go deep into scrolling or any convoluted processes to actually access the data. The data is also well-described. It does what it says on the tin, for example, which makes it easier for searching and finding and retrieval.
The data is also familiar, okay? When I’m thinking about familiarity, I’m thinking about things like formats that are in current usage. I’m thinking about the way that data is expressed or encoded, appears in ways that are familiar to its users. That it is easy to cite, so it’s relatively painless for me to tell you where it came from. And also that it is licensed, okay, so that the providers, the creators of that data, are very explicit in how you or I are allowed to use that data.
So let’s start pulling these FAIR Data Principles apart. So at one level, rich description, that metadata are richly described with a plurality of accurate and relevant attributes, is an encouragement for the metadata author, whether they are humans, machines, or data librarians, to be generous with their information, generous with volume and specific with regards to the structure of that data. For me, this brings to mind two things. Firstly, a rich or thick description, and I’d like to get into a little ethnographic story. And secondly, a certain enthusiasm for machine-readable metadata schemas, or rather, documentation of metadata schemas that is machine-readable and FAIR. So, oh dammit, I probably should have put a reference into the slides. I will add it afterwards. Look, it’s looking at me in my notes right now.
So rich description brings to mind Clifford Geertz’s maxim for anthropologists to provide a thick description in their field notes, that is, to go beyond factual or literal descriptions. He provides an example of reporting a wink. So instead of describing an eyelid stretching over an eyeball, he encourages ethnographers and anthropologists to consider providing the context in which that wink might have occurred, so looking at the social and cultural things that are going on as well as a literal description of what is happening.
I bring this up because I’m talking now, in this extended metaphor, about anthropological research practices, okay? These highly descriptive entries and monographs of anthropological research are all part of doing that kind of research. So for other researchers to glean insight requires deep and sustained reading and even if this does become tedious for the human, it actually becomes impossible for the computer, which is unable to filter strings by itself unless someone has manually marked up that text or provided explicit structure to the data or digital information there.
So I’m just going to park that tension here for a bit and then move on to unpacking attributes, which is the aforementioned enthusiasm I have for machine-readable documentation of metadata schemas. And another nice visual metaphor. So remembering that the core value of metadata is that it is structured data about data, I’ll remind you of those concepts of data models that we were talking about in our previous module. So metadata assumes that the research data we are concerned with is always already structured, and this principle goes for the metadata which describes or structures the research data, as well as the data itself. So for the FAIR sharing of research data, these accurate and relevant attributes give us basic information about it: how to find it, what it’s about, and the permitted usage. Content descriptions should cover both the…
Hi, everyone. We’re sorry about the technical difficulties. If you will please bear with us, I will attempt to just take over from where Liz left off.
Am I back?
Oh, sorry. I do not need to take over because Liz is back.
All right, Liz. Let me just…
Where did I get up to?
You had just gotten onto this slide with the rice paddies.
Oh, right. Okay. Excellent.
On with the show.
Awesome. Thank you. Sorry, everyone. I appreciate your patience. Clearly, I did not touch wood when I extolled the virtues of the NBN. Moving right along. Okay. So let’s have a look at this metaphor that I have thrown up for you. So as I was saying, talking about the content and the context, and giving a rich description of metadata and this research data, those research practices that are familiar to us are not always instantly translatable to computational processes. So when we do want to do something like that, we have to take a few steps of structuring our data in ways that make it possible to harness the power of technology.
So, for example, we’re weighing up the purpose of what we’re describing against the utility of describing it. We might not need to describe everything in a rich, semantic ontology, maybe it’s only a few components. It really depends on weighing up the costs, time and effort that are available to us. So in this balance between describing everything and what is fit for purpose for our users, whether they are researchers or data librarians, stewards, et cetera, sometimes it might mean that not all of the detail goes into one long notes field, for example, which, even though it’s tedious, or could be delightful for some humans, is generally impossible for the computer.
So here we have fields which are arranged according to the context, in the shape of a landscape. So now I’m going to really go into this metaphor. Obviously, the fields are also impacting the shape of the landscape and how that landscape is exploited for agriculture. In this image, not all the landscape is structured, terraced rice fields. The farmers have made a decision about how to optimize the land for farming. As you can see, it’s not all one uniform or level field, literally. And in some cases, you can see it’s not structured at all; around the borders of these structured rice paddies, we can see maybe some banana palms or other palms, obviously very rich ecosystems themselves.
So now I’d like to move into an example around Darwin Core, okay? So I’ve mentioned this, I think, a couple of times, Darwin Core being a metadata schema for describing biological things, using standardized terms and elements, some of which are reused from other standard vocabularies. So what Darwin Core does is use the conventions of schema documentation, which are themselves specific standards, to aid machine parsability and human readability once you know what you’re looking for. Okay, so what I’m going to do is, oh, navigate to this website here. Oh, okay. I trust that you can see the Darwin Core basic vocabulary documentation here.
So this documentation lists some versions of the vocabulary, but what I want to draw your attention to firstly is, under this Section 4, the term lists that are part of this vocabulary. You can see we have some fairly standard information about different terms that are incorporated into Darwin Core, okay? So Darwin Core actually borrows terms from the Dublin Core legacy namespace and also from their terms namespace, so from their terms and their elements. There are also some other lists here, and then here you can see, under this IRI, which is like a URI, it’s a resource identifier, okay, it’s a persistent link, you can see that Darwin Core defines its own terms for the purpose of biological description, but it also reuses some from Dublin Core.
So I’m going to click over to the Dublin Core terms, DC terms, and show you here again that they are providing us some information about what they’ve created and how, and under this Section 4 again, terms that are members of this list. Here they are starting to provide us with more information about what terms they are using and what has been borrowed. For example, there is a location term, okay, which is called Location. They provide a definition, a spatial region or named place, and note that it actually replaces a previous term that they had specified. Okay? The modification was in 2008. The same thing here, actually, incidentally, has happened with access rights. So they are using the Dublin Core term for access rights to provide information about who can access the resource or give an indication of its security status. But this documentation here is showing the term that it replaced, which was previously specified in the Darwin Core term list. But now they have decided to reuse a term from Dublin Core. Okay?
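To make that term borrowing concrete, here is a minimal Python sketch of how a term list with a "replaces" history might be modelled as structured data. The term names, namespaces and replacement links below are simplified illustrations, not the official Darwin Core or Dublin Core files.

```python
# Hypothetical, simplified model of a vocabulary term list.
# Term names and "replaces" links are illustrative stand-ins only.
terms = {
    "dcterms:Location": {
        "namespace": "http://purl.org/dc/terms/",
        "definition": "A spatial region or named place.",
        "replaces": "dwc:Locality",  # illustrative only
    },
    "dcterms:accessRights": {
        "namespace": "http://purl.org/dc/terms/",
        "definition": "Information about who can access the resource "
                      "or an indication of its security status.",
        "replaces": "dwc:accessConstraints",  # illustrative only
    },
}

def resolve_replaced(term_name):
    """Return the term this one replaced, if the list records one."""
    entry = terms.get(term_name)
    if entry is None:
        return None
    return entry.get("replaces")

print(resolve_replaced("dcterms:accessRights"))  # dwc:accessConstraints
```

Because the documentation itself is structured like this, a machine can follow a term's history without a human reading the prose.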
So you can see that you don’t have to make all the right decisions at the start maybe, and I’m sure they had some very good reasons for choosing to have metadata about access constraints rather than access rights. And in fact, if I follow this link, it takes us to the Darwin Core Quick Reference Guide. So this is where the machine-readable documentation is complemented by much more human-readable documentation. Because here in this Quick Reference Guide, when we look at some of these attributes, we get to see information that’s going to be much more relevant to a human interpreter of the Darwin Core terms than a machine.
So humans need to know, we like to have a bit of a comment and a definition and examples. Examples are really good for humans, so we know how to apply it and what to expect. We can see that here in this modified term, for example: it relates to the date and time when a resource was changed, and that it conforms to a specific ISO standard about how to present that time. Okay? That’s the YYYY-MM-DD and then the little time section after it. Yeah.
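As a small aside, that date-and-time standard, ISO 8601, is the one most programming languages produce out of the box, so conforming to it usually takes no extra work. A minimal Python sketch, with an arbitrary example timestamp:

```python
from datetime import datetime, timezone

# An arbitrary example timestamp, rendered in the ISO 8601 form
# that a "modified" metadata field expects.
when = datetime(2020, 6, 22, 9, 30, 0, tzinfo=timezone.utc)
print(when.isoformat())  # 2020-06-22T09:30:00+00:00
```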
But anyway, so we’ve got standard ways of talking about metadata that’s useful for humans to interpret and use and apply. We also have, if I went backwards, or I’m not sure how I could go backwards in that right now, the kinds of terms that the computer is going to want to know. So it wants to know a bit more about the structure, whether something’s a class or a property, and the semantic level of how the terms relate to each other, okay? Because knowing what values are allowed, how often a field or element can be repeated, and whether it’s mandatory or not, so the usage of that metadata description, has implications for the things that we might want to do managing that research data at scale, okay?
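To illustrate the kind of rules the computer cares about, here is a hypothetical sketch of a schema with mandatory and repeatable flags, and a check against it. The field names and rules are invented for illustration; they do not come from Darwin Core or any real standard.

```python
# Invented example schema: which fields must be present,
# and which may hold more than one value.
schema = {
    "title":   {"mandatory": True,  "repeatable": False},
    "creator": {"mandatory": True,  "repeatable": True},
    "subject": {"mandatory": False, "repeatable": True},
}

def validate(record):
    """Return a list of problems found in a metadata record."""
    problems = []
    for field, rules in schema.items():
        values = record.get(field, [])
        if rules["mandatory"] and not values:
            problems.append(f"missing mandatory field: {field}")
        if not rules["repeatable"] and len(values) > 1:
            problems.append(f"field not repeatable: {field}")
    return problems

record = {"title": ["Phytoplankton counts"], "creator": []}
print(validate(record))  # ['missing mandatory field: creator']
```

It is exactly this kind of mechanical check that makes bulk ingest and transformation of records possible at scale.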
So decisions that, for example, a data repository might want to make when they’re considering bulk ingest of records or transforming records so that they can apply some long-term preservation work, this is where it matters what the rules are for managing that data. So we really like the metadata to aid that reusability to be very clear and very accurate when it comes to identifying attributes of that data. This is getting really circular reference-y, isn’t it? Okay. So let’s move on. Great. Darwin Core. Okay.
Now it’s time to talk about licensing. Oh, I better speed up. Okay. So first I’m going to address licensing and then data citation, okay? When the aim is actual reuse of research data, this principle encourages us to be clear about how people can do that. So where a data citation tells you where you got data from and can aid provenance in that way, a license sets out your expectations for others to follow. I just kind of have to acknowledge that this is probably the most meta of slides I could ever give you right now. I’ll try not to get us lost.
So many licenses actually feature attribution as an expectation of usage. And as this slide from Quill West says, “The purpose of attribution is to give credit to the original creator of something you are using. It relates to the thing and is a legal requirement of using openly licensed works.” Okay? Because this reflects the fact that a license is a legal instrument. So although licensing research data tends to come up at publication points, research data could be licensed during any part of the research lifecycle, during planning or negotiating with potential collaborators, for example. I’m going to concentrate on Creative Commons licenses after I make a few notes about Australian copyright law, drawing from the ARDC Research Data Rights Management Guide, on the next slide.
But as you can see, I am reusing this particular slide, which is from Citations Versus Attributions by Quill West. They have licensed it under the Creative Commons Attribution license, CC BY 4.0, which is an international license. You can see even on this slide there is a picture of a LOL cat which has been attributed and is under a Creative Commons Attribution-ShareAlike 2.0 license. They even acknowledge that this was a derivation from the original work, which was pretty much the picture of the cat without, “Oh hai, I ophen soarzd dis for u.”
Okay. Back to Australian copyright law. So, look, it’s complicated, but it’s a fun time, okay? So the conventions of academia to comply with copyright have developed citation and attribution practices, okay? While it is true that Creative Commons licenses can only protect material in which copyright or similar rights exist, there are two important considerations at play. Firstly, the strict determination of whether copyright exists in a dataset can be complicated, and some datasets will definitely attract copyright. Secondly, for those data publishers and researchers who wish to broadly share their data, protection is not the primary objective in their selection of a particular license or rights statement. Rather, in that case, the dual objectives in the selection of a license are, or should be, to unambiguously declare to everyone that the data can be reused and to indicate that the licensor would like to be attributed when someone does so. So the bottom line is, regardless of whether copyright exists or not, you can still apply a license to instruct how people might use the data you are making available.
So, as you can see on this slide, Australian law doesn’t recognize copyright in machine-generated data, but it does recognize the impact of human authorship, which is demonstrated through creativity in the selection and arrangement of data. So if you have some raw data and you have analyzed it, corrected it, reformed it, or made modeling choices, this may actually influence whether or not copyright subsists in that dataset. But it is always a case-by-case basis in Australian law. The final important point is to know that rights in data usually rest with the creator of that data.
So this is why we advocate for the use of licenses to make something reusable, when we want to be sure that we have a method of giving people a license to reuse the data. So we don’t want to be resting on the conventions of citation alone. Creative Commons’ suite of licenses has various levels of usage which you can bolt on to your data assets. And it’s important to acknowledge that they don’t waive or replace copyright. The image we have here is the Attribution license, which is quite popular because it’s relatively easy to apply and matches well with your standard academic citation practices. Behind the CC Attribution license is a legal instrument that works internationally; that’s version 4.0 there. To apply the license, you display this image and the words below, which link to a human-readable version, a machine-parsable version and the legal instrument, which you are welcome to read. Okay.
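As a sketch of what the machine-parsable side can look like, here is a hypothetical way of embedding a license link in a dataset's metadata record. The field names are illustrative only; real schemas such as schema.org or DataCite define their own license properties, but the idea of pointing at the canonical license URL is the same.

```python
import json

# Illustrative metadata record: the license is declared by name
# and by its canonical Creative Commons URL, so both humans and
# machines can resolve what reuse is permitted.
dataset = {
    "title": "Example dataset",  # invented example
    "license": {
        "name": "CC BY 4.0",
        "url": "https://creativecommons.org/licenses/by/4.0/",
    },
}
print(json.dumps(dataset, indent=2))
```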
So let’s have a look at what some of these licenses look like in the wild. I’m using an example here from the Australian Ocean Data Network Portal, and this is a dataset about Australian phytoplankton. As you can see there, they’re using the Attribution License, CC-BY. Okay. The portal actually licenses all data within their repository as CC-BY in their data use acknowledgment statement. Over here in the metadata record, and I’m providing a screenshot of that, are some additional constraints to attribution depending on the parts of the dataset that you might be using and how to provide attribution.
Another example is from the Atlas of Living Australia, and this is for the little magpie record, the Australian magpie, I should clarify, now that I know the difference roughly. Gymnorhina tibicen. So the Atlas of Living Australia actually gets data from a lot of different data providers, and they are all welcome to use different licenses. This particular image here, which is attached to the magpie occurrence record, is licensed under a CC Attribution Non-Commercial license, that’s what NC stands for, by a contributor called Wingspanner. That is just for the use of the image, okay? That is not the whole record. There are other components there.
So, if I do that special screen sharing thing again. Oh, okay, let’s try not to make this terrible. Ah, here we are. Okay. So looking at the actual record here, I’m going to scroll down so you can see on this page the provenance, or who is providing data to this record, shown by this little “Provided by…” and the links to the datasets which are being supplied, contributed from various data partners. Actually, if I scroll up, you can look over here and under the data partners tab, I’ll just click that now, we can see which data partners are providing what datasets and under which terms. You can see a whole range of CC-BYs, Creative Commons licenses. Sometimes they have non-commercial limitations on them, sometimes they only specify attribution, okay? Cool.
So this is a nice slide into provenance and I’m probably going to finish up with this, looking at provenance, and then I’ll let Matthias take you deeper into the domain relevant community standards. So what does provenance mean? Okay. Well, ultimately I think it’s about asking what it is useful for the users, the researchers, to know about how the data was created, okay? Often, it’s not until somebody goes to actually reuse someone else’s data that they realize what is actually practically useful in terms of how the data was created or generated and what processes had been applied to that data.
So provenance is something that allows people to trust data, so that they know where it comes from, how it was created, and can be aware of limitations. For example, if we’re thinking about a temperature sensor, it might actually only do measurements in whole degrees. Now, we all know that temperature changes over, I suppose we could say, degrees of degrees. So if you had a dataset that only reported temperature at particular times in whole degree terms, then you would need to be careful about visualizing that data and the implications for further analysis when that data had been normalized in that way.
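A tiny Python sketch of that whole-degree effect, using made-up readings, shows how genuinely different measurements collapse to the same reported value once the sensor rounds them:

```python
# Invented example readings: three distinct temperatures become
# only two distinct values after whole-degree rounding, which is
# exactly the limitation a reuser needs provenance to know about.
readings = [21.2, 21.7, 22.4]
reported = [round(r) for r in readings]
print(reported)  # [21, 22, 22]
```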
Here is a tale of two sensors, and I would like to acknowledge that this is actually from Matthias, if you are wondering about the provenance of this example, from when we were talking about how we talk about reusability and what is helpful information to know about sensors that may have been used in the collection of data. So these sensors collect data on humidity and temperature. The DHT11 and DHT22, the 11 being the blue one, actually, and the 22 the slightly larger one, have different ranges of humidity readings. They are optimized for different temperature ranges and they also perform differently. So they have different rates of accuracy and sampling rates, and there is a slight cost difference.
So when sensors or instruments, any kind of research instruments, have similar names, the provenance is important because different capabilities will produce different results. So if we were using this particular sensor in a data collection activity, it would be very helpful to record the exact sensor name and then link out to the properties or attributes of that sensor, because then we would know what degree of accuracy we can infer from the results of the data that that sensor collects. So our decision-making, in terms of the FAIR principle of reusability, is about being clear and accurate when it counts.
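Here is a hypothetical sketch of that idea: recording the exact sensor model in a dataset's provenance lets a reuser look up its capabilities. The registry and its figures are only indicative of typical DHT11/DHT22 datasheet values, not authoritative specifications.

```python
# Hypothetical sensor registry: humidity range in %RH and
# accuracy in +/- %RH, indicative values only.
SENSORS = {
    "DHT11": {"humidity_range": (20, 80),  "humidity_accuracy": 5.0},
    "DHT22": {"humidity_range": (0, 100), "humidity_accuracy": 2.0},
}

def accuracy_for(provenance):
    """Look up measurement accuracy from the sensor named in provenance."""
    spec = SENSORS.get(provenance.get("sensor"))
    return None if spec is None else spec["humidity_accuracy"]

# With the exact model recorded, the accuracy is knowable;
# with a vague or missing name, it is not.
print(accuracy_for({"sensor": "DHT22"}))  # 2.0
print(accuracy_for({"sensor": "unknown"}))  # None
```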
I’m going to skip this one, I think, in the interests of having a chat about and answering any questions you might have. But you can have a look through. I’ll share the slides after this and you can have a little look through here. This is a nice example of how changes in the actual processing analysis pipeline mean that research teams can get very different results from the same dataset. But it’s probably time for me to wrap up right now. There are some links in it. It’s a good read. Hey, Matthias.
Hey, Liz. Thank you very much for that, and thanks for a graceful recovery that saved me from having to deliver your presentation. We do have some questions in, but there is time for more questions than the number we have, so please do type your questions into the question module as Liz and I address these first ones.
Okay. So back at about the 12 minute mark, you were talking about the relevance of metadata attributes and someone in our audience says that sounds highly subjective and asks whether we have any guidance for how broadly they should think with respect to assessing relevance.
Ah, I’m going to be very candid and it kind of goes to, is… No, okay. So my candid answer is what can you be bothered with? Right? Perhaps the more prudent response would be, what is fit for purpose? So you don’t have to collect all of the metadata, but what is the metadata that really counts? For example, for a repository to provide a reasonable finding aid to the contents in their data repository. So how much aboutness do they need to know about the research data in order to make it easy for people to find the stuff that’s in their repository? Also, this is a pun, it’s going to happen, how fair is it on researchers and data stewards, data curators, people contributing data to repositories, to ask them to provide extensive, descriptive metadata about what they’re producing?
So you’ve got to balance it up and often, I think, you want to be looking at what metadata can you automatically pull from other organizations or other enterprise systems first? Then it’s the last resort that you want to ask the contributors to put in extra data themselves. Also, I guess, it kind of depends on the community standards, what people find acceptable to provide.
Okay, great. Thank you. Okay. Next we have a jargon busting question. So it was possibly a little confusing in that list from the ALA, how there were all those different kinds of CC licenses. So this particular question asks, is CC-BY a different license to CC-BY 4.0?
Yes. I will expand. There are different versions of licenses. So as the practice of openly licensing outputs develops, there are different versions which work according to different legal jurisdictions. So for example, the Creative Commons Attribution license, Version 3 or 3.0, works in an Australian context, okay? Version 4 is the international version of that license and it also happens to be the latest. So many people have come to the position of, “Well, you know what? Let’s just apply the international license, because then that will just work everywhere and we don’t have to worry about gate-keeping and geographic borders. That is just one step too far.”
Great. Thank you. Okay. Another question here, would you say there is a preference when choosing a Creative Commons license for datasets, especially when we want data to be open? Does it depend on the researcher’s preference or choice?
Yes. Yes. What I haven’t talked about at all are institutional policies around intellectual property, okay, and how that plays into who is providing the data and for what purposes. So this is a familiar tension to many of you publishing HDR student theses, okay, and managing different rights there. But go back to the question, because I think I was coming to the point, but I’ve forgotten it.
Yep. So my understanding, is there is a preference for choosing a Creative Commons license over perhaps any other kind of license when it comes to datasets?
Yeah. So I think that, with the Creative Commons licenses, I would recommend them because they are straightforward and they work well for the purposes of sharing data. Sometimes your data could be really, really old, okay? Or your resources could be really, really old. So in fact, copyright may not even come into it and you may be able to use something like the Public Domain Mark, which is another thing that the Creative Commons suite includes, for putting things in the public domain and, hey, there are weeds there. Okay. I can see Matthias starting to feel anxious.
Yeah. Okay. So we’ve got one last question and I might handle this one if it’s okay, Liz. So Liz, you asserted that rights to data usually rest with the creator. Can an institution assert their right to IP for data generated by academics in their employ? Also, could a funding body assert the same as a part of the employment or funding contract? Now, the reason why I wanted to answer this one is because I have been through this process. So for an academic, as an employee of the university, any IP they generate during the course of their work would naturally fall to their employer unless there’s been a contract signed saying otherwise.
So, for example, many institutions will allow their academics to hold the IP of their research outputs, their publications, sometimes even their teaching materials. But these agreements don’t generally cover data. In fact, in the past I have signed an extra contract. So I had my employment contract but on top of that, when I worked on a particular project, I was asked to sign an extra piece of paper that explicitly stated that the output of this project belonged to the institution. Now, strictly speaking, that second bit of paper wasn’t necessary but it was certainly an instrument that the institution wanted to use to protect its own IP.
The same goes for funding contracts as well. So for example, the ARC does specify that publications should be released or should be made openly available and other funding bodies do the same. Now, they don’t necessarily go as far as saying that they own the research outputs, but they certainly do stipulate a particular kind of licensing or access that should be used there. Did you have anything to add to that, Liz?
Nope, I think you handled that wonderfully.
Okay. All right. That’s actually all the questions we have and I am sorry, we’ve run a little bit over time, but I will hand over to you, Liz, to wrap up.
Oh, that’s it, everyone. Matthias will follow up on Wednesday with a bit more detail in community standards and looking at reusability with reference to reproducible workflows, so stay tuned for that. We’ll have quizzes and activities ready for you for Wednesday, I hope. Okay, see you later. Bye.