Presented: 13 May 2020
Presenter: Liz Stokes
Webinar 2 of 8 in the FAIR Data 101 training webinar series.
Good day, everyone. Hello, and welcome to webinar two, in the Findability module of this FAIR 101 course. My name is Liz. I’m from the ARDC, and I would like to acknowledge the Gadigal people of the Eora Nation, the traditional custodians of the land on which I’m standing. I would like to pay my respects to elders past and present, and acknowledge any First Nations people who are joining us here today.
So welcome, and thank you very much for your patience as I made my way through some incredibly poorly timed tech snafus today. So thank you again. I would like to introduce the team that is bringing you this course. We’re almost all here. We have Andrew White and Nichola Burton from our engagements team, if you’d like to wave to our people. Matthias and I are from the skilled workforce team at ARDC, and we have webinar maestro Susanna from our communications team in all her headset glory, keeping this webinar ship running today.
So I just wanted to make sure that everyone here was able to see the people behind this, because it’s not just the presenting; there’s a lot that goes on behind the scenes. We are like an iceberg, aren’t we? Okay. Well, on with today’s show.
I’m going to concentrate on the role that metadata plays in facilitating FAIR data. So thank you for your enthusiasm so far. It’s been great to see so many introductions on the Slack channel, and thank you for your lovely feedback from our first webinar. A few participants have asked some really interesting questions after that, so we’ll respond to these hopefully by the end of this week, and we’ll put answers to those in the Slack general channel.
Also, Matthias will share the links I refer to in today’s presentation with you in the chat as we go along, so as an optional activity, you can follow along with me if you like. Okay, if you do have any questions today, please pop them into the question window as we go. You’ll notice there may be a chat window as well, and you can have a go at that one too, but just to prepare you, we’ll probably put all the links into the chat window, which will hopefully appear smoothly for you, and of course, if you’re the tweeting type, feel free to use the hashtag #FAIR101 as we go along. Okay, let’s move it along.
So, welcome. There’s the Slack information and the Twitter. So if you haven’t joined our Slack yet, please do so, find your equivalent at another institution in the introductions channel, and please use the general channel for questions or comments for everyone.
Now, there’s also a link to our code of conduct; if you haven’t had a chance to look at that, please do. Now, today’s overview. We’re going to get into metadata. Our big focus is looking at what rich metadata for research datasets look like. We’ll have a little poke around Research Data Australia, and I’ll also highlight a few recommended discovery platforms, and wrap up with the activities, the quiz and some preparation for the community discussion next week. So if you are metadata-ly inclined today, I hope this webinar meets your expectations. Whoa! Running ahead there.
So the FAIR principles we’re going to cover today are that data are described with rich metadata, and no, I don’t mean bling; that metadata do not hide or disguise the identifier of the data they describe; and that both metadata and data are registered in a searchable resource.
So all of this speaks to how metadata helps make things findable at both the item level and the repository level. So metadata, let’s get meta. My garden variety definition for metadata is that metadata are structured data about data. So these are standardized methods of describing research data so that humans and machines can understand what the data are about.
Metadata are the primary tool for finding and retrieving data about almost anything, and the more thoroughly something is described in metadata, the easier it is to find. They are also organized in schemas, which also have their own metadata standards. It’s really possible that I might go into a rabbit hole at any point because of this metadata topic, but I will do my level best not to fall down quite so early, and the main point of having structured data about data is that it is machine-readable and human-readable. Though, perhaps if you’re a beginner at this, you may question the human readability of it, but once you get used to it, it’s really nice.
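To make “structured data about data” concrete, here is a minimal sketch of a metadata record. The field names loosely follow Dublin Core elements, and every value, including the DOI, is invented for illustration:

```python
import json

# A minimal, hypothetical metadata record: structured "data about data".
# Field names loosely follow Dublin Core; all values are invented.
record = {
    "title": "Coastal water temperature observations",
    "creator": "Example Researcher",
    "subject": ["oceanography", "water temperature"],
    "date": "2020-05-13",
    "identifier": "doi:10.1234/example",  # a made-up DOI
}

# Because the record is structured, a machine can serialize, exchange
# and search it without guessing where each field begins and ends.
serialized = json.dumps(record, indent=2)
parsed = json.loads(serialized)
```

A human can read the same record at a glance, which is the point: one representation serves both audiences.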
So, types of metadata. Broadly speaking, these are a few different types here, and I’ve described these on the slide. It’s useful to know the purpose of different types of metadata so that you have a rough guide for assessing the quality and completeness of the metadata, and ultimately the findability and fairness of a research dataset or collection.
Okay, so here we have descriptive metadata, describing the content and context, helping people make value judgments about the research data. Structural metadata contains information on the relationships inherent to the dataset, how it’s assembled and its versions, for example. Administrative metadata is for the people who are managing and curating the data, so potentially quite a large portion of our participants today. It’s quite a large category, and for the FAIR principles, I just wanted to highlight the following subcategories. Technical metadata is useful for systems, software, and services: are the data in a compatible file format, for example? Access and rights metadata tell us who is allowed to access the data and under what conditions, and preservation metadata keeps a record of actions taken to preserve the data and metadata into the future.
So, let’s take a look at some Crossref metadata in action. This diagram shows how, when publishers register their resources with Crossref, the metadata about those research outputs, including the all-important persistent identifier we learnt about from Matthias, the DOI, is exchanged with Crossref and with all the systems which use that metadata to credit and cite the work, report the impact of funding, and track outcomes. Crossref has just released some new educational materials and documentation on their website, and I recommend you check it out. It’s really nice looking and quite clear.
Let’s move on. So here are a few examples of metadata schemas, and this is one of my techniques for not going down into a rabbit hole. I’m just going to give you a couple of examples here.
So we’ve got Dublin Core, which many of you may be familiar with; it’s a very common metadata schema, used for describing resources on the web. Schema.org is what the major search engines use, and it’s growing in use across commercial applications. DataCite and Crossref both describe research data and research outputs, and these two are the kinds of metadata schemas that I would encourage you to look at in a deeper way in your own time, if you’re interested in taking this a little further. Okay, and of course, here’s a little hat tip to a couple of disciplinary metadata schemas. The Data Documentation Initiative is one that has been developed for the social sciences, and Darwin Core is one used in the biological sciences. But I should give you a word of warning, because metadata reflects what humans and machines thought was important to know at the time.
As technology and services evolve, so too do the standards we have for sharing and storing data. So when you’re searching for data, or you’re creating metadata records for data, keep in mind that there isn’t really one single standard practice, or universal standard, or vocabulary to rule them all, right? The methods for searching, and the terms that people use are driven by why people are looking, and how that data is stored.
So let’s look at some examples. Move that over there. I’m going to show you a couple of excerpts of metadata records in XML format, Extensible Markup Language format, I should say. Then I’ll take you over to Research Data Australia, and look at the metadata of another research dataset.
So what we have here, and I hope this doesn’t freak you out too much; I’ve certainly been there. Here’s an excerpt of a metadata record, for some software actually, and you can see at the top, under the resource element, the metadata schema that they used. They’re using an XML namespace, and the schema location, which is cut off, but we get to see schema.datacite.org, and there’s the metadata schema they are using. So the metadata identifies the schema, and it also identifies the identifier, okay?
You can see in this element here, under identifier, it tells us what type of identifier it is, a DOI, and then we get the content, or the value, of that metadata, and here it is, there’s the DOI. And just to draw your attention down here to the metadata about the creators: we can see there are a few different creators, but this first one is pretty special, because AT Zielinski also has a name identifier attached to them. We can see the scheme URI, the Uniform Resource Identifier, points to ORCID, and the name of the identifier scheme is ORCID. So it tells the machines that too, and here we can see the 16-digit ORCID identifier in the value component there.
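To make that structure concrete, here is a cut-down sketch in the shape of the DataCite elements just described. The element and attribute names (identifier, identifierType, nameIdentifier, nameIdentifierScheme, schemeURI) follow the DataCite pattern, but the DOI, creator and ORCID values are invented for illustration, and a real record also carries the full DataCite XML namespace and many more elements:

```python
import xml.etree.ElementTree as ET

# A hypothetical, cut-down record in the shape of DataCite's elements.
# All values (DOI, creator, ORCID) are invented for illustration.
xml_record = """
<resource>
  <identifier identifierType="DOI">10.1234/example-dataset</identifier>
  <creators>
    <creator>
      <creatorName>Zielinski, A. T.</creatorName>
      <nameIdentifier nameIdentifierScheme="ORCID"
                      schemeURI="https://orcid.org">
        0000-0002-1825-0097
      </nameIdentifier>
    </creator>
  </creators>
</resource>
"""

root = ET.fromstring(xml_record)

# The attribute tells a machine what kind of identifier this is...
identifier = root.find("identifier")
doi_type = identifier.get("identifierType")   # the attribute: "DOI"
doi_value = identifier.text                   # the value: the DOI itself

# ...and the creator's name identifier declares its scheme, ORCID.
name_id = root.find(".//nameIdentifier")
scheme = name_id.get("nameIdentifierScheme")
orcid_value = name_id.text.strip()            # the 16-digit ORCID
```

This is the same pattern the speaker points out on the slide: the metadata names its own conventions, so both humans and machines can tell a DOI from an ORCID without guessing.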
So this is an example of how metadata has its own standards for being set out, and there are a few conventions that I’m not going to go deep into right now, which help the metadata and the data be readable by both humans and machines. Let’s look at another example here.
This one… I apologize for the text being so small here. This is another research dataset metadata record, and its title is identification of putative novel specific targets of mir-210 in A549 human adenocarcinoma cells. What I want to draw your attention to is the subject element here. In the subject here, we see not just keywords or subjects, which tell us what this research dataset is about, but also some attributes that are specified, and this first one here might be familiar to those of you who work in the health sciences and know MeSH terms.
MeSH stands for Medical Subject Headings, and it’s a vocabulary of medical terminology that is used in many medical and health databases, and you can see there, they’re not only using MeSH terms, but also a few other vocabularies. We’ll come back to vocabularies, controlled vocabularies and some ARDC vocabulary services later on in this webinar series, but for now, I’m just highlighting that.
Now I think it’s time to have a little look at Research Data Australia. So you can see the URL up the top there, researchdata.edu.au, and I’m going to invite you to hop along and see if you can jump over there. So here we are in Research Data Australia. I wanted to introduce you to this because it’s a metadata aggregator that we run. You could also call it a repository of metadata. So it is a repository of sorts, but what we do is we aggregate the metadata from lots and lots of different research datasets from research repositories around the country.
Let’s see. Two Rocks mooring. So what I wanted to do here was take you over to one of these metadata records, here we go, Two Rocks mooring, and show you that this record has been contributed to us by CSIRO from their data access portal. So if I scroll down, you’ll see a brief description, and we get some important information, such as the licensing and access details, who it’s related to, and some time period and geographic location information. And down here, there’s a little section for the DOIs, and we can see there’s a local identifier, and there’s also the official DOI here, and look, there’s a little DataCite logo.
What I’m going to do though, is actually scroll right down to the bottom and click over here on the bottom right-hand corner, which says registry view, and this allows us to see… Aha! This allows us to see the same metadata record, but now we get to see which metadata elements and attributes are being used. So again, if this is the first time that you’ve seen a metadata record in nested table format, breathe deeply along with me, and we’ll just have a look and see what we can see, okay?
So up here, we’ve got some basic information about who it came from and how, and here we see the name information, and you get to see how this view separates the values from the attributes and the metadata elements right here. So if I highlight the identifier field here, this has metadata for identifiers. We’ve got two of them, one is a local identifier, one is a DOI, and here are the separate values. Another good example of the conventions for setting out metadata is the date, within coverage and temporal coverage, so time.
Here, we have a date, and this is the date format, the W3C date and time format, okay? It tells us what type of date it is, the start date or date from, and here is the value, and you can see that it’s written in that date-time format: YYYY-MM-DD. I can feel I’ve gone down into a data librarian rabbit hole, so I’m going to pull back out of that and see if I can switch back over to my main screen.
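Because that YYYY-MM-DD convention is fixed, a machine can check it mechanically. Here is a minimal sketch using Python’s standard library; the function name is ours, for illustration, and it checks only the date-only form of the W3C format, not the longer date-time variants:

```python
from datetime import datetime

def is_w3c_date(value: str) -> bool:
    """Check whether a string matches the YYYY-MM-DD date form
    used in the W3C date and time format (illustrative helper)."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False
```

So "2020-05-13" passes, while "13/05/2020" or an impossible month like "2020-13-01" would be rejected, which is exactly why repositories standardize on one convention.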
Okay, so coming back to discovery platforms. You may be wondering, “Well, where can I find out about other metadata discovery platforms, and how do I find other research data repositories that may exist?” Well, one answer is provided by DataCite: they have a site called re3data, the REgistry of REsearch data REpositories, which collects and indexes the research data repositories that exist around the world. You can browse this registry by subject or country, so it’s really useful if you’re looking for things in your own backyard, or for repositories in a particular subject or discipline, and the metadata that re3data uses also covers terms, standards and licenses for those repositories. So it’s actually quite useful if you’re comparing different research data repositories for your purposes.
Okay, let’s move on. Zenodo is a multidisciplinary data repository hosted by CERN. I’m highlighting it because it’s very big and quite open, so you could put almost anything on there. They have a wide variety of data types and content types, and it’s a very easy way to get a DOI for a resource and link it to your ORCID. As you can see over here, we’ve got an ORCID for one of the contributors, and you can probably get into that a little later. Actually, I would like to highlight that Zenodo has a communities function, which is essentially a grouping tool that allows you to collect certain resources from across the Zenodo repository and put them together in a shortlist, or a community. There’s actually a great community that the Digital Curation Centre curates for research data management resources, so I recommend you have a look and explore that community for their resources.
Moving over to Data Dryad. Data Dryad was one of the first really popular research data repositories. It used to be based in the life sciences, but now it is much broader, and they’ve recently launched a service for institutions to host their research data. They currently host data for a lot of journals, and Data Dryad is a service provided by the California Digital Library. They have a data publishing charge, and they do attempt to preserve data indefinitely, but of course, it’s up to the researcher or the contributing people to prepare the context for that. One thing I would like to note is that all content in Data Dryad has a CC0 license, which puts it in the public domain. They also have a great FAIR guide already, and there’s a link to their best practices for making data FAIR in the links that I trust Matthias is sharing with you right now.
Okay, another good repository is ICPSR, which stands for the Inter-university Consortium for Political and Social Research. So you can see, with a name like that, you’d want to learn the acronym. I’m highlighting ICPSR here because they have been awarded the CoreTrustSeal. The CoreTrustSeal is a certification process for repositories, and it’s a really solid step towards formalizing your organization’s commitment to facilitating the creation and use of FAIR data. So if you’re interested in taking that further, I would recommend you have a look at CoreTrustSeal certification.
The other reason I wanted to highlight ICPSR is because they use the DDI schema that I mentioned earlier, the Data Documentation Initiative. This really helps researchers produce material that is findable and well structured into the future, because all contributions to the repository have to comply with the minimum standards of that metadata schema, and they also provide some really great guidance on the preservation and archiving of social science data. And let’s wrap up our hit list of recommended repositories with a hat tip to the Australian Data Archive. I’m sharing this example because they have also achieved CoreTrustSeal certification, but also because not all data that are available in the Australian Data Archive are actually open. They do provide mediated access to some data, but the metadata is certainly open and available, and they use the Harvard Dataverse infrastructure, if you were wondering how they do that.
So it’s nearly time for me to wrap up this section. So let’s come back to metadata and ask ourselves, “What will take us to providing the metadata that will enable our research data to be findable?” In other words, how do we get metadata into the shiny status of metal data? At this point, I really have to thank the ARDC comms team for making these beautiful graphics for me. Look, puns are one of the main methods of learning for me personally, but back to the content of the actual course, sorry.
So in summary, to make data findable, we want to be using persistent identifiers. We want to be describing data with rich metadata, and we want to be ensuring that that metadata are indexed in a searchable resource, like some of the examples that I showed you, okay? That brings us to the end of the findable module. So now I’m going to give you a little heads up on the activities, quiz and community discussions, and then we can get into any questions that you may have.
So the findable activities are to do one or more of the following. Number one, read the article that started it all, or you could browse the GO FAIR website, which is quite helpful, actually. Number two, create and link to your ORCID profile. Number three, explore some repositories for research data, which will give you an opportunity to go a little deeper into some of the repositories that I shared with you today. There will be a link to the activity worksheet for you, right about now, in the chat window, I expect.
Moving over here, the quiz. Just a note on this quiz: it’s for testing the knowledge you’ve hopefully gained from our webinars today and on Monday. You will be required to do the quiz for each module to get the certificate at the end, but it will be open until the very end of the course, and you can do it as many times as you like. We’d love to know how you go, and how this format is working for you too. And a few notes about the community discussion.
So check your calendar invitation. If you haven’t received one, or you haven’t signed up, please take the opportunity to do that now, or Nichola will assign you to a group by the end of this week. We’re going to run these community discussions on Zoom, so I would encourage you to try for a hard cable or Ethernet connection to your router if you can, or perhaps use your mobile network, but I know sometimes our connections are patchy, and the NBN seems to do service updates all the time.
So if you have any issues, please contact Nichola about the community discussions. I’ll put that information in the chat, and you can stay in touch with us via the #FAIR101 hashtag on Twitter, or, of course, on our Slack community. Is that all from us?
Oh, and one little reminder for the feedback. Thank you so much for your comments after our first webinar. You’ll have another opportunity to give us some feedback about today’s webinar, and we do read these comments, so we appreciate what you say to us, which helps us make these activities, and this course all the better for next time. Now I believe it’s probably time for… Ah, Matthias, hello. Look at you appearing at the right moment.
Hello. Thank you very much for that, Liz. Two points I’d like to raise. One is, I’m quite jealous of your tie, especially given you’ve managed to color match it to the little A in the ARDC logo. I think I need to step up my sartorial game. And also, we have had appreciation of your puns, and all puns in general, so please keep them up.
Now, some of the questions that have been asked are sort of organizational in nature about the course. So somebody asked whether all the links will be sent in an email, so the link to the quiz, the link to the activities. So that will be emailed out as well for people who haven’t been able to get on Slack just yet, and if you are having troubles getting onto Slack, please get in touch with Nichola to help troubleshoot that issue.
Now, I accidentally posted the wrong link to SurveyMonkey in the GoToWebinar chat window. Please disregard that link, everybody will be emailed a personalized link that will let us know who it is that’s answering the questions. If you click on the link that I already posted in the GoToWebinar chat, the results will not be associated with you. I’m sorry.
Okay, now actual content questions. So I do encourage everybody to use the question window to post questions, and we have one question here asking if it would be possible to explain the difference between XML and JSON.
JSON? JavaScript Object Notation. Right. Okay. So perhaps I can offer an answer, and I hate doing this because I’m almost making it up: if you have data available in JSON, it enables machines, and any code or scripts, to do a little more with the data than XML does, but I will take that on notice and deliver a full report on the Slack channel by the end of the week.
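As a rough sketch of that difference, here is the same tiny, invented metadata record in both notations, read with Python’s standard-library parsers. JSON loads directly into native dictionaries and lists, while XML parses into an element tree you then navigate, though XML offers attributes, namespaces and schema validation in return:

```python
import json
import xml.etree.ElementTree as ET

# The same tiny (invented) metadata record in both notations.
as_json = '{"title": "Example dataset", "doi": "10.1234/example"}'
as_xml = ("<resource><title>Example dataset</title>"
          "<doi>10.1234/example</doi></resource>")

# JSON maps straight onto native data structures...
record = json.loads(as_json)
title_from_json = record["title"]

# ...whereas XML parses into a tree that you query element by element.
root = ET.fromstring(as_xml)
title_from_xml = root.find("title").text
```

Both parsers recover the same title, so the choice between the two is less about what can be expressed and more about which tooling and conventions your repository or service already uses.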
Okay. Now, we haven’t actually received any other questions about metadata and repositories, I’m afraid. I’ll wait a few seconds for any more to trickle in. Although somebody has suggested a URL, which I’ll work out a way to share, for a web page that explains the key differences between JSON and XML.
Okay, no more questions seem to be coming in, which is actually rather convenient because despite the slight delay to the start, we’ve actually managed to finish on time. You possibly could have gone down some more rabbit holes, Liz, but ah well, next time perhaps. All right, so I’ll hand back to you, Liz.
All right. I don’t have anything more to say. Thank you very much for bearing with us, again, thank you for your patience this afternoon, and I look forward to seeing you all on the Slack and at community discussions next week. Okay.