Presented: 11 May 2020
Presenter: Matthias Liffers
#1 in the 8 webinar series of the FAIR data 101 training webinars.
Good morning everyone. My name is Matthias Liffers, and I’d like to welcome you to this webinar. Before we start, I’d like to acknowledge the traditional owners of the lands on which we all are today. For me in Perth, that is the Whadjuk people of the Noongar Nation, and I’d like to pay my respects to elders past and present. Another important thing to remember is that the FAIR Data 101 course is governed by a Code of Conduct, and this Code of Conduct is really important to make sure that everyone has a fulfilling learning opportunity over the next eight weeks. There’s a link there for you to see the Code of Conduct, and if at any time during this course you observe a breach, please contact us using the form that is linked in the Code.
Okay. Welcome to FAIR Data 101. It’s been a bit of a rollercoaster putting this together, taking advantage of this time. Now I would like to introduce my colleague, Liz Stokes, Hey Liz? She will also be presenting the ecourse work for this course, but will be largely presenting from Wednesday, so today, unfortunately you have to put up with my voice for the next three quarters of an hour.
I’m just going to turn my video off, and to let you take the floor Matthias, but I’ll come back in at question time.
Okay, great. Thanks Liz. And it’s not just Liz and me who are bringing this course to you, there’s a number of other people at the ARDC, who have all been working very hard, and you’ll have an opportunity to meet them over the coming weeks.
Now, what are we going to cover today? First off, there’s a little bit of housekeeping about what to expect from the FAIR Data 101 course. I’ll then be giving a quick introduction or overview to the FAIR Guiding Principles, which the next eight weeks are going to be all about, as well as why the FAIR Principles came about, and then I will start talking a little bit about the first of the FAIR Guiding Principles, namely Findable, and Liz will continue the presentation about Findable on Wednesday. Housekeeping for the course: over the next eight weeks, there will be four modules, each module on one of the four aspects of FAIR, and each module will be over two weeks.
In the first week of each module, there’ll be two 45-minute webinars, and at the end of webinar two, we’ll also give out an activity sheet, which will hopefully keep you busy for around 30 minutes as you work your way through that, there is also a quiz, and you should be able to find the answers to all the quizzes in the webinars, in the activities, and maybe any readings that we give you.
Then in week two of each module, there will be a 50-minute community discussion, and there are a number of options for that, and we’ll be going through those options later. There’s also a Slack workspace available, and you can join that at tiny.cc/fair-101-slack. Now, if you are not familiar with Slack, Slack is a chat tool used by many workplaces around the world, not just workplaces, also projects, and simply groups of people who want to chat with each other, and a single Slack workspace has more than one channel available within it. So when you click on the Slack invitation link and you sign up, you’ll find yourself in Slack, and you’ll be automatically made a member of two channels, #general, and #introductions. Please give an introduction to yourself, in the introduction, say hey to everyone, and then generally, most of the conversation will probably be happening in the #general channel.
If you’re not so keen on keeping Slack open all of the time, which is perfectly okay, you can enable email notifications for Slack, and in fact, if you go to that URL there, once you’re logged into Slack that is, you’ll be able to change your email notification settings there. And all these slides will be available after this presentation, so you don’t necessarily have to furiously write down any URLs that you see today.
Okay, let’s get right into the material. The FAIR Guiding Principles were first proposed, I suppose four years ago today, not today, about four years ago, when a group of researchers and data stewards got together and published this paper in Scientific Data. It’s fully open access, so you can have a read of it. And they suggested that simply espousing good data management practices, so simply saying to researchers and research support professionals, “You need to manage your data well,” without providing any actual detail, wasn’t enough. The authors suggested that good data management could be broken down into a number of principles, with some clear suggestions as to how you could fulfill each of these principles.
Now, the Guiding Principles were an evolution of the open data movement, which has been around for longer than the FAIR Principles. But simply calling for open data all the time was a little bit problematic, which is one reason why these FAIR Principles came about. So the FAIR Principles are much more nuanced than calling for open data. So in the next module, you will learn more about the Accessible component of FAIR, which is more along the lines of as open as possible, but as closed as necessary, but making sure that data is available and accessible somehow, even if it’s not openly available. The FAIR Guiding Principles are, well, there’s quite a lot of words to them, but they are quite clear in suggesting the best practices for making data FAIR, and that is certainly a lot more useful to your average researcher than simply saying to them, “Make your data open.”
And also another thing that the FAIR Guiding Principles really talk about quite strongly is that data shouldn’t just be FAIR for humans, it should also be FAIR for machines, because as we find ourselves in an age when research is becoming more and more computationally intensive, it is important for machines or computers to be able to access data as cleanly and seamlessly as possible. So that the computers can just get on with their number crunching, without humans having to find the right data set, download it, upload it, change it to a different file format, what have you.
So what exactly are the four FAIR Guiding Principles? We have Findable, Accessible, Interoperable, and Reusable. So four principles, four modules in this course, and today and on Wednesday, we’ll get really deep into the Findable. The Principles, to paraphrase them a bit, or maybe expand on these words, firstly, Findable. Data used in research should be findable, somehow. You should be able to work out that it exists, and you should also be able to work out where that data might be. And there’s a number of ways in which you can, not necessarily ensure, but certainly you may take as many steps as possible, to make sure that your data is findable.
Secondly, data needs to be accessible. So once you’ve found where the data is, or even found that the data exists, you need to be able to access that data somehow, and it could be that the data is fully open, and if it’s the case downloading the data to your computer, and working on it that way, or it might be that there needs to be some kind of mediation, because it’s not always appropriate to make data fully open, so you might need to contact a human, to get permission to access that data. And the Accessible module, we’ll talk about that as well as different ways in which data can be accessed in technical terms.
Thirdly, data should be Interoperable. And the Interoperable is possibly one that, people can struggle with the most. What does Interoperable mean? What Interoperable is trying to achieve, is that you get two data sets that were collected, in hopefully a nice systematic way, and it is relatively straightforward for you to be able to combine them together, or use them together, or perhaps you have a data set that can be analyzed in one piece of software, but it can also be analyzed in another piece of software without much work, because that data has been recorded in a systematic, and standard way. We’ll get more into that in a few weeks. And finally, Reusable. Once you’ve found and you’ve accessed your data, or even once you’ve produced your own data, hopefully that data can be made reusable, to maximize on the investment made in collecting or creating that data. So it’s something to look forward to in about seven weeks time.
Now, I apologize for having a wall of text, but this is something I occasionally have arguments with people about. So the FAIR Guiding Principles, which many people often call the FAIR Data Principles, weren’t originally coined to be just about data. The authors, Wilkinson, et al, intended that these principles be applied to everything that leads to the data. The algorithm, the tools, the workflows, the software, or procedures, processes, and all scholarly digital research objects benefit from the application of these principles.
Now, in the four years since the authors came up with the Guiding Principles, it’s become apparent that some research objects need a little bit more… Well, it’s not quite as straightforward to work through the FAIR Principles to make them FAIR. In some cases, software for example, there needs to be a little more thinking, or a little bit of different thinking. Because sure, software and data are both digital objects, but the way that computers use them and interact with them, and even the way humans interact with them, is different. Therefore, the principles possibly need a little bit of a revision, and we’ll be talking about these kinds of things in future weeks as well. So that being said, a very big but, is that largely, we’ll be talking about FAIR data over this course, except when I occasionally let myself be distracted.
Okay. So, Findable, F. What does that mean? To expand on that a little bit, metadata and data, should be easy to find by both humans, and computers. Now the Findable has been broken down further, into four more specific guidelines, and I’ll be talking about one of those today, and then Liz will be covering the others on Wednesday. So here we go. F1, and this is possibly, I mean, it’s the first of the principles.
So arguably the most important of them all, and many other principles tie into this one, in that metadata are assigned globally unique, and persistent identifiers. Sorry, metadata and data, which brings me onto a quick note, when these principles, ope, sorry. This principle, metadata or data, when we’re talking about that, we’re saying that, or rather we suggest it is acceptable for data, and its associated metadata to have the same PID, persistent identifier.
So there’s no need to create one PID for the data set, and then another PID for the metadata, and we’ll talk more about how PIDs and metadata interact. So, I said a globally unique, and a persistent identifier. So what does that actually mean? Here’s an identifier. I’ve forgotten how many characters it is, 13 characters in hexadecimal. So each of those 13 characters has 16 different possibilities, and I did the maths a couple of weeks ago, and I think that’s about 4.5 million billion different possibilities. So looking at this, with any randomly generated 13 character hexadecimal, there’s a pretty good chance that it’s actually going to be unique, and it’s just going to take quite some time for the same randomly generated number to come up again. However, in and of itself, it’s not guaranteed to be globally unique, and so what do I mean by globally unique? It’s like a snowflake. A PID, a globally unique and persistent identifier, PID for short, does really absolutely have to be globally unique.
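Working the numbers through is straightforward. The following is a minimal sketch, assuming each of the 13 hexadecimal characters is drawn independently and uniformly at random; the `collision_probability` helper and the sample sizes are illustrative assumptions, not anything from the ARDC's systems. It shows why a random identifier is very likely, but never guaranteed, to be unique.

```python
import math

HEX_CHARS = 16   # possible values for each character
ID_LENGTH = 13   # characters in the identifier

# Size of the identifier space: 16 to the power of 13.
total_ids = HEX_CHARS ** ID_LENGTH

def collision_probability(n_ids: int, space: int = total_ids) -> float:
    """Approximate chance that at least two of n_ids random identifiers
    coincide (the standard birthday-problem approximation)."""
    exponent = n_ids * (n_ids - 1) / (2 * space)
    return -math.expm1(-exponent)  # 1 - exp(-x), accurate for small x

print(f"{total_ids:,} possible identifiers")
for n in (1_000, 1_000_000, 100_000_000):
    print(f"{n:>11,} random ids -> collision chance ~ {collision_probability(n):.2e}")
```

The space holds 4,503,599,627,370,496 values, roughly 4.5 × 10^15. A collision is very unlikely for any one institution, but the chance grows quickly with the number of identifiers issued worldwide, which is why randomness alone is not treated as a guarantee.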
And the reason for that is so that we can be guaranteed that it is absolutely unique to a particular data set, or other digital research object. It is incredibly important when you’re trying to tell a computer about a data set, because computers really, when it comes down to it, aren’t very smart. They are quite fast, and they can process information much faster than a human can, but when it comes to things like looking at context and making what you could call an educated guess, computers aren’t that great. Unless you’re talking about machine learning algorithms, which we aren’t, so let’s just stick with that. So we really, really want our identifiers to be globally unique. That is to say, in the entire world, a particular identifier can only be assigned to one data set.
So how can we do that? There is consensus, well, pretty good consensus, as to how we achieve that these days, and that is in the form of a DOI. Now, hopefully everybody is familiar with the concept of a Digital Object Identifier. They’ve been used for quite some time now when it comes to uniquely identifying research outputs like journal articles. And over the past, actually it’s probably been close to 10 years now, they’ve been applied to data sets as well. And in fact, the ARDC, and one of its precursor organizations, the Australian National Data Service, have been providing a DOI minting service for Australian research organizations. So this DOI, for example, is from that service, and it combines a few different elements together. You’ll recognize the number at the end there, because that was the identifier I posted before, but this time we’ve added a little bit more information there, to ensure that it is globally unique. So, first up, we have this 10, or 10-dot. And for people who are familiar with PIDs, or even DOIs, this 10.something will stand out as being a DOI.
Now, you might also sometimes hear about persistent identifiers called handles. For people who like getting technical, DOIs are a kind of handle, but they’re very specifically handles that have this 10. prefix. After this DOI handle, we have a few more digits, so 4225 says that that is a DOI generated by the ARDC. Don’t worry, this one’s not in the quiz. You don’t need to memorize this. The 06 means that it is a DOI from Curtin University. And then that number at the very end, you could consider to be a local identifier. So that is to say it is definitely unique within the local university context, because the system makes sure that if it does randomly generate the same number again, it won’t use it again, because it’s already been used. But by combining all of those elements together, we come up with this globally unique identifier.
Now I’ve just realized I’ve been laboring on about globally unique for quite some time, but we also need to talk about this idea of persistence. So the identifier needs to be persistent in the same way as some of the smells that come from my dog. They need to hang around for quite some time. Now that’s not necessarily forever, because forever is a very, very, very long time, and it is quite unlikely that the DOIs that we’re minting today are going to still exist in a billion years, or indeed whenever it is that the sun expands and consumes the earth. However, persistence means that the identifiers will continue to be available, and will continue to reference the same material, for the foreseeable future. And in fact, as long as is necessary, which requires a couple of elements in and of itself.
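The layering described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `split_doi` and `mint_local_id` are hypothetical, the local identifier is generated on the spot rather than taken from a real record, and the reading of `4225` and `06` follows the description in this talk (in general a DOI suffix is opaque and need not have any internal structure).

```python
import secrets
from typing import NamedTuple, Set

class DoiParts(NamedTuple):
    directory: str   # "10" marks this handle as a DOI
    registrant: str  # e.g. "4225"
    suffix: str      # everything after the first slash

def split_doi(doi: str) -> DoiParts:
    """Split a DOI such as 10.4225/06/<local id> into its parts."""
    prefix, slash, suffix = doi.partition("/")
    directory, dot, registrant = prefix.partition(".")
    if directory != "10" or not dot or not registrant or not slash or not suffix:
        raise ValueError(f"not a DOI: {doi!r}")
    return DoiParts(directory, registrant, suffix)

def mint_local_id(already_used: Set[str], length: int = 13) -> str:
    """Draw random hexadecimal ids until one is unused locally (hypothetical helper)."""
    while True:
        candidate = secrets.token_hex((length + 1) // 2)[:length]
        if candidate not in already_used:
            already_used.add(candidate)
            return candidate

used: Set[str] = set()
local_id = mint_local_id(used)            # unique within the institution
doi = f"10.4225/06/{local_id}"            # prefix and organisation code as in the talk
parts = split_doi(doi)
org_code, _, local_part = parts.suffix.partition("/")
print(parts.directory, parts.registrant, org_code, local_part)
```

The local check only has to guarantee uniqueness inside one institution; the prefix and organisation code in front of it are what make the combined string unique everywhere else.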
So to guarantee the persistence of an identifier, you need to make sure that the infrastructure, is there to handle that. But at the same time, you need to have some governance behind that infrastructure, or on top of that infrastructure, to make sure that the infrastructure is being managed and looked after. I’m not sure how many of you have had experiences with systems coming online, being made available, but then nobody had the resources, generally it does come down to resources, to maintain that infrastructure, and keep it going for as long as it’s useful. So we come up with websites that become outdated, or databases that stop working because the software has moved on, but that original database would require too much effort or more effort than is available, to update and bring into the future, or indeed, just to the current times.
So when it comes to selecting a PID for your data, or your research object, it’s good to find one where, you stand a good chance of it being both globally unique, but also persistent, so that the DOIs, or the handles or the ORCIDs that you create today, will continue to work for as long as you need them to work. Now, at the ARDC, we absolutely do recommend particular PIDs for particular kinds of research objects. So for example, when it comes to PIDs for humans, so a researcher, an author, somebody who contributes to research and generates research outputs, there are a few different identifiers available, but really when it comes down to it, there is one de facto standard, although you could possibly say now it’s more than de facto because many, many organizations around the world have adopted this as their chosen identifier.
So we have the ResearcherID and the Author ID, which are both owned and controlled by for-profit corporations, which, look, there’s nothing wrong with that. We all use infrastructure that is created by for-profit corporations, but the identifier that the ARDC really does recommend is the ORCID, the Open Researcher and Contributor ID, which is owned and controlled by a member-based organization. And the reason why that’s an advantage is that, say, a university can say, “All right, we’re going to ask all of our researchers to create an ORCID, but we’re also going to become a member of ORCID, to help make sure that that governance persists as long as the ORCIDs need to persist.” Who knows, maybe in 20, 50, 100 years’ time, there’ll be some better way of handling this kind of identification, but for now, for the foreseeable and workable future, any organization can join ORCID, and any organization can become a part of that governance that guarantees the persistence.
And here in Australia, there is an ORCID consortium, that is led by the Australian Access Federation, most Australian universities are a member, so I strongly recommend you go and look at the AAF’s website, and learn more about that ORCID consortium.
So, one PID recommended out of a bunch of PIDs available, just for one kind of thing involved in research, I think humans. Humans aren’t things. So one of the components of research, there are lots of different options of PIDs, but when it comes down to it, there’s this particular one that we recommend. And the same goes for lots of other different kinds of components of research. So for people we’ve got ORCIDs, but then we also like to identify, or we can, and possibly should identify projects, digital objects, physical objects like samples, equipment, there’s lots of different components of research. And so FAIR Data 101, why are we now talking about PIDs for all of these other different things? A PID helps you unambiguously identify a data set, but it’s also probably quite useful to unambiguously identify, the humans who were involved in creating that data set.
Now, I was thinking of putting my own record up, but I discovered I’m the only Matthias like this in the world, so I’m possibly a poor example of that. But by attaching an ORCID to researchers, you can unambiguously identify that person. But you can also create a machine readable link between a data set and a human, and between humans who have worked on the same data set, or the same paper, and that machine readability is really useful. Because if you give a machine a name, it’s really hard for it to understand. Now we’ve also got identifiers for projects, so we can make sure that particular projects are linked with the data and the publications and the humans. And that’s actually quite useful for the management of research as well. So, the Australian Research Council, which funds most of the research in this country, can have identifiers on projects, and know what’s coming out of that project that has been made FAIR and available.
And then data sets, physical objects, equipment, I could go on, but I won’t, because we only have limited time today. So what are some of the PIDs that we do recommend and are available? I’ve already spoken through that one. For people, we strongly recommend everyone has an ORCID, and in fact that is one of the activities for this module. If you don’t already have an ORCID, please go and create one. It takes only a couple of minutes, and it’s completely free of charge. For projects, there is the RAiD, the Research Activity Identifier. The ARDC was instrumental in getting this one off the ground and in fact it still is. So go to raid.org.au to learn more about those. For digital objects, like data sets, like software, we recommend the good old DOI, and the ARDC has a minting service for that.
But then when you’re looking at physical objects, like samples, there is the IGSN, which is unfortunately named International Geo Sample Number. It arose when some geologists wanted to be able to unambiguously identify rock samples, or mineral samples, or geological samples, should I say? But it has since begun to branch out into other domains as well. The IGSN is built on particularly stable infrastructure with good governance, which is why many other disciplines are looking into using it too. And then when we’re looking at an emerging area, that is identifying equipment uniquely, nobody has come up with a standard yet. That is to say, there’s no single recommended identifier. Some organizations use handles, some organizations use DOIs, but then if you’re interested in that kind of thing, there is a group, The Identifiers for Instruments in Australia Group, and you’re welcome to join them, and attend their meetings and chat about identifiers for instruments.
Now I thought I’d show you a couple of examples of, strangely, non-data identifiers, although this does link through to the data. This for example is an IGSN on a sample that’s from a project I worked on a few years ago at Curtin University. And you can see the QR Code on the sample there, and if you’re quick, you can pull out your phone, and you can scan that QR Code and, touch wood, your phone will bring up that metadata page on the right.
So what that IGSN there is enabling, by way of this QR Code, is being able to pull out a physical sample, and this physical sample has been prepared for use in an instrument, which is why it’s embedded in epoxy resin. You can scan that with your phone, or whatever QR code reader you have, and have the metadata record that gives you all of that rich metadata, which Liz will be talking more about on Wednesday, to let you know exactly what that sample is, where it came from, who was involved, and also which data sets were collected by the analysis of that sample, which is really quite exciting.
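To make the point about machine readable links concrete, here is a small sketch in Python. Everything in it is assumed for illustration: the records, the DOIs under the `10.5555` prefix, and the ORCID iDs are made-up placeholders, and `datasets_by_orcid` and `collaborators` are hypothetical helper names. It shows that matching on an identifier, rather than on a name string, keeps "J. Smith" and "Jane Smith" together and keeps a different "J. Smith" apart.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Placeholder metadata records: names vary, identifiers do not.
records: List[dict] = [
    {"doi": "10.5555/example-a", "contributors": [
        {"name": "J. Smith", "orcid": "0000-0000-0000-0001"},
        {"name": "A. Jones", "orcid": "0000-0000-0000-0002"}]},
    {"doi": "10.5555/example-b", "contributors": [
        {"name": "Jane Smith", "orcid": "0000-0000-0000-0001"}]},
    {"doi": "10.5555/example-c", "contributors": [
        {"name": "J. Smith", "orcid": "0000-0000-0000-0003"}]},  # a different person
]

def datasets_by_orcid(recs: List[dict]) -> Dict[str, Set[str]]:
    """Index data set DOIs by the ORCID iD of each contributor."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for rec in recs:
        for person in rec["contributors"]:
            index[person["orcid"]].add(rec["doi"])
    return dict(index)

def collaborators(recs: List[dict], orcid: str) -> Set[str]:
    """ORCID iDs of everyone who shares at least one data set with `orcid`."""
    found: Set[str] = set()
    for rec in recs:
        ids = {p["orcid"] for p in rec["contributors"]}
        if orcid in ids:
            found |= ids - {orcid}
    return found

index = datasets_by_orcid(records)
print(sorted(index["0000-0000-0000-0001"]))
print(sorted(collaborators(records, "0000-0000-0000-0001")))
```

A lookup by name would have merged the two people called "J. Smith" and split "J. Smith" from "Jane Smith"; the identifier removes that guesswork for the machine.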
Linking everything together, makes it much easier to find things. Otherwise, what you’d be faced with, is a drawer full of these rounds, as they call them, with some kind of random number scratched on them, with a compass or a sharp tip of some kind, and that would then have to be matched up with somebody’s spreadsheet somewhere, which could make reusing, samples a real nightmare. And when you think about how much it costs to send a human out into the field, to collect samples, and bring those samples back, and process them and create, rounds, or mounts, or something that can actually be analyzed, this would really help, in cutting down duplication of effort.
Now, another favorite example of mine, some work undertaken by The University of Western Australia, and that is around uniquely identifying equipment in the Centre for Microscopy, Characterisation & Analysis. So the CMCA is a facility full of all sorts of different instruments, largely used in the life sciences, but lots of imaging instruments, or analytical instruments. And what we have here is the Bruker Avance III HD NMR spectrometer. I have to admit, I don’t actually entirely know what that is or what that does. But we have two different spectrometers here. One that operates in the 600 MHz frequency, and one that operates in the 500 MHz frequency. And it could be quite easy to get those mixed up, because there’s only one character difference between the two, and that could escape a casual inspection.
However, UWA has minted handles, for all of the instruments in the CMCA, and therefore provided those with unique identifiers. And you can see, if you were to go to the UWA Research Repository, you could find those instruments, and their metadata records there, as well as information on how to use those instruments. And the handles are there under the contact information in links, you’ll be able to grab the handle from there.
Now I’m moving on because I am now running into question time. What’s next? So, after this webinar is finished, my colleague Nicola, who you should have already been emailed by last week, will be sending out a link to sign up for the community discussions. Now, these community discussions are really quite important, a very important part of the FAIR Data 101 course, because it lets you connect with your colleagues, also during this course, and discuss the material. Either the webinars, you can tell them how fine my beard is looking today, or maybe actually talking about something important. And there are three different time slots available, and I’ve given those in Australian Western Standard Time, Central Standard and Eastern Standard Time. You’ll be given the link, but you can sign up for one of those time slots, and you’ll be attending the same time slot every fortnight.
So that’s four community discussions at the same time on the same day, each fortnight. There are limited numbers available, so if you want to be sure to get the time slot that you want, then make sure you open that email message sooner rather than later. And then also on Wednesday, my colleague Liz will be presenting on part two of Findability, talking more about metadata. And that is set already for Wednesday. And Liz will also talk a little bit more about the activities that we’ll be sharing with you, and the quiz. So that’s it for me. It is now question time. Okay, Liz…
How are we going with the questions?
Pretty good. We have got a couple of ones at the moment that I’d like to draw your attention to. Francis asked a question about minting DOIs. Does it matter where the DOI is minted? For example, by your Uni or by Zenodo? What choices, or options do people have?
To be honest, when it comes to picking a DOI minting agency, it actually doesn’t really matter. It’s largely up to your organization, and which service they would prefer to access. For example, the ARDC does provide, free of charge, a DOI minting service, and that DOI minting service works through an international organization called DataCite. And by going through DataCite, which has good governance, good infrastructure, we make sure that the DOIs persist that way. However, some organizations might have figshare for institutions, and figshare has its own method of generating DOIs, which also coincidentally happens to be through DataCite, but they are not using ARDC infrastructure at all, and that is absolutely okay. Without getting too deep into how DOIs work, all DOIs are registered centrally by some central DOI infrastructure, and they are so… it’s probably bad to say too big to fail, but so thoroughly important, they’re all going to keep on going.
And even if the agency that you used to mint a DOI stops creating DOIs, who knows, maybe figshare will decide to go with a different kind of identifier, unlikely but maybe, existing DOIs will continue to work because of that central registry that all DOI minting agencies go through.
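One practical consequence of that central registry is that a DOI is turned into a link the same way regardless of who minted it. A minimal sketch in Python, assuming the public resolver at `https://doi.org/`; the `doi_to_url` helper and the sample suffix are hypothetical, and no network request is made.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"  # central resolver; the minting agency plays no part here

def doi_to_url(doi: str) -> str:
    """Turn a DOI, with or without a common leading form, into a resolver URL."""
    doi = doi.strip()
    for lead in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.lower().startswith(lead):
            doi = doi[len(lead):]
            break
    # Percent-encode anything that is not safe in a URL path, keeping the slashes.
    return DOI_RESOLVER + quote(doi, safe="/")

# The suffix below is a made-up placeholder, not a real record.
print(doi_to_url("doi:10.4225/06/0123456789abc"))
```

Because the URL is built from the DOI alone, a citation keeps working even if the organization that minted it later changes platform or stops minting.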
Great, thanks Matthias. There is a question, asked by Jenny about something being based in WA, and I believe it might be the IGSN, or it might’ve been the example you were talking about. But I did see another question and this may actually be an answer from Rebecca that, whatever it is you were talking about does indeed work, and it is based in WA. So there’s my surmising.
Yes it works. I think that would be the IGSN, QR Code, which I said, I wasn’t sure it would work. But it does which is good, which means that the infrastructure and the governance is working.
Awesome. Ah, here’s another question from Jean. Is there any guidance on whether to mint DOIs at the collection level, at the item level, or both? Does it depend on the data set and how it’s likely to be used?
Oh, well maybe you can use community discussion to have this kind of philosophical discussion, but even within the ARDC, we still discuss the idea of the data set versus a data collection. What does that mean? How do you treat them? I mean, you can almost consider it to be kind of like an encyclopedia. So you have an encyclopedia, a multi-volume encyclopedia, remember those? Do you catalog the encyclopedia as one thing, or do you catalog the individual volume? Or do you catalog an individual entry within the encyclopedia? How do you treat that? And it really comes down to what you think is the most manageable, and what makes the most sense. Is it likely that other people trying to reuse your data, or cite your data, would cite one particular file for example, or are they more likely to cite the entire collection as a whole?
Based on my personal experience, so data projects I’ve worked on in the past, we have created metadata records and minted DOIs for an entire collection, but then also for each data set within that collection, we’ve created yet another metadata record, and yet another DOI. And we’ve semantically linked them, so that they’re all related, so that you can use that collection record, and you see all the data sets within the collection, and you look at the data set, and you see all of the other data sets that are within the same collection.
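The linking described here can be sketched as a pair of reciprocal relations. The following Python is an illustration only: the record layout, the `link_collection` and `siblings` helpers, and the DOIs are all made up, and the relation names `HasPart` and `IsPartOf` are borrowed from the DataCite metadata schema rather than taken from the projects mentioned.

```python
from typing import Dict, List, Tuple

def link_collection(collection_doi: str, dataset_dois: List[str]) -> Tuple[dict, List[dict]]:
    """Build one collection record and one record per data set, linked both ways."""
    collection = {
        "doi": collection_doi,
        "related": [{"relation": "HasPart", "doi": d} for d in dataset_dois],
    }
    datasets = [
        {"doi": d, "related": [{"relation": "IsPartOf", "doi": collection_doi}]}
        for d in dataset_dois
    ]
    return collection, datasets

def siblings(dataset: dict, collections: Dict[str, dict]) -> List[str]:
    """Other data sets in the same collection(s), found by following the links."""
    found: List[str] = []
    for rel in dataset["related"]:
        if rel["relation"] == "IsPartOf":
            parent = collections[rel["doi"]]
            found += [r["doi"] for r in parent["related"]
                      if r["relation"] == "HasPart" and r["doi"] != dataset["doi"]]
    return found

# Placeholder DOIs for illustration.
collection, datasets = link_collection(
    "10.5555/collection", ["10.5555/set-1", "10.5555/set-2", "10.5555/set-3"])
print(siblings(datasets[0], {collection["doi"]: collection}))
```

With both directions recorded, a reader landing on either the collection or a single data set can reach everything else, and someone citing the work can choose whichever level fits.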
Nice one. I guess you could also add that as DOIs are often versioned by their minting apparatus, or infrastructure, that that might weigh into your decision about at what level you put the identifier at. I’ve got another question Matthias, I think this is about the IGSN, people asking if you know if they use PPMS software, to manage their research instruments.
That’s possibly more about the instrument handles from CMCA, I’ll have to take that question on notice, unfortunately I do not know, but I can find out, and we should have a record of this, or rather, we will have a record of all of these questions, so we will be able to follow up, without any further action by the person who asked it.
Okay. There’s another question here from Susinkanal: are exceptions to Findable records managed through individual institutions?
Exceptions to Findable records. Well, here we go into another, I mean, before we even talk about managing exceptions to Findability, I would ask why would you not want something to be Findable? Now, we’re not suggesting that everything that is Findable is also immediately open and available and downloadable. So that’s the first decision, but then it really is up to the institutions, or whoever is undertaking the research, to decide whether their objects are made Findable. However, they might run into policies set by funders, by publishers, or even by their own organizations. I haven’t quite answered the question there, but I think this could make an excellent discussion topic in the Slack workspace.
Matthias, I’m just noticing the time, and I believe we have reached the limit of our webinar today. So I am going to bow out, and hand back over to you, and just put a little message out to say, thanks everyone who has managed to join the Slack today, while Matthias been speaking. It’s really great to see some introductions there, so keep them coming, and over to you Matthias.
Great. Thanks for that, Liz. Possibly I haven’t been able to answer every question. I’m sorry if I wasn’t able to make it to yours, or if I didn’t know the answer to your question. As I said, we’ve got the Slack, feel free to post your question in the general channel. Otherwise, we have a record of all the questions, and hopefully we’ll be able to address those sooner rather than later. I will sign off now, I’m getting a little peckish, even though it’s not quite lunchtime for me, and I look forward to seeing you all on Slack, at the webinar on Wednesday with Liz, and/or at a community discussion next week. Thank you very much.