Presented: 9 June 2020
Presenter: Matthias Liffers
Webinar #5 of 8 in the FAIR Data 101 training webinar series.
Good morning everybody, welcome back to the FAIR Data 101 webinar series. I would like to begin by acknowledging the traditional owners of the land on which we all are today; for me in Perth, that is the Noongar Wajuk people. I'd like to pay my respects to their elders past and present. Thank you for coming back on our irregular day. I hope everybody who had one enjoyed their public holiday yesterday, and that you're all bright-eyed and bushy-tailed to learn a bit about interoperability.
So this is the first of two webinars on interoperability; the second one will be delivered by Liz Stokes tomorrow at the normal time. A quick reminder: this course is governed by a code of conduct, and if you observe any breach of the code of conduct, could you please let the ARDC know via the link in the code of conduct itself.
And so today, as I said, I'll be talking about interoperability. Interoperability in terms of FAIR is a bit tricky to talk about, and I will get into why that is over the course of this talk. This talk will not be as technical as mine was two weeks ago, but it should give you a good overview of what the I1 through I3 guiding principles mean, and how we can try to address them to the best of our ability. So let's get right into it.
So my question is: what's the most expensive mistake you've ever made? In the late '90s, a pretty big mistake was made by NASA and one of its subcontractors. This here is the Mars Climate Orbiter, an ill-fated spacecraft that was sent to Mars and was intended to relay messages from various other instruments on and around Mars back to Earth. When the Mars Climate Orbiter got to Mars, it was meant to perform a series of braking maneuvers.
So this is a pretty standard thing for spacecraft that are trying to orbit Mars: they go around it in big loops that get smaller and smaller, and the idea is that the spacecraft will pass through the very upper atmosphere and use that to slow itself down. It tries to get from Earth to Mars as quickly as possible, but that's too fast to maintain a stable, close orbit around Mars, so it really needs to slow down to bring itself closer. Now, the problem, what happened, is that NASA lost contact with the Climate Orbiter.
Nobody knew right away what was going on, and then eventually they worked out what the problem was. So the issue was between two components of the data processing software that was used to control the Climate Orbiter. One piece of software was written by NASA, and the other piece of software was written by a subcontractor, and they were meant to exchange data, or rather, the subcontractor’s component was meant to send data through to NASA’s component.
NASA's component was expecting to receive measurements in metric units (thruster impulse data in newton-seconds), but the subcontractor's component sent it through in imperial units, pound-force seconds, instead. So the subcontractor and NASA both made assumptions about what kind of data was being transferred, and the software they built made those assumptions too. This is really what kicks off the first of the interoperability guiding principles: metadata and data should use a formal, accessible, shared, and broadly applicable language for knowledge representation.
Now, this statement could be interpreted in lots of different ways if you really look closely at it; there are a lot of words in there. But when it gets down to the nitty-gritty, what the original authors of the paper, and this principle, are after is to try and address issues like this. So what we have here is a table of observations: we have a date, and we have a temp, and under temp is a series of numbers.
Now as a human, I look at that data, and I go, “Okay, we’ve got a series of dates, and I’m going to assume that’s temperature.” So there’s the first assumption, that that is temperature. The second thing … my second assumption is that that date is written for me in Australia as day, month, year. So what we have is a series of observations made from the first to the sixth of January in 2020, then for the temperature I actually have no idea, there are three different possible units of measurement that I certainly know of with temperature, we’ve got Celsius, we’ve got Fahrenheit, and we’ve got Kelvin.
And they would be wildly different values depending on the unit: if all of those are in Kelvin, then those temperatures are quite low, and if they're in Celsius, they're really high. So what would be really, really nice is if the data told us what it contains. It's using a particular schema here to record values and present them to us, but we don't know what that schema is. So we can improve this dataset a little bit and give a bit more context.
So now we know with that information, the dataset is telling us, okay, the date is … my assumption was correct, that the date is from the first to the sixth of January 2020, because of course it could have been recorded by somebody in the United States, in which case it would have been the first of January, the first of February, first of March and so on, and the temperature there, we now know is in Kelvin, so they’re cold, they’re low temperatures, below freezing point, not high temperatures well above boiling point.
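The difference between assuming and determining can be sketched in code. Here is a minimal, hypothetical illustration: the field names (`schema`, `date_format`, `temp_unit`) are made up for this sketch and not taken from any real standard, but the idea is the one Matthias describes, a dataset that carries its own description.

```python
import json
from datetime import datetime

# A dataset that carries its own schema, so a program can determine the
# date format and temperature unit instead of assuming them.
dataset = json.loads("""
{
  "schema": {"date_format": "%d/%m/%Y", "temp_unit": "kelvin"},
  "observations": [
    {"date": "01/01/2020", "temp": 253.0},
    {"date": "02/01/2020", "temp": 255.5}
  ]
}
""")

fmt = dataset["schema"]["date_format"]
unit = dataset["schema"]["temp_unit"]
for obs in dataset["observations"]:
    # The date format is read from the data itself: determined, not assumed.
    when = datetime.strptime(obs["date"], fmt)
    print(when.date().isoformat(), obs["temp"], unit)
```

A program reading this file no longer has to guess whether 01/01/2020 is day-first or month-first, or whether 253.0 is Kelvin or Celsius.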
And the reason why this is important, especially from a machine-readable point of view, is that when it comes down to it, computers are not very smart. Even as a human looking at that dataset for the first time, I was making big assumptions about what the data contained, and I had to be sure that my assumptions were correct if I wanted to process that data and use it effectively. Computers aren't any different, except they can't actually assume; they require a human to make assumptions on their behalf and program them to expect certain inputs.
And if a computer gets inputs that it doesn't expect, and it hasn't been taught how to handle those incorrect inputs gracefully, the computer crashes or stops working in some way, or, possibly a little bit worse, it keeps working, but in really unintended ways. So what can we do about this? Sorry, before we get to what we can do about it, here is a quote from the original FAIR guiding principles paper on why we really want data to be able to communicate about itself: every time a new data type, or a new way of recording data, is created, somebody needs to write a piece of software that can interpret that data. That's a parser, something that parses the data and makes it available to a program.
And any time somebody creates a new data format, we need a new parser, and that parser will often only work with one language, because it was written in a particular language, say Python. If somebody then wants to use that new data type in R or C++ or Fortran, they need to write a new parser for that as well. There are literally hundreds of programming languages, if not thousands, and anybody who uses one of those languages and wants to use this particular data type would need a parser available to them. When it comes down to it, given limited time and limited human resources, it's simply not sustainable to keep writing new parsers for new kinds or formats of data.
So what would be really nice is if data used a standard way to describe itself to whoever was trying to access it, or to the piece of software trying to access it. You could say that there's some speed dataing going on; sorry, I apologize for that pun, Liz egged me on, I normally don't like puns at all, so I'll put the blame firmly on her. What we really want to do is determine the attributes of data, rather than assume them, because if we determine those attributes, we then know what valid inputs are. We know, for example, that temperature in Kelvin can never go below zero Kelvin, and that in Celsius, temperature can never go below -273.15 degrees.
So if a dataset told a piece of software, "Hey, this data field is temperature, I've measured it in Kelvin, and by the way, one of my values is -2," the software will know that that piece of data is invalid, and it will be able to handle it gracefully rather than falling over and creating unexpected outputs. Now, unfortunately there aren't actually too many self-describing data formats in common use around the world, especially for science. What I've got here are two examples. First, there's DDI, from the Data Documentation Initiative, which is an organization that has a couple of different formats it's working on.
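That validity check is simple to express once the unit is known. A minimal sketch, with illustrative unit names and a hypothetical helper function, not taken from any standard library:

```python
# Once software has been told the unit, it can check that values are
# physically possible instead of silently producing garbage.
ABSOLUTE_ZERO = {"kelvin": 0.0, "celsius": -273.15}

def is_valid_temperature(value, unit):
    """Return True if the reading is physically possible in the given unit."""
    if unit not in ABSOLUTE_ZERO:
        raise ValueError(f"unknown unit: {unit}")
    return value >= ABSOLUTE_ZERO[unit]

print(is_valid_temperature(253.0, "kelvin"))  # a cold but valid reading
print(is_valid_temperature(-2.0, "kelvin"))   # below absolute zero: invalid
```

With only the bare number -2, no such check is possible; with the unit declared, the bad reading is caught instead of propagating.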
And there's also netCDF, the Network Common Data Format. So first up: the Data Documentation Initiative has a data standard called DDIC, the Data Documentation Initiative Codebook, and what DDIC does is allow somebody to describe a data collection instrument for the social sciences, which is generally a survey. For those who aren't very familiar with surveys, a good survey will have a code book created for it, and what the code book describes is each question in the survey and what the possible answer values for that question are. Is it multiple choice? Is it a Likert scale where people indicate, from one to five, how pleased they were with the service they received today?
Or is it something more open-ended? A fully developed code book will possibly also record the actual responses to the survey. These code books were originally human-readable, so you could grab a document that has all the questions, all the answers that were given, the possible responses, how many questions were asked, and things like that. What DDIC lets you do is record all of that in an XML-based language, which means you can create a code book in a standard file format using XML. By describing the data it contains, the survey responses, but also the questions and the valid answers to those questions, any piece of software that understands DDIC can pull your code book in, understand all the questions, know all the valid responses and data types, and you can work with it that way.
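To make that concrete, here is a heavily simplified, invented code book in the spirit of what Matthias describes. The element names (`codebook`, `question`, `scale`) are made up for this sketch; real DDI Codebook uses its own elements and namespaces. The point is only that software can read the valid responses out of the file itself:

```python
import xml.etree.ElementTree as ET

# An invented, simplified XML code book: one Likert-scale question with
# its valid response range declared in the data.
codebook = ET.fromstring("""
<codebook>
  <question id="q1">
    <text>How pleased were you with the service you received today?</text>
    <scale type="likert" min="1" max="5"/>
  </question>
</codebook>
""")

for question in codebook.findall("question"):
    scale = question.find("scale")
    # The valid answers are determined from the code book, not assumed.
    valid_answers = list(range(int(scale.get("min")), int(scale.get("max")) + 1))
    print(question.get("id"), valid_answers)
```

Any program that understands this (invented) format now knows that a response of 7 to question q1 is invalid, without a human telling it so.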
And so that standard's actually been around for a very long time; it was first developed in the late '90s, I believe. Now, that's not the only thing that DDI does. DDI also has a different standard, DDI Lifecycle, and this is, certainly from my point of view, a very complex metadata standard that lets you track and describe research data throughout its entire lifecycle. It's something I definitely need to learn more about, but that's out of scope for this webinar.
We then also have the Network Common Data Format. Like DDIC, it's been around for a very long time, since around 1990, and what the Network Common Data Format does is let you create a self-describing dataset that contains geospatial data. Now, netCDF first came out of atmospheric research, but these days it's also being used in other disciplines that need to store geospatial data and like to use self-describing datasets. For example, the AODN, the Australian Ocean Data Network, is a fantastic source of data about Australian oceans: all sorts of readings, salinity, temperature, anything you can think of that's been collected by instruments, whether they stay still, or are ocean gliders moving around, or are on boats.
So it collects and presents data from the Integrated Marine Observing System, IMOS. Now, the AODN makes all the data stored in it available as netCDF, and on top of that, they also have a bunch of web services. This is what I was going on about in my last webinar, on Accessible: AODN has made their data available through a series of different web services, different interfaces and communications protocols, and you can use any one of these, or several of them, to get data as netCDF. Then your data is not only machine accessible, but also machine interoperable.
So they're doing some really good work when it comes to having FAIR data. All right, now for the problems with self-describing data, or certainly some of the problems that I perceive; you might disagree, and feel free to do so in the questions, during our discussions next week, or in the Slack as well. These data formats, certainly DDIC and netCDF, have been around for a very long time, so they're old, but at the same time, researchers in domains that are not geospatially based, not atmospheric, and not in the social sciences struggle with this idea of interoperability.
We have these examples that are great for a couple of disciplines, but they’re not necessarily that useful for other disciplines in the state they are now. Now I’m sure work is being done to make these standards more viable for other disciplines, but because the concept of self-describing data is so new to so many people, trying to understand these very old formats with their very long legacy, can take a lot of brainpower.
Also, these self-describing datasets are not as straightforward as a spreadsheet, and Excel remains, to this day, the most popular data processing and analysis tool used in the world, and it is unlikely to be unseated any time soon. By default, these data standards aren't necessarily usable in Excel, although why you would want to edit a survey code book in Excel I don't understand. There are tools available that let you use both DDIC and netCDF, but they might be slightly harder to access, or slightly harder to learn, or take more time to learn than Excel.
All right, moving on. So, principle two: data and metadata use vocabularies that follow the FAIR principles. This one I actually think is relatively straightforward, at least to me, but I have the privilege of coming from a librarian background, so this idea of vocabularies isn't very new to me. For those who would still like to learn a little more about vocabularies, I'll throw up an example. Okay, here's some more fake data. We have some dates, okay, great; this time we know that it's the first to the sixth of January 2020.
I've got a species written down, magpie, okay, great, and then a number: one, two, three, two, one, two. My assumption, here we go again, assuming rather than determining, maybe this data could be a little more interoperable, is that somebody sat down on a series of days and counted the number of magpies they saw. Fantastic. Now, when I say the word magpie, what comes into your head? Is it the bird on the left, a European magpie? Or is it the bird on the right, an Australian magpie? The European magpie is a species of corvid, so quite closely related to crows and ravens.
So if you’re in Australia and you saw one of those in the wild, might be cause for alarm, how did that escape from wherever it might be? And conversely, if you’re in Europe and you suddenly heard the dulcet tones of an Australian magpie warbling its call, or possibly hearing the sound of a magpie trying to swoop you, again, a cause for concern, how did that Australian magpie get to Europe?
Australian magpies are actually not very closely related to crows and other corvids; they're songbirds, passerines. So the magpie on the left, the European magpie, is more closely related to our Australian crows than the Australian magpie is. Okay, so thankfully there is a well-established method for uniquely identifying species. There we go, we've swapped out the common name, magpie, for the species name; I'm probably going to make a hash of this, Gymnorhina tibicen, and that is the species of the Australian magpie.
So by swapping from the colloquial name of magpie to the established vocabulary of the binomial classification system for naming species, we know without a doubt that we're talking about Australian magpies here. Unfortunately there are nine subspecies of Australian magpie, but I will not get into that; I knew there would probably be some birdwatchers in here saying, "Hey, there's more than one subspecies and it does matter which one," so here we are, I'm acknowledging that they do exist. Sorry, I thought [inaudible 00:21:49] jump off onto that. So the idea is, wherever possible, rather than coming up with your own vocabularies to describe things, say you're describing colors of something, rather than defining your own series of colors for people to choose from, consider finding a pre-established list of colors that others might also be using.
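The swap from free-text common names to a controlled vocabulary can be sketched as a simple lookup. The mapping below is illustrative only; a real project would use an established taxonomic vocabulary or service rather than a hand-rolled dictionary:

```python
# A tiny controlled vocabulary mapping ambiguous common names to
# unambiguous binomial species names. Illustrative, not authoritative.
VOCAB = {
    "australian magpie": "Gymnorhina tibicen",
    "european magpie": "Pica pica",
}

def to_species_name(common_name):
    """Replace an ambiguous common name with a binomial species name."""
    key = common_name.strip().lower()
    if key not in VOCAB:
        raise ValueError(f"'{common_name}' is not in the controlled vocabulary")
    return VOCAB[key]

print(to_species_name("Australian magpie"))  # Gymnorhina tibicen
```

Note that a bare "magpie" is rejected: the vocabulary forces the data producer to say which magpie, which is exactly the disambiguation the principle is after.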
The real benefit of that is that it lets you compare similar data with similar data more easily. So if I was trying to compare observations of Australian magpies and I found the earlier dataset that just said magpie, I wouldn't know whether it would be useful to me, but I do know that this particular dataset, with the species name, would be useful. Liz will be going a bit deeper into vocabularies tomorrow, including some places to find vocabularies. Okay, principle three, sorry for skipping ahead a bit too quickly: metadata include qualified references to other metadata. Now this, like the first interoperability principle, is a situation where I think the principle is quite aspirational, but in terms of technology, understanding, and culture change, we're probably not quite there yet.
So what this guiding principle is trying to ask us to do is to make all data linked data. Okay, so here comes a very quick primer on linked data. The idea is that in linked data, everything is described as a triple, and that triple consists of three pieces, hence triple: there is a subject, there is a predicate, and there is an object. If you're a big fan of grammar, you might already know what I mean here; I think I had to look up predicate myself. Here's an example of subject, predicate, object: Matthias is employed by the ARDC. So I am the subject, the predicate defines my relationship to the object, which is the ARDC.
You can also flip that around and you could say ARDC as the subject, employs as the predicate, Matthias as the object. I’m fine being an object, yeah, it’s something you have to live with occasionally, and then we can also say the ARDC employs Liz, and that back reference Liz is employed by the ARDC could also be made, and then Liz and I could also have a relationship, Matthias works with Liz, or, Liz works with Matthias. So the idea of linked data is that everything has a link to something else. Now, not everything has a link with everything, that’s a bit meta, everything has a link to something else, even if it’s only one thing, there are very few things that exist in total isolation except those of us who have been working from home through the coronavirus.
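The triples from this primer can be written down directly as data. This is a minimal sketch using plain tuples and the predicate names from the talk (real linked data would use RDF and identifiers, which comes up next):

```python
# Linked data as subject-predicate-object triples, stored as tuples.
triples = [
    ("Matthias", "isEmployedBy", "ARDC"),
    ("ARDC", "employs", "Liz"),
    ("Matthias", "worksWith", "Liz"),
]

def objects_of(subject, predicate):
    """All objects linked from a subject via a given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects_of("ARDC", "employs"))        # ['Liz']
print(objects_of("Matthias", "worksWith"))  # ['Liz']
```

Even this toy version shows the key property: every statement links one thing to another, so the data forms a graph you can traverse rather than a flat table.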
So, Liz works with Matthias, as I said. How can we do this with our data? To really fully follow principle I3, we would turn every single observation into a triple of some kind. What you can see here in this table is actually several observations at once: on the first day we recorded one observation of the Australian magpie, and on the second day we recorded two observations of the Australian magpie, so those are actually separate observations. It might be a little bit better if we had a time as well, but that can't be helped with this example. So what we would like to do is say something like, "An observation is of a bird," so this observation is of the Australian magpie, this genus and species.
Now, what linked data really, really, really wants is for things to be linked via their PIDs. So back in my description of Matthias works with Liz, we would really like to use PIDs for Liz and me, so we could possibly use our [inaudible 00:26:52] so that my [inaudible 00:26:54] works with Liz, and we make that assertion. Now, this is unfortunately where things start getting a bit tricky for linked data, and that is: what is the PID of a bird? We've got our Australian magpie, or all our subspecies; how do we find the PID of that bird? Is there a PID for that bird?
Maybe somebody's put a DOI on it. So I tried to find some PIDs for the Australian magpie, and on that very trustworthy and high-quality source, Wikipedia, I found this list of PIDs at the very bottom of the page about Australian magpies. What we have here are many different possible PIDs for the Australian magpie, many of which I've never heard of before. I do know of Wikidata; Wikidata's a great project, out of scope for this talk, sorry, but have a look into it. There's also the Wikispecies database, where a community is building a database about species and giving them PIDs of some kind. But unfortunately we have several other initiatives that are also trying to assign identifiers to birds, and so you might ask, "All right then, we've got all of these identifiers, why don't we try to build some kind of new identifier that unites all of them?"
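Upgrading the earlier toy triples to use identifiers looks like this. The URIs below are deliberate placeholders (example.org), not real PIDs for the Australian magpie; that there is no single agreed PID to put here is precisely the problem being described:

```python
# A triple whose parts are identifiers rather than bare strings.
# All three URIs are hypothetical placeholders, not real PIDs.
observation = (
    "https://example.org/observation/42",          # hypothetical observation ID
    "https://example.org/vocab/observedSpecies",   # hypothetical predicate
    "https://example.org/species/gymnorhina-tibicen",
)

subject, predicate, obj = observation
# With real PIDs, each part would resolve to a description of itself.
print(obj.rsplit("/", 1)[-1])  # gymnorhina-tibicen
```

In real linked data, each of those URIs would be resolvable, so a machine following the link could determine what the observation, the relationship, and the species actually are.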
And then, a talk about standards is not a talk without an [inaudible 00:28:45] comic. Oftentimes the reason why we have so many standards is that there are a number of competing standards, and somebody says, "Oh my God, there are too many standards available; what we need to do is create one standard that covers everybody's use-cases." And then we have 15 competing standards, and then a new use-case comes up, and it's standards all the way down. So there is a bit of work to be done by the community in working out how to address some of these guiding principles. DDI and netCDF are fantastic exemplars, and in fact there will be a webinar shortly, I think in the next week, about how DDI is working on more cross-disciplinary initiatives, and we will share the registration URL; in fact, sorry, it has already been shared.
So sign up for that if you're interested; I'll certainly be watching that one myself. Now, I've been talking about linked data, and some of you might have heard of linked open data, so I thought I'd quickly mention the distinction between the two. Tim Berners-Lee came up with this very pithy description: "Linked open data is linked data which is released under an open license, which does not impede its reuse for free." We'll be talking much more about licenses and reuse in the week after next, but the idea is that linked open data is linked data, and the key thing differentiating it is its openness, how available it is for others to reuse.
Now I think I’m close to running out of time, in fact I have run out of time for speaking, so I will ask if Nicola is around, because Nicola will be facilitating our Q&A session today.
Okay, now in order for me to do this, I'm afraid I'm going to have to ask Ash to make me an organizer so that I can see the questions; our GoToWebinar threw a small tantrum, so we'll just give that a moment.
Yeah thanks for posting those links in the chat while we were-
While I was presenting. Okay it looks like you’re an organizer now.
Yes, I think it is just rebooting for me. Sorry everybody. Aha, and we’re off. Questions, so we don’t actually have any questions at the moment, so if anyone has any questions, we do have a few links that were posted by Steve and by Catherine Howard, but nothing for Matthias to answer. Does anyone have any questions for Matthias on this topic? I know it’s quite a complex one, so maybe not one where an easily formulated question comes to mind, but if there are some harder to formulate questions that come to you later, then we will be monitoring the Slack so we can get into some more nitty-gritty difficult discussion there.
We will also be asking the hard questions during the community discussions next week. So as with the previous modules, we’ll have the questions and activities available for you … oh sorry, the quiz questions, the activities, and the community discussion questions available for you later this week after the business webinar. Have I killed enough time Nicola? Any questions come in?
No, no questions yet.
Okay. All right.
Well, I mean we … aha, we have a couple of questions. So: is there a cheat sheet we can give researchers to get started with interoperability? Are there any good beginner's resources?
Not that I know of off the top of my head, unfortunately. There is an ARDC webpage on the FAIR principles that we have just updated, so I advise having a look at that one. But the struggle with this interoperability business is that in many disciplines the problem of making data interoperable hasn't necessarily been solved, and therefore it's not possible to write a cheat sheet that provides that solution. Certainly, linked open data advocates will say, "Oh, it's easy, just turn all of your data into linked open data and then it's interoperable." They say "just," implying that it's a really easy thing to do, and look, I'm sure for a linked open data specialist and advocate it is easy. But when you're used to working with tabular data, converting that tabular data into the map of connected nodes that linked open data represents is a new paradigm of thinking about data; it's a paradigm shift, a big change in how you need to think about your data in order to represent it that way.
Yes. As someone who definitely was working with tabular data, I can see that that is a difficult leap. Now I have two related questions here, are there any examples of linked data with measurement sciences where the units are sent along with the data stream, and then we also have, do you have some good examples of the use of linked open data for supporting interoperability? So good examples.
So I do not, I'm afraid. Although, no, sorry, that's not entirely true. I didn't want to go into too many examples during this talk, because when you look at the source of linked data, the XML, the RDF or the [inaudible 00:35:47] that describes it, it can be a bit overwhelming at first glance. But we'll share some examples with you for the activity. For example, Wikidata is a source of linked data that can be used for research, and what the Wikidata project is trying to achieve is to describe the world in terms of linked data.
So for example, Wikidata has a page about the city of Perth, or rather an entry, a datum, about the city of Perth, and linked to that it has all of these other things that are about Perth: the suburbs of Perth, the streets of Perth, the people of Perth, the buildings of Perth, things like that. So if you take a look at Wikidata, you'll see that it is creating this knowledge network, this knowledge graph of the world, in an attempt to make this information more useful to people who use linked open data.
Awesome. I have another question, this one calls back to last week’s discussion, can all data necessarily be presented as linked and open? Thinking about accessible.
If you ask a linked open data advocate, then yes, all data can be presented as linked and open. But remember, linked data doesn't have to be open. You could have a dataset, or a data collection, that is represented as linked data but, arcing back to the accessible side, is not open, because it might be sensitive health data. You have represented that data in a linked format, but you are keeping it closed; therefore it's linked data, but not linked open data. And in fact we aren't advocating that all data be linked open data. Sorry, to be clear: linked open data is linked data that is also open, but linked data doesn't have to be open, and accessible does not require things to be open.
Great, and I have a question of my own.
Just because you know, we have a few minutes. I’m quite curious about, in terms of the triple, that predicate … predicate? Is that the correct word?
They themselves, they would need to be defined terms, right? They need to have their own vocabularies?
Yes, exactly. So one example of that is the rifCS metadata standard. rifCS is used by Research Data Australia for the exchange of metadata about Australia's research data and the people who create that data, the people who organize and look after that data, and how that data is related to workflows, services, and activities. Behind the scenes, rifCS is based on this linked data model, and those predicates are defined as part of the schema, so when you are creating rifCS records, there are only certain predicates you're allowed to use for the record to be valid rifCS.
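A schema that only permits certain predicates can be sketched very simply. The predicate names below are invented for illustration; they are not the real rifCS relation types:

```python
# A toy schema in the spirit of what Matthias describes: only predicates
# defined in the schema produce valid records. Names are invented.
ALLOWED_PREDICATES = {"isManagedBy", "isOutputOf", "isPartOf"}

def validate_triple(triple):
    """Accept a triple only if its predicate is defined in the schema."""
    subject, predicate, obj = triple
    if predicate not in ALLOWED_PREDICATES:
        raise ValueError(f"predicate '{predicate}' is not in the schema")
    return triple

print(validate_triple(("dataset-1", "isManagedBy", "org-1"))[1])  # isManagedBy
```

Because the computer can't make assumptions the way a person can, constraining the predicate vocabulary up front is what lets software process the records reliably.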
Because again, a computer can’t make assumptions the way that a person can.
Brilliant. That makes sense. Well that’s all of the questions that we have for today. So unless anyone has anything they want to drop in in the last minute or so, I think that we are done.
Okay, great. So thank you very much for coming, and I would like to remind you that the next webinar will be tomorrow at the times on your screen. Liz will be going a bit more deeply into some of these things and certainly talking much more about metadata; I was trying to focus more on the data side of things, although in this area it gets really messy, because when you think about it, metadata is just data. I'd also like to remind you all of the link posted to the chat about that webinar from DDI on their Cross Domain Integration (CDI) initiative; we can also share that link in the Slack, because I suspect this chat will vanish once this webinar ends.
Otherwise thank you very much for coming, thank you for fielding questions for me Nicola, I appreciate it, and I will see the people in my community discussion groups next week, and the rest of you will see me in a fortnight’s time when Liz and I talk about reusable. Thank you.