Presented: 25 May 2020
Presenter: Matthias Liffers
#3 in the 8 webinar series of the FAIR data 101 training webinars.
Good morning, everybody. Welcome back to the FAIR data 101 course. My name’s Matthias Liffers, and I would like to start by acknowledging the traditional owners of the land on which we all are. I’m in Perth, so I would like to acknowledge the Whadjuk people of the Noongar nation. I would like to pay my respects to their elders past and present.
So thankfully, despite a crazy storm moving in over Perth overnight, I still have internet and power, which means I’m able to present to you this morning. What we’ll be covering today is part one of Accessible, the A in FAIR. Just a quick reminder: this course has a code of conduct, and I welcome you to review that code of conduct any time at that URL. If at any time you observe a breach of the code of conduct, could you please report it to the ARDC and we’ll follow that up.
Alright, so today’s agenda. No messing about, let’s get straight into it. I do have a fair bit to cover and it gets a little bit technical in places, so hold onto your hats. As I said, we’re covering the A in FAIR, Accessible, and specifically, out of the four guiding principles, I’m going to aim to cover three of them, because they’re all interrelated.
And they are specifically about the technical side of accessing data and metadata. So there will be quite a bit about machine accessibility of data this morning. And when it comes to machine accessibility, we’re going to be talking about protocols and the various different protocols that work together in layers. I have got a nice diagram to show you how they work together to bring data from one place to another. And I will be giving a couple of examples of how these protocols can work together to do what it is they need to do for you.
All right. So, not last week, last fortnight, two weeks ago, Liz and I covered the F in FAIR: how to find the data. And so accessibility is: we know where that data is, or the metadata is, so how do we actually get to it? How do we get our hands on it to be able to work with that data? So I’ve put all three up on the slide at once because they are all connected to each other and all related.
What we would like to see, A1, is that metadata and data are retrievable by their identifier using a standardized communications protocol. We would also like, A1.1, that protocol to be open, free and universally implementable. And then finally, A1.2, the protocol allows for an authentication and authorization procedure where necessary. And Liz will be speaking a bit more later this week about the ins and outs of where this kind of authorization might be necessary.
Why do we care about machine accessibility? I did cover that in my last webinar: given the increasing volume and speed of data generation and data collection, researchers are relying increasingly on computers to process that data for them. So I’ll give a quick, relatively simple example of where we might like to do that.
So let’s imagine that we are running an experiment that requires relatively real-time knowledge of how warm it is. Off the top of my head, say we’re looking at solar panel generation across an entire city. We’ve already got a source of data for how those solar panels are generating, but we want to see how the weather might affect that. Now we don’t necessarily want to worry about putting sensors with each and every set of solar panels, because that’s a lot of work.
Thankfully there is a well established organization that collects weather data and reports it at quite frequent intervals. I believe they update this data, in fact, every 10 minutes; it says so right there. So the Bureau of Meteorology has weather observations across the entire country and publishes those every 10 minutes to their website. If we want to incorporate this data into our experiment, one way we could do this is by visiting this page every 10 minutes, grabbing the latest figure and plugging that into our experiment, which to me sounds like an awful lot of work, even if you only have to do it during daylight hours for solar panels.
So what we would like to do is to get a computer to visit this page every 10 minutes and get that latest data for us. Now, this webpage is designed for humans. Even if we go to the HTML source of this webpage, we’ll see here that the first temperature is down here on line 296 of the HTML source. From a programming point of view, you actually have to teach the computer a fair amount to make it ignore everything up until that point and find that exact value that you’re looking for.
So what the Bureau of Meteorology has done is also make all of this stuff available in a JSON format. So in fact, if we go back to our webpage there is, no sorry, I think it’s further down on the webpage and this is just a screenshot. There is a link to this data in other formats. And if you click on that, then you can get through to this JSON file.
Now we have located it: this is the very beginning of the file, and it starts here with this brace, or curly bracket. And it’s actually not very long before the actual data, the important data, appears. That’s only about 20 lines in, and with this nice structured JSON data it is only, I think two, no three, three layers down in the nested hierarchy of JSON. And that is much easier to teach the computer to access than it would be to go through an entire HTML webpage, which was designed for humans.
So what we can do with this JSON data is get the computer to check it every 10 minutes, grab the important value and plug that into our experiment, meaning we don’t need to constantly refresh webpages unless it’s going to be a record high and we just want to talk to our workmates about the weather. So to do this, to be able to get a computer to grab data from somewhere else, we need to use protocols.
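To make the "three layers down" point concrete, here is a minimal Python sketch. The sample document, its field names (`observations`, `data`, `air_temp`) and the `poll` helper are illustrative assumptions, not a copy of the Bureau's feed; a real run would swap the stand-in fetch function for an HTTP request and wait 600 seconds between checks.

```python
import json
import time
from typing import Callable, Iterator

# Assumed sample: a small document shaped like a nested observations feed.
SAMPLE = """
{
  "observations": {
    "header": [{"name": "Example Station"}],
    "data": [
      {"local_date_time": "25/09:00am", "air_temp": 17.4},
      {"local_date_time": "25/08:50am", "air_temp": 17.1}
    ]
  }
}
"""


def latest_air_temp(document: dict) -> float:
    """Return the newest temperature, three levels down in the nesting."""
    return document["observations"]["data"][0]["air_temp"]


def poll(fetch_text: Callable[[], str], interval_seconds: float,
         times: int) -> Iterator[float]:
    """Fetch the JSON text repeatedly and yield the latest temperature."""
    for i in range(times):
        yield latest_air_temp(json.loads(fetch_text()))
        if i < times - 1:
            time.sleep(interval_seconds)  # would be 600 for every 10 minutes


# Stand-in fetch function so the sketch runs without a network connection.
readings = list(poll(lambda: SAMPLE, interval_seconds=0, times=2))
print(readings)
```

The parsing step is one line of dictionary and list lookups, which is the contrast with picking a value out of line 296 of an HTML page.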
Now you might already be familiar with the more human-centric definition of protocol, which is probably a little bit older: a protocol is an accepted code of conduct or acceptable behavior in a given situation or group. And really a computer protocol isn’t that different. It is an accepted behavior, a way to behave that is predictable and follows rules so that other computers know what is going on as well.
So the computer definition, a set of formal rules describing how to transmit or exchange data, especially across a network. Now what we would like these protocols to be, and I’ve picked some keywords out of the A1 and A1.1 guiding principles. We would like these protocols to be standard, open, free and universally implementable.
Now in terms of open and free, there can be overlapping definitions between those two. What we would ideally like in this case is for this standardized protocol to be openly available in the same way as open access, so everybody can access it without any barrier of any kind, including cost. And free could mean the same in terms of cost, or it could mean free as in everybody is free to contribute to and implement it. Free as in libre is what some open source nerds like to call it.
And finally, universally implementable. Everybody should be able to implement this standard protocol without any barriers. And there is enough detail in the standards to be able to do that. Now to make it a little more difficult. No, sorry, I’m getting ahead of myself. Why do we want all of these principles to apply? So we can trust what’s going on.
So we can trust that the computer or the program on the other end of the connection is delivering the data to us in a format we know will work, via a method we know will work, and we don’t need to worry about coding absolutely everything from scratch: from how two computers talk to each other over a blue cable, up to how the data gets from your computer on this side of the country to a computer over on the other side of the country.
And the way these protocols do this is by… No, sorry. I’m getting ahead of myself again. A lot of the protocols used do have standard identifiers. They have PIDs, in effect. So for example, you might not have heard of all of these, but hopefully you’ve heard of quite a few. You’re probably familiar with HTTP, the Hypertext Transfer Protocol. That is what web browsers use to connect to web servers and grab web pages.
Ethernet is that blue cable that you have hopefully used to plug your laptop into your router to get the best possible internet connection while you’re working from home. Otherwise you’re using your WiFi, which might not be as reliable. And then we had some questions in Liz Stokes’s webinar about XML versus JSON. If you’d really like to know the differences between the two, the standards are available via their identifiers.
So you can inspect them and compare and contrast them. Now, we see that four of the standards here are using DOIs. They’re all from the same organization, which has chosen to mint DOIs for all of its protocols. Two other standards, to do with physical network connectivity, Ethernet and WiFi, are from the IEEE. MQTT is a protocol used in the internet of things.
It’s a very lightweight, low power protocol for shuffling small amounts of data very quickly. Unfortunately, they haven’t come to the PID party, as it were. And then for XML, we have this URL that is a persistent identifier in and of itself. So you can always go to the XML standard by visiting that one.
Alright, now that’s a lot of protocols. Why is it that we need so many protocols? Well, apart from if you’re trying to transfer internet data with birds, what you might like to do is have a look at the different layers involved. So protocols are like ogres, who in turn are like onions. They have layers. And protocols build up on each other to get data moving around.
Okay. Now, I’m sorry for exposing you to this diagram. This is something that you might learn in, say, a university networking course: this idea of a layered model where protocols build up on top of each other in order to shift data around. At the very bottom, we have the link layer, which deals with the physical connections between two devices and how those two devices, the two ports, communicate with each other over a cable, or in the case of WiFi, how two radios communicate with each other and swap data between them.
On top of that, we have an internet layer, which deals with how two computers on the internet communicate with each other. They don’t care what’s happening below on the link layer. So some computers are using Ethernet and some computers are using WiFi. The internet layer doesn’t care about that. It’s dealing at a higher level, as it were.
You know what? I’m going to skip to the next slide because I’ve actually put some helpful diagrams on here. Now, as on the slide before, we’ve got the different protocols in use. Each and every one of these things on the tree diagram on the left is a protocol. And you hopefully will recognize HTTP and Ethernet. Now, HTTP is layered on top of TCP, the Transmission Control Protocol, which is layered on top of the Internet Protocol, which is layered on top of the Ethernet protocol.
Okay. So why is it that I’m showing this to you? It’s because the word protocol is an incredibly loaded term. There are a lot of protocols and no protocol really works in isolation. It depends on other protocols. It works with other protocols in order to get the job done.
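The layering can be seen in a few lines of Python. This is a minimal sketch, assuming a plain (unencrypted) HTTP server on port 80 and a placeholder host, `example.org`: the HTTP request is just structured text, and it is handed to a TCP connection that the operating system carries over IP and whatever link layer (Ethernet or WiFi) happens to be underneath.

```python
import socket


def build_http_get(host: str, path: str = "/") -> bytes:
    """Application layer: an HTTP/1.1 GET request is just structured text."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch_over_tcp(host: str, path: str = "/", port: int = 80) -> bytes:
    """Transport layer: hand the HTTP text to a TCP connection.

    IP routing and the link layer are handled below this call; the code
    never needs to know which of them is in use.
    """
    with socket.create_connection((host, port), timeout=10) as conn:
        conn.sendall(build_http_get(host, path))
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)


# Building the request needs no network; fetch_over_tcp() would send it.
request_bytes = build_http_get("example.org")
print(request_bytes.decode("ascii"))
```

In everyday work a higher-level HTTP client hides even this much, which is the point: each layer trusts the one below it.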
So if you are talking to a librarian or a researcher or somebody in the IT department about protocols, it’s important to make sure that there’s a common understanding of what kind of protocol you’re talking about. Because you could say, for example, “Well, my data is available. You have to plug a cable into my laptop and then you can get your data from me.” So if we’re using the Ethernet protocol, that is a standardized, open... actually it’s not entirely open. It does cost money to get access to standards from the IEEE, but it is a standard protocol. Is that not FAIR?
Not entirely because we’d like to have a few more protocols working hand in hand so that you don’t need to plug in a physical cable, that you can access something remotely as well. That being said, when it comes to the actual crux of the situation. Making our data and metadata available or accessing metadata and data from somebody else, we only really care about the very top layers. Because we can assume that the other layers below are going to work.
So for example, most of, touch wood, the universities in Australia use AARNet. AARNet provides the network infrastructure, provides the cables as well as the routing infrastructure, that works and that we can trust, so that we can get our data from point to point without worrying about digging trenches and putting down cables ourselves. All right. Mostly.
So, it depends on where it is you’re working. For example, AARNet’s great within Australia, but if you try to grab data from, say, a mountain top in the Himalayas, you might have to work out some other way of transmitting that data on the physical layer, because there’s not very good phone service at the top of the Himalayas. All right, let’s get into some examples of how these protocols all work with each other.
Now the first example, and I’ll go into a bit of depth on this one, is about how repositories can share metadata with each other. You may have heard of OAI-PMH, which is very widely used, especially in library and institutional repositories. It is a method used to harvest metadata, or transfer metadata, from one repository to another.
So for example, an institutional publications repository might want to have its metadata harvested by, say, a centralised search service, so that end users only have to go to one place to search all of the repositories in Australia: for example, Research Data Australia. Now, OAI-PMH stands for the Open Archives Initiative Protocol for Metadata Harvesting. You won’t be tested on that. Don’t worry.
So OAI-PMH in and of itself is a protocol. It shares Dublin Core metadata, plus other metadata standards, but let’s stick with Dublin Core for this example, which could be considered a protocol as well, as XML, yet another standard, over HTTP. So when the repositories talk to each other, they behave like a web browser and a web server. So, how does that work?
All right. So here’s a cool project at Griffith University, the Prosecution Project, and they have digitized metadata about, basically, crimes and prosecutions in colonial Australia. The data that I was looking at for it was allegedly pre-Federation; I’m not sure if they have anything post-Federation. They have a repository and they make their metadata records available through the OAI-PMH API. Actually, API: that’s a term I haven’t defined.
You’ll also hear the word API every now and again. An API is a method for two computer programs to talk to each other. So they’re not interfacing with a human; it’s the two programs talking to each other using an API, which uses a bunch of protocols to shift data around. Okay. This URL here is to access the OAI-PMH API, sorry, there are too many acronyms in this presentation, and with it you can get metadata records.
Let’s break it down. So first up we can see that it uses HTTP or, in this case, HTTPS. The S is for Secure, so the connection between the two computers is verified with a security certificate. Next up we have a host name, the same as any other kind of domain name, like www.ardc.edu.au. Exactly the same, except in this case it is oai.prosecutionproject.griffith.edu.au. Okay, nice. Then when we’re accessing that server, we ask for this file or directory, /oai; that’s the end bit of the URL.
And then after that, we are providing some instructions to the API as to what it is that we want, and they’re encoded in that URL. So first up there is a parameter called verb, and we’re saying the verb we want is ListRecords. So we’re instructing this server to give us a list of records. And then the next instruction is: please give it to us in the form of Dublin Core metadata.
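The same breakdown can be done in code with Python's standard `urllib.parse` module. This is a sketch under assumptions: the URL below is reconstructed from the description in this talk rather than copied from the slide. The parameter names `verb` and `metadataPrefix`, and the value `oai_dc` for Dublin Core, come from the OAI-PMH standard.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed URL, reconstructed from the spoken description of the slide.
URL = ("https://oai.prosecutionproject.griffith.edu.au/oai"
       "?verb=ListRecords&metadataPrefix=oai_dc")

parts = urlsplit(URL)
query = parse_qs(parts.query)

print(parts.scheme)   # the protocol: https
print(parts.netloc)   # the host name of the server
print(parts.path)     # the file or directory on that server
print(query)          # the instructions to the API


def build_oai_url(base: str, verb: str, metadata_prefix: str) -> str:
    """Assemble an OAI-PMH request URL from its parts."""
    params = urlencode({"verb": verb, "metadataPrefix": metadata_prefix})
    return f"{base}?{params}"


rebuilt = build_oai_url(
    "https://oai.prosecutionproject.griffith.edu.au/oai",
    "ListRecords",
    "oai_dc",
)
```

Changing the `verb` argument is how a harvester asks the same endpoint for something else, for example `Identify` to get a description of the repository.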
So you can actually visit this URL in your web browser, and you’ll get something that looks like this. Now, it might be different depending on when you connect to that API, because they change metadata, things like that. So the data that you get back will depend on when it is that you access this particular server. What we have here is some XML with Dublin Core embedded within it. So we can see there’s this first record here: transcription of trial record, Thomas Matthews, assault and attempt to rob in company, Melbourne, 1852.
We’ve got all sorts of metadata fields around that particular record. So what can we do with this? Well, using our own software that we create, we can harvest records, that is, take records from the Prosecution Project, and we can use those records to do our own kind of analysis. So if you’re investigating crime in pre-Federation Australia, the Prosecution Project could be a good source of data to do so.
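As a rough idea of what "our own software" might look like, the sketch below pulls Dublin Core titles out of an OAI-PMH response using Python's standard XML library. The sample response is made up for illustration (a single invented record), but the XML namespaces are the ones defined by the OAI-PMH and Dublin Core standards.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH and Dublin Core standards.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample response, shaped like a ListRecords reply.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example trial record, 1852</dc:title>
          <dc:date>1852</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def record_titles(xml_text: str) -> list:
    """Return the Dublin Core titles of every record in the response."""
    root = ET.fromstring(xml_text)
    titles = []
    for record in root.iterfind(".//oai:record", NS):
        for title in record.iterfind(".//dc:title", NS):
            titles.append(title.text)
    return titles


titles = record_titles(SAMPLE_RESPONSE)
print(titles)
```

A real harvester would fetch the XML over HTTPS and keep requesting further pages until the server reports that the list is complete.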
Okay. I’m still putting together the fine details, but I’m hoping to get everybody to use an API in the activities for this module. Hopefully that’ll be good fun. Okay. So not all standardized communications protocols are about swapping XML metadata around. Earlier I mentioned the MQTT protocol; now don’t ask me what that stands for. I’ve actually forgotten.
But it is used in the internet of things or in sensor networks. So, if you have built a sensor network or you’re a researcher, you want to build a network of sensors, say temperature loggers or humidity sensors. And you want to have them wirelessly around the building so you know what the temperature is in each room or in several buildings. You can use something called the MySensors framework to build that network of sensors.
And if you use MySensors, one of the options for getting the data from that sensor network is to get it in the JSON format over the MQTT protocol. Now, again, we haven’t necessarily spoken about the lower levels because, like HTTP, MQTT is one of these higher level ones, and it assumes that you already have the rest of the protocols set up and working. All right, now I’m getting through this a little faster than I thought I would, which is nice, because that means we’ll have plenty of time for questions.
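An MQTT message boils down to a topic name plus a payload of bytes, so handling a JSON reading is mostly a parsing job. The sketch below is hypothetical throughout: the topic layout (`sensors/<room>/<measurement>`) and the payload fields (`value`, `unit`) are invented for illustration and are not the MySensors format. In practice an MQTT client library would deliver the topic and payload to a function like this one.

```python
import json


def parse_reading(topic: str, payload: bytes) -> dict:
    """Turn one MQTT message (topic plus payload bytes) into a flat record."""
    # Hypothetical topic layout: sensors/<room>/<measurement>
    _, room, measurement = topic.split("/")
    body = json.loads(payload.decode("utf-8"))
    return {
        "room": room,
        "measurement": measurement,
        "value": body["value"],
        "unit": body.get("unit"),
    }


# A made-up message, as it might arrive from a temperature logger.
message = parse_reading(
    "sensors/lab-2/temperature",
    b'{"value": 21.5, "unit": "C"}',
)
print(message)
```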
Okay. Now, it can get quite confusing because there are so many protocols available. This would have been an excellent opportunity to put in an XKCD comic, because I’m sure they’ve got something about this. But there are lots of protocols and you can pick and choose different protocols to do what it is that you need to do. And there is almost an infinite number of combinations of protocols.
Now the good thing is that we, people working in research support, don’t need to memorize all of the communications protocols and know how they work and what the problems and pitfalls are with different protocols. We can talk to more technically minded staff. So we have research software engineers. RSEs come from either a software engineering or a research background, and they combine this understanding of research with really deep technical knowledge of software engineering.
They can build software that supports research. Or similarly there are data engineers and these engineers are familiar with all the standards and how to engineer together a system with some kind of data pipeline to get data from A to B, with a bit of processing along the way. And hopefully then now they will not be able to bamboozle you by talking about bunches of different protocols, you’ll have some understanding of what it is that they’ll be talking about.
Now, what this also means is that when it comes to sharing data or making data accessible, given the diversity of standards and protocols and things like that, it is quite unlikely that a single repository solution will serve the needs of every single researcher at an organization. So for example, your traditional data repository will be geared mostly towards having flat files. You can load it up with tabular data, a spreadsheet, or CSV, JSON or XML files, and people can grab those files, but you will not necessarily be able to offer a wealth of different APIs and protocols for harvesting that data in different ways.
Okay, here’s a good example: the Square Kilometre Array, part of which is going to be built, or is being built, north of me up in the Murchison. The sheer volume of data that the instrumentation up there generates is so huge that it’s simply not feasible or practical to involve an institutional repository. That requires incredibly specialist processing and communications equipment. And then accessing that data, again, would be handled by specialist solutions developed just for that data, but they will be using standardized communications protocols.
So that being said, the metadata and information on how to access that data could be placed in an institutional repository. But then that record in that repository would point to a different location. So very importantly, data and metadata don’t necessarily need to be co located.
Okay. This was just the tip of the iceberg, talking about some of the research infrastructure for shunting data across the world. And so there is this topic, or discipline, of infrastructure literacy: the knowledge that it would be really nice for researchers and research support professionals to have in order to get the most out of the huge amounts of money, millions if not billions of dollars, put into the infrastructure that supports us. AARNet in particular has done a pretty good job of developing modules to teach researchers how to use their infrastructure.
So Dr. Sarah King, one of their trainers, would be willing to give you further training in AARNet offerings, for example. Might be tricky at the moment. Might have to be done by Zoom, but get in touch with her if you’d like some free training. Okay. Now, I spoiled this one myself: data and metadata do not need to be co-located. All right, let’s get past that.
So up until now, I haven’t spoken about authentication and authorization at all. Now this is something that, since we’re running out of time, I’ll have to spend far less time talking about. However, everybody here should already be familiar with authentication and authorization procedures in the form of usernames and passwords. So if we are authorized to access a resource, we are provided with the authentication credentials in order to access that resource. Now the how and why of the authorization: Liz will be covering that on Wednesday.
However, usernames and passwords are good for humans, but they’re not necessarily used very much by computers when they’re talking to each other, especially via APIs. What is more common for a computer to use is an API key, and API keys should be considered as protected and private as a password. So if you are ever given an API key to access an API, treat that as securely as you would treat a password. An API key, you could consider it to be a password, just without a username.
So it’ll probably be quite long and randomly generated, and it is unique to you or to the service that is trying to access the API. You will hopefully, during the activities, be getting an API key to access Trove, which is from the National Library of Australia. So please treat that API key quite carefully. Okay. Now the infrastructure and policy.
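On treating an API key like a password: one common habit is to keep the key out of the code altogether and read it from the environment when the program runs. The sketch below is hypothetical and is not Trove's API; the endpoint, the environment variable name `EXAMPLE_API_KEY` and the use of an `Authorization` header are all assumptions, since each service documents its own way of passing the key.

```python
import os
import urllib.request


def load_api_key(env_var: str = "EXAMPLE_API_KEY") -> str:
    """Read the key from the environment so it never sits in the source."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


def build_request(url: str, api_key: str) -> urllib.request.Request:
    """Attach the key as a header (hypothetical; services differ here)."""
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )


# For demonstration only: a real key would be set outside the program.
os.environ.setdefault("EXAMPLE_API_KEY", "not-a-real-key")

request = build_request(
    "https://api.example.org/search?q=weather", load_api_key()
)
# Nothing is sent here; urllib.request.urlopen(request) would send it.
print(request.full_url)
```

Keeping the key in the environment also means the code can be shared or published without leaking the credential.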
So for authorization and authentication, the infrastructure and policy need to work together to ensure that data that needs to be kept safe is kept safe, but is still made available to those who need access to it and who are authorized to access it. So for example, there are numerous centres around Australia for data linkage, and a lot of these centres work with sensitive health information, health data. But they want to be able to link different patient records together to draw conclusions and come up with answers to research questions around health outcomes.
Now they are authorized to access that data, and they would have certain ways of being able to access that data and bring it into their secure processing environment, and it could be using these authentication procedures. So they have those systems built in, using authorization, usernames, passwords, or keys or something like that, to shunt the data around and keep it safe. It is very important to keep it safe.
Okay. So that is it for me. So next up, the next webinar will be on Wednesday, when Liz delves into this idea of the authorization to access the continuum of closed to open data and how to make sensitive things accessible while still keeping them safe. And that will be at the time on your screen. Now, I believe we have some time for questions. Are you there, Liz?
Hi Matthias. Yes. We’ve got time for questions, although there aren’t any questions in the question box or the chat at the moment, except for a “nice job tackling protocols” from one of our participants.
So I invite anyone, if this dive into protocols has got you thinking or has got you floundering: look, the floor is yours. Please have at it in the question box and share with us some of those questions. Even if it’s something like… Matthias, what was that first protocol you shared with us?
What was the first protocol I shared with you? Possibly ethernet? I can’t recall. Now, if you are still digesting and need to have a bit of time for all of that to settle a bit, before you come up with questions, you can ask me in Slack and there will be opportunity in the community discussions next week as well to have a chat about it.
Matthias, we’ve got a question and I’m going to ask you now. How much of this are researchers required to know?
As much as is required for them to be able to get their work done. Now, I like to think that researchers can rely on research, support professionals like us to know this kind of stuff for them, so that they can consult with us, get the solution that they need and then get on with their research. I mean, most people didn’t get into research to deal with administration or IT or things like that. They got into research to do the research.
However, if a researcher is working on the cutting edge of things and deploying sensor networks and things like that, it might be useful for them to know how that technology works so that they can account for that in the design of their experiment. So you can have particularly time sensitive things or huge volumes of data need special treatments. And if you do require incredible precision in an experiment, you really need to know how your data’s being generated to understand what kind of errors might crop up.
Matthias, that sounds like it might actually lead into the next question I have for you, which is: what are the key questions to consider when assessing whether an application we’re thinking of using for our data is an accessible application in terms of these types of protocols?
Unfortunately I’m now probably going to use some TLAs, three-letter acronyms. I would ask whether that particular system or software solution has APIs, and especially APIs that are well-documented, so that you and anybody else can access the documentation of that API and then construct your own solution to talk to it. Or, say, let’s go to the good old institutional repository solution.
I’d like to implement a repository, and I want to make sure that that repository can be harvested by something like Research Data Australia. I need to make sure that that solution has the correct APIs and uses the correct protocols to let Research Data Australia harvest it. So check the documentation, ask hard questions of the developers or the vendors and make sure you get what you need and what you want. And especially in terms of commercial software, make sure you get what you paid for.
Great. Thanks Matthias. I have another question for you, with an apology prefacing it, just in case this might be out of scope, and potentially it is, but I’ll ask it anyway and you can handle it. Could you talk a little bit more about how linked data works, e.g. does everyone use the same protocols in linked data?
Boy. Okay. So linked data and linked open data, certainly to me and my understanding of it... and in fact, we might get into this in the Interoperable module, so good job handballing this to future Matthias. Well, linked data is a set of principles around linking data records together with identifiers. Now there are some very common standards used for linking these data to each other.
Sorry, it’s been a little while since I’ve touched on it deeply, so I might need to bone up, but we’ll see how we go. So there is a metadata standard called RDF. I can’t remember what it stands for.
Resource Data Framework.
Not Resource Description Framework?
Yeah, Resource Description Framework. Sorry.
That’s the one. It is based on XML, so it uses XML for its structure. But linked data doesn’t have to be expressed as RDF. You can express linked data in different formats, for example, in JSON. So in short, linked data is this principle of structuring data, but it can use a variety of different standards and different protocols. So more of a paradigm, I would say.
Okay, I’ve got a couple more questions and then I think this might be it for today’s session. So here’s one, for many researchers, will the focus be on machine accessible protocols or human readable policy? What about for research support professionals or technologists and developers?
Well, it’s important to have both, to be honest, both machine readable and human readable. Because you can make your data machine accessible and machine readable, but without the human readable documentation and policy behind it that describes how these things work to humans, humans would not be able to implement them. So, as I said earlier, when you’re considering a system, make sure the documentation is up to scratch. Make sure that whatever APIs they’ve developed have good documentation, because those APIs are next to useless unless there is a way for a human to learn how they work, so that human can then develop their own software or solution to access that API. In the same way, having only a human readable policy and documentation is next to useless to a computer if there are no machine accessible or readable things to work hand in hand with it.
Nice one. Okay. So our final question, which on reading is maybe a nice round-up one. Going back to some of your earlier points, Matthias: what does the researcher need to consider for accessibility if the researcher is mainly concerned with sharing their primary data?
Okay. This would probably be more about what Liz is going to be covering on Wednesday, because primary data is incredibly valuable and incredibly personal. And I certainly understand why many researchers would be reluctant to share that data. I mean, there’s always this well founded fear of being scooped.
Although I did hear the other day that when it comes to primary data, your average researcher already has a 12 month advantage over anybody else trying to understand that data. So if you had your primary data and you made it available, then it would take 12 months for another researcher to go from getting a copy of that data to being able to understand it, analyze it, and then actually write any publications out of it. But I will otherwise handball that to you, Liz, to deal with on Wednesday.
Thank you. I shall take that on notice. All right, well, that’s it for questions. And a commendation on your answer to the question about the focus for researchers: machine accessible protocols versus human readable policy.
Thank you very much for facilitating those questions for me, Liz. Now as I said, I am on Slack. You can ask me questions there. Some people have sent me some private questions and private messages, but if you think your question could be of interest to the rest of the community, please ask in the general channel, so everybody else can answer as well. I’ve seen that there’s already been some great discussion about the Australian Data Archive. Otherwise I think that’s it for me. Was there anything more from you, Liz?
Just to remind everyone to fill in the post-webinar survey. Thank you for your feedback for our last webinars that was really valuable and we continue to look forward to your suggestions and ideas for how we’re going.
Yes. Great. Thanks for that, Liz. And thank you everyone for coming, and I will see some of you next week during our community discussions.