What is meant by the term ‘data versioning’?
A version is “a particular form of something differing in certain respects from an earlier form or other forms of the same type of thing”. We may regard a new version as having been created when a resource is revised in its structure, contents, or condition. In the research environment, we often think of versions as they pertain to resources such as manuscripts, software, or datasets.
In the case of research data, a new version of a dataset may be created when an existing dataset is reprocessed, corrected or appended with additional data. Versioning is one means by which to track changes associated with ‘dynamic’ data, i.e. data that is not static over time.
Why is data versioning important?
Researchers are required to cite and identify the exact dataset used as a research input in order to support research reproducibility and trustworthiness. This requires good management of data and data revisions, and it becomes particularly challenging where the datasets to be cited are ‘dynamic’, that is, subject to constant change and revision.
This concept is summarised well in the W3C Data on the Web Best Practices guide:
“Version information makes a revision of a dataset uniquely identifiable. Uniqueness can be used by data consumers to determine whether and how data has changed over time and to determine specifically which version of a dataset they are working with. Good data versioning enables consumers to understand if a newer version of a dataset is available. Explicit versioning allows for repeatability in research, enables comparisons, and prevents confusion. Using unique version numbers that follow a standardized approach can also set consumer expectations about how the versions differ. Intended outcome: Humans and software agents will easily be able to determine which version of a dataset they are working with.”
There is currently no agreed standard or recommendation among data communities as to how and when data should be versioned. Some data providers may not retain a history of changes to a dataset, opting to make only the most recent version available. Other data providers have documented data versioning policies or guidelines based on their own discipline’s practice, which may not be applicable to other disciplines.
There is ongoing discussion in the global community about the need for an agreed best practice for data versioning across data communities. The Research Data Alliance Data Versioning Working Group has proposed the following guidelines for data versioning:
Revision (version control):
- A new instance of a dataset that is produced in the course of data production or data management that is different from its precursor is called a “revision”.
- A dataset revision should be identified. Whether a new identifier needs to be minted will depend on the repository policies and use case.
Release (data products):
- The release of a new version of a dataset should be accompanied by a description of the nature and the significance of the change.
- The significance of this change will depend on the intended use of the data by its designated user community.
- Each new release should have a new identifier.
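The revision and release guidelines above can be illustrated with a minimal sketch. The record layout, the `make_release` helper, and the `example:` identifiers are all hypothetical; they are not part of the RDA guidelines or of any repository's API. The sketch only shows one way to enforce "a description of the change" and "a new identifier" for each release.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional
import hashlib


@dataclass(frozen=True)
class Release:
    """One published release of a dataset (hypothetical record layout)."""
    identifier: str    # identifier minted for this release
    version: str       # human-readable version label
    released: date
    checksum: str      # SHA-256 of the released bytes
    change_note: str   # nature and significance of the change


def make_release(data: bytes, identifier: str, version: str, released: date,
                 change_note: str, previous: Optional[Release] = None) -> Release:
    """Build a release record, rejecting empty notes and reused identifiers."""
    if not change_note.strip():
        raise ValueError("a release needs a description of the change")
    if previous is not None and previous.identifier == identifier:
        raise ValueError("each new release needs a new identifier")
    checksum = hashlib.sha256(data).hexdigest()
    return Release(identifier, version, released, checksum, change_note)


# Placeholder data and identifiers, for illustration only.
v1 = make_release(b"site,temp_c\nA,21.5\n", "example:ds-0001", "1.0",
                  date(2024, 1, 15), "Initial release.")
v2 = make_release(b"site,temp_c\nA,21.5\nB,19.0\n", "example:ds-0002", "1.1",
                  date(2024, 3, 1), "Appended site B; existing rows unchanged.",
                  previous=v1)
```

Whether a change such as the appended row warrants a new release at all is a policy decision that depends on the user community; the code only records the decision once it has been made.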
Granularity (aggregates, composites, collections and time series):
- Data may be aggregated and combined into collections or time series.
- The collection should be identified and versioned, as should be each of its constituent datasets.
- Entire time series should be identified, as should be time-stamped revisions.
Manifestation (data formats and encodings):
- The same dataset may be expressed in different file formats or character encodings without differences in content. While these datasets will have different checksums, the work expressed in them does not differ; they are manifestations of the same work.
- Manifestations of the same work should be individually identified and related to their parent work.
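The point about checksums can be made concrete with a small sketch. It serialises the same records as CSV and as JSON, shows that the file checksums differ, and then compares a fingerprint computed over a canonical form of the parsed records. The sample records and the `content_fingerprint` helper are assumptions made for illustration, not an established standard for identifying a work.

```python
import csv
import hashlib
import io
import json

# Hypothetical sample records; values are kept as strings so that the CSV
# and JSON manifestations parse back to identical content.
rows = [
    {"site": "A", "temp_c": "21.5"},
    {"site": "B", "temp_c": "19.0"},
]


def as_csv(records):
    """Serialise the records as CSV bytes."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["site", "temp_c"])
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue().encode("utf-8")


def as_json(records):
    """Serialise the records as JSON bytes."""
    return json.dumps(records).encode("utf-8")


def read_csv(data):
    return [dict(r) for r in csv.DictReader(io.StringIO(data.decode("utf-8")))]


def read_json(data):
    return json.loads(data.decode("utf-8"))


def file_checksum(data):
    """Checksum of one manifestation: depends on format and encoding."""
    return hashlib.sha256(data).hexdigest()


def content_fingerprint(records):
    """Hash of a canonical form of the parsed records, independent of file format."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


csv_bytes = as_csv(rows)
json_bytes = as_json(rows)

print(file_checksum(csv_bytes) == file_checksum(json_bytes))    # False
print(content_fingerprint(read_csv(csv_bytes))
      == content_fingerprint(read_json(json_bytes)))            # True
```

In this sketch row order is treated as part of the content. A repository that regards row order as insignificant would need to sort the records before fingerprinting.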
Provenance (derived products):
- The definition of revisions and releases signifies that a dataset has been derived from a precursor and is part of the description of its lineage, or provenance.
- Provenance can be more complex than following a linear path. Information accompanying a dataset release should therefore contain information on the provenance of a dataset.
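As a rough illustration of non-linear provenance, the sketch below records for each dataset the precursors it was derived from and walks that graph to list the full lineage. The dataset identifiers and the `lineage` helper are hypothetical; real repositories typically express this with a provenance vocabulary rather than a plain dictionary.

```python
from collections import deque

# Hypothetical lineage: each identifier maps to the identifiers it was derived from.
derived_from = {
    "raw-2023": [],
    "raw-2024": [],
    "cleaned-v1": ["raw-2023"],
    "cleaned-v2": ["cleaned-v1", "raw-2024"],  # two precursors, so not a linear chain
    "summary-v1": ["cleaned-v2"],
}


def lineage(dataset_id, graph):
    """Return every precursor of dataset_id, nearest first (breadth-first walk)."""
    seen = []
    queue = deque(graph[dataset_id])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already reached through another path
        seen.append(current)
        queue.extend(graph.get(current, []))
    return seen


print(lineage("summary-v1", derived_from))
# ['cleaned-v2', 'cleaned-v1', 'raw-2024', 'raw-2023']
```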
What tools are available for data versioning?
There is no one-size-fits-all solution for data versioning and tracking changes. Data come in different forms and are managed by different tools and methods. In principle, data managers should take advantage of data management tools that support versioning and track changes.
Example approaches include:
Git (and GitHub) for data (with size <10Mbit or 100,000 rows), which allows:
- effective distributed collaboration: you can take my dataset, make changes, and share those back with me (and different people can do this at once)
- provenance tracking (i.e. what changes came from where)
- sharing of updates and synchronizing datasets in a simple, effective way
ArcGIS, where users can create a geodatabase version derived from an existing version. When you create a version, you specify its name, an optional description, and the level of access other users have to the version. As the owner of the version, you can change these properties or delete the version at any time.
Citation of versioned data
There is no universal way to cite versioned data. The form of citation statement will depend on a number of factors including publisher instructions, research domain and type of data. Citations to revisable datasets are likely to include version numbers or access dates.
DataCite recommends the following format for citing data with a version number: Creator(s) (Publication Year): Title. Version. Publisher. Identifier. For example:
- Harwood, Tom; Williams, Kristen; Ferrier, Simon; Ota, Noboru; Perry, Justin; Langston, Art; Storey, Randal (2014): Nine-second gridded continental Australia change in effective area of similar ecological environments (cleared natural areas) for Amphibians 1990:1990 (GDM: AMP_r2_PTS1). V1. CSIRO. Data Collection. http://doi.org/10.4225/08/54815C68BEF05
- Harwood, Tom; Williams, Kristen; Ferrier, Simon; Ota, Noboru; Perry, Justin; Langston, Art; Storey, Randal (2014): Nine-second gridded continental Australia change in effective area of similar ecological environments (cleared natural areas) for Amphibians 1990:1990 (GDM: AMP_r2_PTS1). V2. CSIRO. Data Collection. http://doi.org/10.4225/08/5486764AD2F64
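The DataCite pattern quoted above can be sketched as a small formatting helper. The function name, the sample values, and the placeholder DOI are hypothetical, and the sketch covers only the fields named in the pattern; the CSIRO examples also carry a resource type ("Data Collection"), which is not handled here.

```python
def format_citation(creators, year, title, version, publisher, identifier):
    """Assemble 'Creator(s) (Publication Year): Title. Version. Publisher. Identifier'."""
    authors = "; ".join(creators)
    return f"{authors} ({year}): {title}. {version}. {publisher}. {identifier}"


# Placeholder values for illustration; this is not a real dataset or DOI.
citation = format_citation(
    creators=["Doe, Jane", "Roe, Richard"],
    year=2024,
    title="Example rainfall grid",
    version="V2",
    publisher="Example Repository",
    identifier="https://doi.org/10.0000/example",
)
print(citation)
# Doe, Jane; Roe, Richard (2024): Example rainfall grid. V2. Example Repository. https://doi.org/10.0000/example
```

Keeping the version as its own field means two releases of the same dataset produce citations that differ only in the version label and the identifier, which is the pattern the two citations above follow.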
Further reading
- Ball, A. & Duke, M. (2015), ‘How to Cite Datasets and Link to Publications’, Digital Curation Centre
- Klump, J., Wyborn, L., Downs, R., Asmi, A., Wu, M., Ryder, G., & Martin, J. (2020). Principles and best practices in data versioning for all data sets big and small. Research Data Alliance. DOI: 10.15497/RDA00042.
- Klump, J., Wyborn, L., Downs, R., Asmi, A., Wu, M., Ryder, G., & Martin, J. (2020). Compilation of Data Versioning Use cases from the RDA Data Versioning Working Group. Research Data Alliance. DOI: 10.15497/RDA00041
- NRC-CISTI (n.d.) Datasets and DOIs: guidelines from DataCite Canada
- Rauber, A., Asmi, A., van Uytvanck, D., & Pröll, S. (2016). Data Citation of Evolving Data: Recommendations of the Working Group on Data Citation (WGDC) (Technical Report). Denver, CO: Research Data Alliance. https://doi.org/10.15497/RDA00016
- Stanford University Libraries (n.d.) Data versioning