What is data provenance?
Data provenance is the documentation of where a piece of data comes from and the processes and methodology by which it was produced. Put simply, provenance answers the questions of why and how the data was produced, where, when and by whom.
The word provenance originates from the French verb ‘provenir’, meaning ‘to come from’; provenance is also known as ‘lineage’ or ‘pedigree’. As a practice, provenance has been used in art history to document the history of an artwork, and in digital libraries to document a digital object’s lifecycle. Similarly, recording data provenance, a type of metadata, is important for confirming the authenticity of data and enabling its reuse. At its core, provenance is about trust, credibility and reproducibility.
Why we need provenance
In data-intensive research, the data users are unlikely to be the data producers. Data producers may configure an instrument or simulation in a certain way to collect primary data, or apply certain methodologies and processes to extract, transform and analyse input data to produce an output data product. Provenance information documents these configurations, methodologies and processes.
The provision of provenance metadata as part of the published data is important for determining the quality of the data, the amount of trust one can place in the results, the reproducibility of the results and the reusability of the data.
For data users, the scientific basis of their analysis and the accountability of their research rely largely on the credibility and trustworthiness of their input data, so they may want to check data quality along with the expected level of imprecision.
How to record and manage provenance
Provenance is recorded as a type of metadata about the data product; many metadata fields routinely collected fall into the category of provenance information, e.g. date created, creator, instrument or software used and data processing methods. Good data management forms the basis of accurately recording provenance.
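As a minimal sketch, the metadata fields mentioned above could be kept in a small structured record stored alongside the data product. The class name, field names and sample values here are illustrative assumptions, not a standard schema.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class ProvenanceRecord:
    """Hypothetical minimal provenance record; field names are illustrative."""
    title: str
    creator: str
    date_created: str            # ISO 8601 date string
    instrument_or_software: str
    processing_methods: list = field(default_factory=list)


def to_json(record: ProvenanceRecord) -> str:
    """Serialise the record so it can be stored next to the data file."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Made-up example values, for illustration only.
record = ProvenanceRecord(
    title="Example dataset",
    creator="Example Researcher",
    date_created="2020-01-15",
    instrument_or_software="example-processing-tool 1.0",
    processing_methods=["quality control", "regridding"],
)
provenance_json = to_json(record)
```

Writing `provenance_json` to a file next to the data keeps the record both human readable and machine readable.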
Approaches to capturing and representing provenance can be described along several dimensions:
- recorded as a free-text string, in a generic or discipline-specific schema, or in a provenance data model
- captured internally within a software tool or program, or in an external system
- represented in machine-readable and/or human-readable form.
In its simplest form, provenance can be recorded in a single README text file that describes the data collection and processing methods used.
Alternatively, provenance information can be described directly using the W3C Provenance Data Model (PROV-DM) and Provenance Ontology (PROV-O). Provenance information captured in Dublin Core and domain-specific schemas can be mapped to a PROV-O representation, so that provenance can be viewed both at the domain-specific level and at the more abstract PROV-O level.
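The mapping idea can be sketched as follows. This is a simplified illustration that uses plain tuples rather than an RDF library; the three correspondences shown are an assumed subset and should be checked against the W3C Dublin Core to PROV mapping note before use. The `ex:` identifiers and the function name are hypothetical.

```python
# Assumed, simplified subset of a Dublin Core -> PROV-O mapping (illustrative).
DC_TO_PROV = {
    "dcterms:creator": "prov:wasAttributedTo",
    "dcterms:created": "prov:generatedAtTime",
    "dcterms:source": "prov:wasDerivedFrom",
}


def map_to_prov(subject, dc_metadata):
    """Return (subject, predicate, object) triples.

    Each Dublin Core statement is kept as-is, and its more abstract
    PROV-O counterpart is added when a mapping is known.
    """
    triples = [(subject, "rdf:type", "prov:Entity")]
    for term, value in dc_metadata.items():
        triples.append((subject, term, value))
        prov_term = DC_TO_PROV.get(term)
        if prov_term is not None:
            triples.append((subject, prov_term, value))
    return triples


# Hypothetical dataset description.
triples = map_to_prov(
    "ex:dataset1",
    {
        "dcterms:creator": "ex:researcher1",
        "dcterms:created": "2020-01-15",
        "dcterms:title": "Example dataset",
    },
)
```

Keeping both sets of statements lets the same description be read at the domain-specific level or at the PROV-O level.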
Provenance trails can be captured internally by software tools during their processing activity, for example by workflow systems such as Kepler, Galaxy or Taverna. The provenance information is typically available only to other users of the same system, or is exported to a separate provenance store. Systems that adopt the internal approach tend to capture provenance in proprietary ways. Systems that adopt an external approach often use a standard such as W3C PROV-O because they need to interact with many different kinds of systems.
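As a rough illustration of capturing provenance during processing, the sketch below wraps a processing step so that each call appends a record to a store. It is not how Kepler, Galaxy or Taverna work internally; the in-memory list stands in for an external provenance store, and all names are hypothetical.

```python
import functools
import hashlib
import json
from datetime import datetime, timezone

PROVENANCE_STORE = []  # hypothetical stand-in for an external provenance store


def _digest(value):
    """Short SHA-256 fingerprint of a JSON-serialisable value."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def record_provenance(func):
    """Wrap a processing step so each call appends a record to the store."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = datetime.now(timezone.utc).isoformat()
        result = func(*args, **kwargs)
        ended = datetime.now(timezone.utc).isoformat()
        PROVENANCE_STORE.append({
            "activity": func.__name__,
            "started_at": started,
            "ended_at": ended,
            "used": _digest({"args": args, "kwargs": kwargs}),
            "generated": _digest(result),
        })
        return result
    return wrapper


@record_provenance
def normalise(values):
    """Example processing step: scale values to the range 0..1.

    Assumes the values are not all equal.
    """
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


output = normalise([2, 4, 6])
```

Each record notes which activity ran, when it ran, and fingerprints of what it used and generated, so a later reader can check whether a data product matches the recorded run.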
Finally, provenance information can be captured in a way that supports machine-to-machine interactions (for instance, to allow resources to be identified and located and workflows to be re-run) and/or at a higher level that allows human users to easily read the provenance trail of a data product or a data processing workflow. In some cases this might just be a textual description, but it might also involve a visualisation of the machine-readable representation, such as that offered by VisTrails.
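To illustrate the two levels, the sketch below turns a machine-readable trail into a short textual description. The record layout (`activity`, `used`, `generated`, `agent`) and the file names are assumptions made for this example.

```python
def describe_trail(records):
    """Render a list of provenance records as numbered, readable sentences."""
    lines = []
    for step, rec in enumerate(records, start=1):
        inputs = ", ".join(rec["used"])
        lines.append(
            f"{step}. {rec['generated']} was produced by '{rec['activity']}' "
            f"from {inputs} (run by {rec['agent']})."
        )
    return "\n".join(lines)


# Hypothetical machine-readable trail for a two-step workflow.
trail = [
    {"activity": "clean", "used": ["raw.csv"],
     "generated": "clean.csv", "agent": "Example Researcher"},
    {"activity": "aggregate", "used": ["clean.csv", "regions.csv"],
     "generated": "summary.csv", "agent": "Example Researcher"},
]
text = describe_trail(trail)
```

The same records remain available for machine use, while `text` gives a human reader the sequence of steps at a glance.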
More provenance information
- The W3C Provenance Working Group published a family of specifications, including: PROV Primer, PROV Ontology (PROV-O), PROV Data Model (PROV-DM), PROV Notation (PROV-N), PROV Constraints, and PROV Access and Query (PROV-AQ).
- Workshop papers and presentation slides from the International Provenance and Annotation Workshop (IPAW), a biennial workshop concerned with issues of data provenance, data derivation and data annotation.
- ARDC’s Data provenance playlist on YouTube.