Institution-wide solutions for the discovery and reuse of research data collections are important because they ensure that these collections are properly managed.
Properly managed collections can be harvested and exposed to search engines, as well as to researchers and research administrators. Metadata stores are a key component of this infrastructure.
Types of metadata stores
Metadata stores can be distinguished by their coverage, the granularity of data that they describe and the specialisation of their descriptions.
Based on coverage, types of metadata stores include:
- Local metadata store: coverage over data produced by a single instrument or research group.
- Institutional metadata store: coverage over data produced across the institution, typically by a variety of research groups and disciplines. Institutional metadata store solutions tend to be generic, since their metadata descriptions cannot be discipline-specific. However, an institutional solution can be configured to provide different solutions for different disciplines.
- National metadata store: coverage over data produced across a country, by a variety of institutions. Research Data Australia is an instance of a national store.
- Discipline-specific metadata store: coverage over data produced within a discipline, across a variety of research groups, institutions, and (typically) countries.
Metadata about research collections is best created and managed close to where the research data is created, in local metadata stores, tightly integrated with research groups and their activities. This metadata should be easily accessible and relevant to researcher needs.
However, metadata stores with broader coverage are essential if data collections are to be discovered, tracked and used outside the immediate context of the research – across a discipline or an institution. Stores with broader scope are likely to have more users than local stores, and institutional and national stores use more generic formats, applicable to more domains. Stores with broader scope typically act as metadata aggregators, gathering metadata (or appropriate distillations of metadata) from local systems.
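As an illustration of the aggregation described above, the following minimal Python sketch shows a broader-scope store gathering a generic distillation of records from local stores. The field names (`title`, `description`, `creator`, `keywords`), the `source_store` key and the function names are assumptions made for this example, not part of any particular product or standard.

```python
from typing import Dict, List

# Hypothetical set of generic fields that the aggregator is assumed to keep.
GENERIC_FIELDS = ("title", "description", "creator", "keywords")


def distil(local_record: Dict[str, object]) -> Dict[str, object]:
    """Keep only the generic fields; specialist fields stay in the local store."""
    return {k: local_record[k] for k in GENERIC_FIELDS if k in local_record}


def aggregate(local_stores: Dict[str, List[Dict[str, object]]]) -> List[Dict[str, object]]:
    """Gather distilled records from several local stores, noting where each came from."""
    gathered = []
    for store_name, records in local_stores.items():
        for record in records:
            generic = distil(record)
            generic["source_store"] = store_name  # lets users trace back to the local store
            gathered.append(generic)
    return gathered
```

In this sketch a specialist field such as an instrument setting never reaches the aggregator, while the `source_store` value keeps a route back to the richer local description.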
Based on granularity, types of metadata stores include:
- Collection-level metadata store: describes data collections (collections, datasets, etc.).
- Object-level metadata store: describes individual data objects (files, database rows, spreadsheets, physical objects). Object-level stores are typically specialist, because discipline knowledge is needed to make sense of individual data objects.
- Integrated metadata store: describes both individual data objects and the collections that they comprise in the one system, and is typically coupled with storage for the data being described.
The level of specialisation of metadata within a metadata store depends on who will be using it. Both specialised (of interest to a discipline specialist) and generic (of interest to a general audience) metadata are necessary. Specialist metadata may be generated first (especially if automated), but it is usually difficult to repurpose it automatically into generic metadata.
Data capture often produces specialist metadata automatically. If a specialist store is managing data objects and the discipline needs to organise those objects into a collection, it will usually do so as an integrated store, so that the management of objects and collections is co-located.
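One way to picture an integrated store is as a single data model holding both levels of description. The sketch below is illustrative only: the class names, attributes and the `collection_record` method are hypothetical and do not describe any specific system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataObject:
    """Object-level description, e.g. values captured from a file header."""
    identifier: str
    specialist_metadata: Dict[str, str]


@dataclass
class Collection:
    """Collection-level description, held alongside the objects it groups."""
    identifier: str
    title: str
    objects: List[DataObject] = field(default_factory=list)

    def collection_record(self) -> Dict[str, object]:
        """Return a generic collection-level record derived from the same store."""
        return {
            "identifier": self.identifier,
            "title": self.title,
            "object_count": len(self.objects),
        }
```

Because objects and collections sit in the same model, a collection-level record can be produced without consulting a second system.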
Institutions are all different and have different needs and approaches. There is no single solution that fits all. Nevertheless, institutions should consider deploying an existing solution rather than duplicating development effort internally.
Local metadata stores
Local metadata stores are crucial to good research data management and to populating broad-scope metadata stores. Researchers should consider the following requirements for their local stores.
The local metadata store should:
- Store metadata that supports discovery and evaluation of data (e.g. keywords).
- Store metadata in a format which is in common use in the discipline.
- Store metadata that supports reuse of data (e.g. experimental configuration, interpretation of dependent variables, access rights – these may simply be a link to a separate file or a paper).
- Export metadata to other formats commonly used in describing metadata, especially in metadata aggregators (note that OAI-PMH requires a feed to be available in Dublin Core).
- Support aggregation of metadata (harvesting and/or syndication) to (inter)national data discovery services (like Research Data Australia and Google Dataset Search) and (inter)national discipline registries. Metadata may be harvested from a web API (such as OAI-PMH) or retrieved from structured data within repository web pages (for example the schema.org vocabulary in JSON-LD, RDFa or Microdata).
- Support automated gathering of metadata from instruments (e.g. file header), and of related metadata from other databases (e.g. instrument booking systems, HR systems, grants programs).
- Integrate in researcher workflows with minimal disruption (e.g. through web services & APIs).
- Allow error checking, validation, and use of controlled vocabularies.
- Allow metadata describing both collections and objects within collections, if that is appropriate to the discipline.
- Allow hierarchical organisation of metadata, where appropriate to the discipline (e.g. ordering metadata by project and/or experiment).
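The export and aggregation requirements in the list above can be sketched in Python using only the standard library. The example below serialises one record as simple Dublin Core XML (the `oai_dc` format used with OAI-PMH) and as a schema.org `Dataset` description in JSON-LD. The input record layout and the function names are assumptions for illustration; a real store would map many more fields and would serve the XML through an OAI-PMH endpoint rather than returning a string.

```python
import json
import xml.etree.ElementTree as ET

# Namespaces for the oai_dc container and the Dublin Core elements.
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"


def to_dublin_core(record: dict) -> str:
    """Serialise a (hypothetical) local record as an oai_dc XML fragment."""
    ET.register_namespace("oai_dc", OAI_DC)
    ET.register_namespace("dc", DC)
    root = ET.Element(f"{{{OAI_DC}}}dc")
    ET.SubElement(root, f"{{{DC}}}title").text = record["title"]
    ET.SubElement(root, f"{{{DC}}}description").text = record["description"]
    for keyword in record.get("keywords", []):
        ET.SubElement(root, f"{{{DC}}}subject").text = keyword
    return ET.tostring(root, encoding="unicode")


def to_schema_org(record: dict) -> str:
    """Serialise the same record as schema.org JSON-LD for embedding in a web page."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": record["title"],
        "description": record["description"],
        "keywords": record.get("keywords", []),
    }
    return json.dumps(doc, indent=2)
```

The JSON-LD string would typically be placed in a `<script type="application/ld+json">` element on the collection's landing page so that web crawlers can read it.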
Not all metadata store solutions will satisfy all requirements; automated metadata gathering and integration, in particular, are not widespread, and should not automatically disqualify a candidate store. All these features are worth considering in evaluating candidates, and researchers and research groups need to work out which features are priorities for them. The highest priorities are likely to be commonly used formats, hierarchical organisation and aggregation support.
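For the error checking and controlled vocabulary requirement listed earlier, a minimal Python sketch might look like the following. The required fields and the vocabulary terms are hypothetical; in practice the vocabulary would come from a discipline or institutional source rather than being hard-coded.

```python
from typing import Dict, List

# Hypothetical controlled vocabulary and required fields for this example.
ALLOWED_KEYWORDS = {"soil", "carbon", "hydrology"}
REQUIRED_FIELDS = ("title", "description", "keywords")


def validate(record: Dict[str, object]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record passed."""
    errors = []
    for name in REQUIRED_FIELDS:
        if not record.get(name):
            errors.append(f"missing required field: {name}")
    for keyword in record.get("keywords") or []:
        if keyword not in ALLOWED_KEYWORDS:
            errors.append(f"keyword not in vocabulary: {keyword}")
    return errors
```

Returning every problem at once, rather than stopping at the first, lets a researcher correct a record in a single pass.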
Descriptions of data collections should not be seen as information islands. They need to be connected to other kinds of information, which may be stored and managed in different data stores. For example, the authoritative source of truth for information about people can be HR and Research Office systems. A metadata store should reuse that metadata, rather than creating its own records. A characteristic of high-quality metadata is that it is created once and then reused as needed.
If the contextual information is common across different institutions, it is appropriate to have a common external authority for the information. A common description of a grant, project or researcher across institutions allows users to navigate between data collections that are held by different institutions but involve the same research team members.
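The "create once, reuse as needed" principle can be illustrated with a small Python sketch in which a collection record stores only an identifier for its creator and looks up the details in the authoritative system. The `HR_SYSTEM` dictionary, the `staff-0042` identifier and the field names are hypothetical stand-ins; a real deployment would query the institution's actual HR or Research Office system.

```python
# Hypothetical stand-in for an authoritative HR or Research Office system.
HR_SYSTEM = {
    "staff-0042": {"name": "A. Researcher", "department": "Earth Sciences"},
}

# The collection record holds a reference to the person, not a copy of their details.
collection = {"title": "Soil cores 2019", "creator_id": "staff-0042"}


def describe(collection_record: dict, people: dict) -> dict:
    """Resolve the creator reference against the authoritative source at display time."""
    person = people[collection_record["creator_id"]]
    return {"title": collection_record["title"], "creator": person["name"]}
```

Because the name is resolved at the time of use, a correction made in the authoritative system is reflected in every collection description without editing the metadata store.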
Deploying a metadata store solution usually involves integrating multiple sources of truth, possibly including external sources of truth. If such data has already been aggregated or centralised in the institution (e.g. as a data warehouse), it can be exploited by institutional metadata stores.