Before identifiable information can be collected, used, or shared, researchers must consider relevant legal and ethics requirements such as privacy legislation and informed consent.

While access to data should ideally be as open as possible, access to sensitive data, and especially identifiable data, should also be as closed as necessary. The identifiability of data can be reduced through techniques referred to as ‘de-identification’, ‘anonymisation’, or ‘de-personalising’. Newer approaches such as generating synthetic data also aim to reduce the identifiability of information, but these methods may not suit all research designs. For example, they may be appropriate for quantitative analysis but not for qualitative studies, where using synthetic data may reduce the validity of the research.

Regardless of the techniques used, in the current age of big data and triangulation methods there is debate about whether any method can reliably ensure the complete removal of identifiable information from data. This does not mean that data cannot be used or shared for research, but that well-defined approaches for managing and working with data must be implemented.
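To illustrate why triangulation makes complete removal of identifiable information difficult, the following minimal sketch counts how many records share the same combination of indirect identifiers. The records, the field names and the `smallest_group_size` helper are invented for illustration and do not come from any particular guideline or library; a real risk assessment would need to consider far more than this single measure.

```python
from collections import Counter

def smallest_group_size(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the same
    combination of values for the given fields. A result of 1 means at least
    one record is unique on those fields."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    # An empty dataset has no groups at all, so report 0.
    return min(groups.values(), default=0)

# Invented example records with names already removed.
records = [
    {"postcode": "3000", "birth_year": 1980, "sex": "F"},
    {"postcode": "3000", "birth_year": 1980, "sex": "F"},
    {"postcode": "2600", "birth_year": 1975, "sex": "M"},
    {"postcode": "2600", "birth_year": 1990, "sex": "F"},
]

by_postcode = smallest_group_size(records, ["postcode"])
by_combination = smallest_group_size(records, ["postcode", "birth_year", "sex"])
print(by_postcode, by_combination)
```

In this invented dataset no record is unique on postcode alone, but two records become unique once postcode, birth year and sex are combined, which is the kind of effect that linking several data sources can produce.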

Working with identifiable data

Management of identifiable data

Data may often need to be identifiable (i.e. contain personal information) during the process of research, e.g. for study administration or qualitative analysis. If data is identifiable, ethical and privacy requirements may be met through access control and data security, but establishing a well-defined data management plan before a research activity begins is the most effective way of meeting these requirements. This may include:

  • control of access through physical or digital means (e.g. passwords)
  • encryption of data, particularly if it is being moved between locations
  • ensuring data is not stored in an identifiable and unencrypted format when on easily lost items such as USB keys, laptops and external hard drives
  • taking reasonable actions to prevent the inadvertent disclosure, release or loss of sensitive personal information.

Five Safes: Working with identified data

The UK Data Service has developed the Five Safes framework to provide secure access to carry out work that would not usually be possible with de-identified data. It offers data custodians a framework for placing appropriate controls not just on the data itself, but also on the manner in which data are accessed.

In Australia, in addition to the Commonwealth legislation, which sets out thirteen privacy principles, almost every state and territory has its own privacy legislation. The Office of the Australian Information Commissioner offers links to all this legislation.

Data de-identification

‘De-identification’, ‘anonymisation’, and ‘de-personalising’ are approaches commonly undertaken to protect the privacy of individuals. The terms are sometimes used interchangeably, though there is debate about whether this is appropriate. They all aim to reduce the identifiability of data but, as mentioned above, the ability to completely remove the risk of identification is a matter of contention. To simplify the following discussion, the term ‘de-identification’ shall be used to refer to this group of methods.

In addition to protecting individuals, data de-identification may also be used to protect organisations, such as businesses, or other information such as the spatial location of mineral or archaeological findings or endangered species. Data de-identification is not an exact science and judgement calls may still need to be made when de-identifying data.
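As one illustration of protecting spatial locations, the sketch below reduces the precision of coordinates before they are shared. The function name, the example coordinates and the choice of one decimal place are assumptions made for this example only; the appropriate level of generalisation is one of the judgement calls mentioned above and depends on the data and the risk.

```python
def generalise_location(latitude, longitude, decimals=1):
    """Round coordinates in decimal degrees to fewer decimal places.

    One decimal place of latitude corresponds to roughly 11 km on the
    ground, so the exact site can no longer be read from the record.
    """
    return round(latitude, decimals), round(longitude, decimals)

# Invented example coordinates for a sensitive site.
precise = (-35.2809, 149.1300)
shared = generalise_location(*precise)
print(shared)
```

Rounding is only one possible approach. It trades analytical precision for protection, and the original coordinates would still need to be kept securely if they are required within the research team.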

It should be noted that de-identification is not a ‘magic bullet’ for being able to share and publish sensitive data. De-identification should be considered within a range of activities to protect the privacy of research participants, such as obtaining informed consent for data sharing and controlling access to the data.

Additionally, the validity of some research may be reduced if de-identified data are used for analysis (e.g. qualitative studies of oral histories, historical texts, and stories). When archiving or publishing excerpts, derivatives or aggregates of that data, however, it may be critical to mask the identity of individuals in the data or metadata to protect their privacy.

It is therefore critical to have a clear plan for managing identifiable data through all research stages and when publishing data. Understanding the requirements and risks of using identifiable data at each stage of research will inform the kinds of consent, data security, and access controls required.

Best practice basics for managing de-identification

Here are some tips to start your de-identification:

  • plan de-identification early in the research as part of your data management planning
  • retain original unedited versions of data for use within the research team and for preservation
  • create a de-identification log of all replacements, aggregations or removals made
  • store the log separately from the de-identified data files
  • identify replacements in text in a meaningful way, e.g. in transcribed interviews indicate replaced text with [brackets] or use XML markup tags such as <anon>…</anon>.
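The tips above on marking replacements and keeping a log could be sketched as follows. This is a minimal illustration under stated assumptions, not a complete de-identification tool: the `deidentify` function, the transcript, the names and the substitute labels are all invented, and simple term matching will miss misspellings, nicknames and indirect identifiers.

```python
import re

def deidentify(text, replacements):
    """Replace each identifying term with a bracketed substitute.

    Returns the edited text and a log listing every replacement made.
    Terms are applied one after another, so a substitute should not
    itself contain a term that appears later in the mapping.
    """
    log = []
    for original, substitute in replacements.items():
        # Match the term as whole words only.
        pattern = re.compile(r"\b" + re.escape(original) + r"\b")
        marked = "[" + substitute + "]"
        # A function is used as the replacement so the substitute is
        # inserted literally, even if it contains backslashes.
        text, count = pattern.subn(lambda _match: marked, text)
        if count:
            log.append(
                {"original": original, "replacement": marked, "occurrences": count}
            )
    return text, log

# Invented transcript and replacement choices.
transcript = (
    "Maria Lopez said she moved to Ballarat in 2009. "
    "Maria Lopez still works there."
)
replacements = {"Maria Lopez": "Participant 1", "Ballarat": "regional town"}

clean_text, log = deidentify(transcript, replacements)
print(clean_text)
print(log)
```

Because the log records the original terms, it is itself identifiable and would be stored separately from the de-identified files, with its own access controls.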

For more in-depth information and processes, see the resources below.

Australian practical guidance for de-identification

International practical guidance for de-identification

Qualitative and audio-visual data

When dealing with qualitative data, such as transcribed interviews or textual answers to surveys, pseudonyms or generic descriptors can be used to replace identifying information rather than blanking it out. Audio and image files can be digitally manipulated to remove identifying information. However, techniques such as voice alteration and image blurring are labour-intensive and expensive, and are likely to damage the research potential of the data.

Agreeing on the level of anonymity required during the consent process will determine what may and may not be recorded, transcribed, or shared. This can be a more effective way of creating data that accurately reflects the research process and participants’ contributions than removing sensitive information after collection. If confidentiality is an issue, it may be better to obtain the participant’s consent to use the data unaltered, but with additional access controls in place.

The UK Data Archive has advice on anonymising qualitative data, and the Irish Qualitative Data Archive has developed a tool for anonymising qualitative data.