Perspectives

Conducting Quantitative Risk Assessments for Anonymized Datasets and Documents: What This Means for Sponsors and Patient Privacy

CANTON, Mich. (10/2/2022) – Anonymization of clinical trial data and documents is now more important than ever.

In recent years, clinical trial data sharing has become a requirement as part of the regulatory process for EMA and Health Canada.

Sponsors must now anonymize the personal information of trial participants before it is shared, whether the sharing is a regulatory requirement or voluntary, to protect the participants’ privacy and stay compliant with privacy protection laws.

Due to evolving technology and the availability of clinical trial data in various forms, there is always a concern about the re-identification of trial participants, even with data anonymization.

Therefore, assessing the inherent risk of re-identifying a trial participant in the shared data is required.

Estimating the risk of re-identification in an anonymized dataset

Risk can be defined as the probability of re-identifying a trial participant. Estimating risk means determining the probability that an intruder would discover the correct identity of a single record.

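For a given record, the re-identification probability depends on how many participants in the dataset share the same combination of identifying values (the record's equivalence class). The fewer participants who share that combination, the higher the probability; it is commonly taken as one divided by the size of the equivalence class.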

The risk level (maximum or average) that needs to be considered is determined by how the data is being shared. You should consider maximum risk when the data is being shared publicly without any security controls and average risk when the data is being shared through a secured portal with security controls.
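
As an illustration of the difference between maximum and average risk, here is a minimal Python sketch. It assumes the common convention that a record's risk is one divided by the size of its equivalence class. The function name, the variable names, and the sample records are hypothetical and are not taken from any MMS tool.

```python
from collections import Counter

def reidentification_risks(records, quasi_identifiers):
    """Return (maximum risk, average risk) over all records.

    Assumption for this sketch: the risk of a record is 1 divided by the
    number of records sharing the same quasi-identifier values.
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)  # equivalence class -> number of records
    per_record = [1.0 / class_sizes[k] for k in keys]
    return max(per_record), sum(per_record) / len(per_record)

# Hypothetical, already-generalized participant records.
participants = [
    {"age_band": "30-39", "sex": "F", "country": "US"},
    {"age_band": "30-39", "sex": "F", "country": "US"},
    {"age_band": "30-39", "sex": "F", "country": "US"},
    {"age_band": "40-49", "sex": "M", "country": "CA"},
    {"age_band": "40-49", "sex": "M", "country": "CA"},
    {"age_band": "50-59", "sex": "F", "country": "US"},
]

max_risk, avg_risk = reidentification_risks(
    participants, ["age_band", "sex", "country"]
)
print(f"maximum risk: {max_risk:.2f}")  # driven by the single unique record
print(f"average risk: {avg_risk:.2f}")
```

In this toy data the one participant with a unique combination drives the maximum risk to 1.0, while the average across all six records is 0.5. This is why the maximum is the more conservative choice for public release.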

There are quite a few precedents for what can be considered an acceptable amount of risk. These precedents have been used for many decades, are consistent internationally, and have persisted over time.

Managing re-identification risk means:

(1) selecting an appropriate risk metric (e.g., k-anonymity, l-diversity, t-closeness),

(2) selecting an appropriate threshold (industry standard is to set the threshold at 0.09), and

(3) measuring the risk in the actual clinical trial dataset or documents that will be disclosed.

Once a threshold has been determined, the actual probability of re-identification is measured in the dataset.

If the probability is higher than the threshold, the data need to be transformed. These transformations may include further generalization of values into broader equivalence class categories and/or data redaction (for documents).

Otherwise, the dataset can be declared to have an acceptable risk level for re-identification.
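
The measure, compare, and transform cycle described above can be sketched as follows. This is a simplified illustration, not the MMS procedure: it assumes maximum risk as the metric, uses the 0.09 threshold mentioned earlier, and applies only one hypothetical transformation (widening age bands). All names and data are invented for the example.

```python
from collections import Counter

THRESHOLD = 0.09  # threshold cited in this article
QUASI_IDENTIFIERS = ["age", "sex"]

def max_risk(records, quasi_identifiers):
    """Maximum risk = 1 / size of the smallest equivalence class."""
    sizes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return 1.0 / min(sizes.values())

def generalize_age(record, width):
    """Replace an exact age with a band of the given width (hypothetical rule)."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}

# Hypothetical raw data: one female and one male participant at each age 36..59.
records = [{"age": a, "sex": s} for a in range(36, 60) for s in ("F", "M")]

risk = max_risk(records, QUASI_IDENTIFIERS)
chosen_width = None
for width in (5, 12):  # try progressively coarser age bands
    if risk <= THRESHOLD:
        break
    candidate = [generalize_age(r, width) for r in records]
    risk = max_risk(candidate, QUASI_IDENTIFIERS)
    chosen_width = width

print(f"age band width: {chosen_width}, maximum risk: {risk:.4f}")
print("acceptable" if risk <= THRESHOLD else "further transformation needed")
```

Each raw record here is unique, so the starting risk is 1.0. Five-year bands still leave a small class, and only the coarser 12-year bands bring every equivalence class to 12 participants and the risk below the threshold. In practice the choice of transformation also has to weigh how much analytical value the data retain.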

What about anonymized documents?

Work is ongoing within the industry to establish standards for quantitative risk assessment of anonymized and/or redacted documents.

At MMS, we have created a template in which the quantitative re-identification risk assessment includes a conservative threshold factor. This factor is based on the uniqueness of the data in the document as compared to the underlying dataset: each variable is weighted by the number of unique values in the dataset equivalence group divided by the number of participants in the document.

This methodology incorporates the number and uniqueness of the data in the document, compared to the overall dataset, in adjusting the overall risk of re-identification of the participants in the document.

The future of risk assessments

We continue to monitor research and industry trends associated with quantitative risk assessment. Our experts enhance and adjust our efforts in this area to provide cutting-edge solutions to quantify the risk of re-identification of clinical trial datasets and documents.

By: Veera Thota, Principal Statistical Programmer, and Harry Haber, Senior Principal Biostatistician

Learn more about MMS anonymization services here.

If you have questions about risk assessments or anonymizing data or documents, email info@mmsholdings.com for more information.
