'Anonymised' data can never be totally anonymous, says study

“Anonymised” data lies at the core of everything from medical research to personalised recommendations and modern AI techniques. Unfortunately, according to a paper, successfully anonymising data is practically impossible for any complex dataset.

An anonymised dataset is supposed to have had all personally identifiable information removed from it, while retaining a core of useful information for researchers to operate on without fear of invading privacy. For instance, a hospital may remove patients’ names, addresses and dates of birth from a set of health records in the hope researchers may be able to use the large sets of records to uncover hidden links between conditions.

But in practice, data can be deanonymised in a number of ways. In 2008, an anonymised Netflix dataset of film ratings was deanonymised by comparing the ratings with public scores on the IMDb film website; in 2014, the home addresses of New York taxi drivers were uncovered from an anonymous dataset of individual trips in the city; and anonymised medical billing data released by Australia’s health department could be reidentified by cross-referencing “mundane facts” such as the year of birth for older mothers and their children, or for mothers with many children.

Now researchers from Belgium’s Université catholique de Louvain (UCLouvain) and Imperial College London have built a model to estimate how easy it would be to deanonymise any arbitrary dataset. A dataset with 15 demographic attributes, for instance, “would render 99.98% of people in Massachusetts unique”. And for smaller populations, it gets easier: if town-level location data is included, for instance, “it would not take much to reidentify people living in Harwich Port, Massachusetts, a city of fewer than 2,000 inhabitants”.
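To see why adding attributes makes people stand out, it helps to count how many records in a table are the only one with their particular combination of values. The sketch below does that simple empirical count. It is not the researchers' statistical model, and the `people` records, the attribute names and the `fraction_unique` helper are all made up for illustration.

```python
from collections import Counter


def fraction_unique(records, attributes):
    """Return the share of records whose combination of the given
    attributes appears exactly once in the dataset."""
    keys = [tuple(record[a] for a in attributes) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys)


# Hypothetical toy records: no names, only "harmless" demographic attributes.
people = [
    {"town": "A", "birth_year": 1980, "sex": "F", "children": 2},
    {"town": "A", "birth_year": 1980, "sex": "F", "children": 3},
    {"town": "A", "birth_year": 1980, "sex": "M", "children": 2},
    {"town": "B", "birth_year": 1975, "sex": "M", "children": 0},
    {"town": "B", "birth_year": 1975, "sex": "M", "children": 1},
    {"town": "B", "birth_year": 1990, "sex": "F", "children": 0},
]

# Each extra attribute can only split groups further, never merge them.
for n in range(1, 5):
    attrs = ["town", "birth_year", "sex", "children"][:n]
    print(attrs, round(fraction_unique(people, attrs), 2))
```

In this toy table nobody is unique by town alone, but everybody is unique once all four attributes are combined, which is the same effect the study describes at the scale of a whole state.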

Despite this, data brokers such as Experian sell “deidentified” datasets containing vastly more information per person. The researchers highlight one, sold by that company to the computer software firm Alteryx, which contained 248 attributes per household for 120 million Americans.

The researchers, led by Luc Rocher at UCLouvain, argue their results show that anonymisation is not enough for companies to get around laws such as GDPR (general data protection regulation). “Our results reject the claims that, first, reidentification is not a practical risk and, second, sampling or releasing partial datasets provide plausible deniability.

“Moving forward, they question whether current deidentification practices satisfy the anonymisation standards of modern data protection laws such as GDPR and CCPA [California consumer privacy act] and emphasise the need to move, from a legal and regulatory perspective, beyond the deidentification release-and-forget model.”

Other approaches for handling large-scale datasets might be more in line with modern data protection needs. Differential privacy, used by companies such as Apple and Uber, deliberately fuzzes every individual data point in a way that averages out across the dataset, preventing deanonymisation by reporting technically incorrect information for each person.
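One classic way to get this "wrong for each person, right on average" behaviour is randomised response, sketched below. It is only an illustration of the general idea and is not a description of what Apple or Uber actually deploy; the function names, the 30% true rate and the 50% truth probability are assumptions chosen for the example.

```python
import random


def randomized_response(truth, p_truth, rng):
    """With probability p_truth report the real answer,
    otherwise report the result of a fair coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5


def estimate_rate(reports, p_truth):
    """Undo the known noise: observed = p_truth * true + (1 - p_truth) * 0.5."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) / 2) / p_truth


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population in which about 30% of people have a sensitive trait.
true_answers = [rng.random() < 0.3 for _ in range(100_000)]
reports = [randomized_response(t, 0.5, rng) for t in true_answers]

true_rate = sum(true_answers) / len(true_answers)
estimate = estimate_rate(reports, 0.5)
print(f"true rate {true_rate:.3f}, estimated from noisy reports {estimate:.3f}")
```

Any single report may be a coin flip, so it says little about the person who gave it, yet the estimate across the whole population stays close to the real figure.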

Homomorphic encryption involves encrypting data so it cannot be read but can still be manipulated; the results are still encrypted, but can be decrypted once returned to the data controller. And at the far end, synthetic datasets involve training an AI on real, identifiable information, then using it to generate new, fake data points that are statistically identical but do not relate to any real individual.
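The idea of computing on data that stays encrypted can be shown with a toy version of the Paillier scheme, in which multiplying two ciphertexts adds the hidden numbers. This is a minimal sketch under stated assumptions: the primes are tiny and insecure, picked only so the arithmetic is easy to follow, and the function names are invented for the example.

```python
import math
import random

# Toy key material: real deployments use primes hundreds of digits long.
P, Q = 293, 433
N = P * Q                       # public modulus
N_SQ = N * N
G = N + 1                       # standard simple choice of generator
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)   # private: lcm(P-1, Q-1)
MU = pow(LAM, -1, N)            # private: modular inverse (Python 3.8+)


def encrypt(m):
    """Encrypt an integer 0 <= m < N using the public values N and G."""
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return (pow(G, m, N_SQ) * pow(r, N, N_SQ)) % N_SQ


def decrypt(c):
    """Decrypt with the private values LAM and MU."""
    return ((pow(c, LAM, N_SQ) - 1) // N * MU) % N


def add_encrypted(c1, c2):
    """Combine two ciphertexts so the result decrypts to the sum (mod N)."""
    return (c1 * c2) % N_SQ


# A third party can add the two values without ever seeing them.
encrypted_total = add_encrypted(encrypt(20), encrypt(22))
print(decrypt(encrypted_total))
```

Only the holder of the private values can read the total, which matches the article's point that results stay encrypted until they are returned to the data controller.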

The research is published in the journal Nature Communications.