Should de-identified data still be considered personal data?

The February 26, 2020 report by Waterfront Toronto’s Digital Strategy Advisory Panel (DSAP) indicated a debate within the panel about whether de-identified (so-called anonymized) data should still be considered personal data. This is an important issue because de-identification is often regarded as a key technique to protect privacy.

De-identification involves removing or changing collected information to make it less likely to identify an individual. For example, a person’s birthdate might be removed from a health record. Or, a person’s name might be removed from a data record and replaced with a random number. The hope is that by removing this identifying information, the data can still be used for many purposes, while privacy is protected.
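The two examples above can be sketched in a few lines of Python. This is a minimal illustration only: the field names (`name`, `birthdate`, `postal_code`, `diagnosis`), the `person_id` key and the `deidentify` helper are hypothetical and do not come from the DSAP report.

```python
import secrets

def deidentify(record, pseudonyms):
    """Return a copy of `record` with the name replaced by a random ID
    and the birthdate removed. Field names are illustrative assumptions."""
    # Drop the direct identifiers.
    cleaned = {k: v for k, v in record.items() if k not in ("name", "birthdate")}
    # Reuse the same random ID for the same person so their records stay linkable.
    cleaned["person_id"] = pseudonyms.setdefault(record["name"], secrets.token_hex(8))
    return cleaned

pseudonyms = {}  # the name-to-ID mapping must itself be kept secret
record = {
    "name": "Alice Example",
    "birthdate": "1980-04-12",
    "postal_code": "M5V",
    "diagnosis": "asthma",
}
deidentified = deidentify(record, pseudonyms)
```

Note that the postal code and diagnosis survive the process. Fields like these are what keep the data useful, and they are also what can make re-identification possible.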

However, there are many documented cases in which combining de-identified data with other information, either in the public domain or purchased from data brokers, has allowed individuals to be re-identified.
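A rough sketch of how such a linkage can work is shown below. All of the data is invented, and the choice of postal code, birth year and sex as the matching fields is an assumption made for illustration; real attacks use whatever overlapping fields are available.

```python
# De-identified records: names removed, but some descriptive fields remain.
deidentified_health = [
    {"person_id": "a1", "postal_code": "M5V", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"person_id": "b2", "postal_code": "M4C", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
]

# A separate, named list that is public or can be bought (invented here).
public_list = [
    {"name": "Alice Example", "postal_code": "M5V", "birth_year": 1980, "sex": "F"},
    {"name": "Bob Example", "postal_code": "M4C", "birth_year": 1975, "sex": "M"},
    {"name": "Carol Example", "postal_code": "M4C", "birth_year": 1990, "sex": "F"},
]

# Fields present in both sources (an illustrative assumption).
SHARED_FIELDS = ("postal_code", "birth_year", "sex")

def link(deid_rows, public_rows):
    """Match de-identified rows to names when the shared fields are unique."""
    index = {}
    for row in public_rows:
        key = tuple(row[f] for f in SHARED_FIELDS)
        index.setdefault(key, []).append(row["name"])
    matches = {}
    for row in deid_rows:
        names = index.get(tuple(row[f] for f in SHARED_FIELDS), [])
        if len(names) == 1:  # exactly one candidate: the person is re-identified
            matches[names[0]] = row["diagnosis"]
    return matches

reidentified = link(deidentified_health, public_list)
# reidentified == {"Alice Example": "asthma", "Bob Example": "diabetes"}
```

In this toy case both health records are re-identified even though neither contains a name.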

There are other challenges inherent in using de-identification to protect privacy. As more information is removed from a dataset (making the de-identification more effective), the dataset becomes less useful, so there is a trade-off between privacy and utility. Another significant challenge is that as the dataset is used to answer more queries or produce more reports, the protections provided by de-identification start to erode. The answer to each query against the de-identified data can be a source of new information that helps re-identify individuals. The Fundamental Law of Information Recovery states that every statistic computed from a dataset (that is, every answer to a query on it) leaks a small amount of information about each member of the dataset. Put simply, no “anonymized” and useful dataset can be guaranteed to protect privacy.
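One simple way to see how query answers leak information is a so-called differencing attack, sketched below with invented data. The rows, field names and the `count_with_condition` helper are all hypothetical; the point is only that two harmless-looking counts can be subtracted to reveal one person's record.

```python
# A tiny de-identified dataset (invented for illustration).
rows = [
    {"person_id": "a1", "age": 34, "has_condition": True},
    {"person_id": "b2", "age": 51, "has_condition": False},
    {"person_id": "c3", "age": 67, "has_condition": True},
]

def count_with_condition(data, predicate):
    """Answer an aggregate query: how many matching people have the condition?"""
    return sum(1 for r in data if predicate(r) and r["has_condition"])

# Query 1: everyone in the dataset.
everyone = count_with_condition(rows, lambda r: True)

# Query 2: everyone except the one person known to be 67.
all_but_oldest = count_with_condition(rows, lambda r: r["age"] < 67)

# The difference is that single person's value.
leaked = everyone - all_but_oldest
# leaked == 1, so the 67-year-old has the condition
```

Neither answer names anyone, yet together they disclose a specific individual's health status to anyone who knows that person's age.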

It is therefore prudent to assume by default that de-identified data is still personal data. Other measures can also help protect privacy in de-identified data. Raw datasets (that is, line-item data as collected), even if anonymized, should rarely, if ever, be publicly released. In addition, a data custodian should keep track of the queries answered about de-identified data, to allow ongoing assessment of privacy risks.
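As a rough idea of what tracking queries could look like, the sketch below keeps a simple log of who asked what. The `QueryLog` class, its methods and the requester names are hypothetical; a real custodian would need durable storage and a proper method for assessing the accumulated risk.

```python
from datetime import datetime, timezone

class QueryLog:
    """Minimal in-memory record of queries answered from a de-identified dataset."""

    def __init__(self):
        self.entries = []

    def record(self, requester, description):
        # Store who asked, what they asked, and when (UTC).
        self.entries.append({
            "when": datetime.now(timezone.utc),
            "requester": requester,
            "description": description,
        })

    def count_for(self, requester):
        # Number of queries answered for one requester so far.
        return sum(1 for e in self.entries if e["requester"] == requester)

log = QueryLog()
log.record("researcher_a", "count of asthma cases by postal code")
log.record("researcher_a", "count of asthma cases by postal code and birth year")
log.record("researcher_b", "average age of patients")
```

A custodian could review such a log periodically, for example to notice a requester whose successive queries narrow in on ever smaller groups.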
