Privacy is at stake in every interaction with digital platforms that collect and share user data, sometimes without explicit consent. Machine learning depends on training models on collected data that may include personal information, particularly in areas such as health or insurance. Once a model has been trained, there is always a risk of re-identification, especially in the event of a cyber-attack. Recent experiments have shown that re-identifying data can be quite simple in a machine learning context. More generally, the cross-referencing of data sets and the sparsity of data in the digital space make malicious attempts at de-anonymization and re-identification possible. Preserving the anonymity and privacy of the users who provide data, throughout collection and analysis, is a major challenge for designers of data processing systems, who can now turn to differential privacy (DP) technologies.
Differential privacy preserves anonymity
Introduced in 2006, differential privacy refers to methods that protect personal data against the risk of re-identification while keeping query results useful. At the crossroads of several disciplines (data science, optimization, probability, cryptography), it allows aggregated individual data to be exploited statistically without compromising the privacy of the individuals concerned. The idea originated in the work of Cynthia Dwork and her co-authors.[i] Differential privacy is achieved by applying a process that introduces randomness into the data while preserving their usefulness for analysis.
Consider the classic example of a differentially private algorithm. Assume we are trying to estimate the proportion of drug users in a given population. The traditional approach would be to ask a representative sample of the population directly. The major flaw of this direct method is that each surveyed individual’s response compromises their privacy. A differentially private approach works as follows. For each person interviewed, a coin is tossed. If it lands on heads, the person answers truthfully. If it lands on tails, a second coin is tossed and the answer is determined by chance alone: heads on the second coin records ‘Yes, I am a consumer’, and tails records ‘No, I am not a consumer’. This process gives every individual plausible deniability, since any recorded answer may have been due to chance. With a sufficiently large sample, the pollster can still obtain a reliable estimate of the proportion of drug users from the frequency of positive answers: if the true proportion is p, a ‘Yes’ is expected with probability p/2 + 1/4, so p can be recovered from the observed frequency.
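The two-coin survey can be sketched in a few lines of Python. This is only an illustrative simulation: the function names, the sample size and the 10% true proportion are assumptions made for the example, not part of the original study design. The estimator simply inverts the relation P(yes) = p/2 + 1/4.

```python
import random


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Return the answer recorded for one respondent under the two-coin protocol."""
    if rng.random() < 0.5:
        # First coin lands on heads: the respondent answers truthfully.
        return truth
    # First coin lands on tails: a second coin decides the answer
    # (heads records 'Yes', tails records 'No').
    return rng.random() < 0.5


def estimate_proportion(answers: list[bool]) -> float:
    """Invert P(yes) = p/2 + 1/4 to estimate the true proportion p."""
    observed_yes = sum(answers) / len(answers)
    return 2.0 * (observed_yes - 0.25)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, for reproducibility only
    true_p = 0.10           # hypothetical true proportion of consumers
    population = [rng.random() < true_p for _ in range(100_000)]
    answers = [randomized_response(t, rng) for t in population]
    print(f"estimated proportion: {estimate_proportion(answers):.3f}")
```

Under this protocol a ‘Yes’ is recorded with probability 3/4 for a true consumer and 1/4 for a non-consumer. The ratio between the two is 3, which corresponds to a privacy parameter of ε = ln 3 in the usual formulation of differential privacy.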
This example highlights several fundamental properties of differential privacy. The first property (positive) is robustness to post-processing: no analysis of the recorded answer, however elaborate, can further compromise the privacy of the surveyed individual. The second property (negative) is composition: privacy loss accumulates when the mechanism is applied repeatedly. Intuitively, if the survey above is repeated on the same person a hundred times, a reliable estimate of their true answer can be obtained. Finally, the third property (positive), a remarkable one, is sub-sampling: if an individual’s probability of being included in the study is strictly lower than one, their privacy is better preserved.
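The composition effect can be illustrated with a small simulation. In this sketch, the attacker is assumed to use a simple majority vote over the repeated answers, and the function names and trial counts are hypothetical. It compares how often the true answer is guessed correctly after one survey and after a hundred.

```python
import random


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Two-coin protocol: truthful on heads, random 'Yes'/'No' on tails."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5


def guess_from_repeats(truth: bool, repeats: int, rng: random.Random) -> bool:
    """Survey the same person `repeats` times and guess by majority vote."""
    yes_count = sum(randomized_response(truth, rng) for _ in range(repeats))
    return yes_count > repeats / 2


def attack_accuracy(repeats: int, trials: int, rng: random.Random) -> float:
    """Fraction of simulated individuals whose true answer is guessed correctly."""
    correct = 0
    for _ in range(trials):
        truth = rng.random() < 0.5  # hypothetical individual, consumer or not
        if guess_from_repeats(truth, repeats, rng) == truth:
            correct += 1
    return correct / trials


if __name__ == "__main__":
    rng = random.Random(0)
    for repeats in (1, 10, 100):
        print(repeats, attack_accuracy(repeats, 5_000, rng))
```

A single answer matches the truth only three times out of four, whereas a majority over many repetitions converges on the true answer. This is why differentially private systems must keep track of how many times the same data are queried.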
Noise contributes to data privacy
Differential privacy can be achieved by adding random noise to an aggregated query result to protect individual inputs without significantly changing the result. Differentially private algorithms ensure that an attacker can learn barely anything more about an individual than they would if that person’s record were not part of the data set. One of the simplest algorithms is the Laplace mechanism, which adds noise drawn from a Laplace distribution to the results of aggregated queries. Apple and Google use differential privacy techniques in iOS and Chrome respectively. Google recently released an open-source version of the differential privacy library[ii] used by some of its products. The library was designed to help developers create products that use anonymized aggregated data in a way that preserves privacy. Differentially private algorithms have also been implemented in privacy-preserving analysis products, such as the solutions developed by Privitar.
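As an illustration, here is a minimal sketch of the Laplace mechanism applied to a counting query, written with the Python standard library only. It is not the implementation used by any of the products mentioned above. The function names and the choice of a counting query are assumptions made for the example. Adding or removing one record changes a count by at most 1 (its sensitivity), so the noise scale is taken as 1/ε, where ε is the privacy parameter.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(records: list[bool], epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the True records.

    A counting query has sensitivity 1, so the Laplace scale is 1 / epsilon.
    """
    sensitivity = 1.0
    return sum(records) + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical data set: 1,000 records, 120 of them positive.
    records = [True] * 120 + [False] * 880
    for epsilon in (0.1, 1.0, 10.0):
        print(epsilon, round(private_count(records, epsilon, rng), 2))
```

A smaller ε gives stronger privacy but a noisier answer, while a larger ε gives a more accurate answer with weaker protection.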
Designing an algorithm that satisfies differential privacy is not always straightforward. When an algorithm gives a deterministic response based on the data, it is generally impossible to make it differentially private without changing the form of the answer. The solution is to introduce random noise into the returned answer.
Differential privacy provides a strong guarantee of anonymity because it is a property of the algorithm, not of a particular output. This is what makes the approach powerful, although it remains complex to implement: adding noise tends to degrade a model’s performance, so a careful balance between privacy and accuracy must be found when building the underlying algorithm. In addition, it is impossible to verify that a model is differentially private without access to the algorithm that built it.
Major software vendors are preparing to deploy solutions that integrate differential privacy by design into their data processing methods. Let us hope this trend becomes a standard!
(by Thierry Berthier, Saint-Cyr Military School Research Centre (CREC) and Saint-Cyr Chair)
[i] Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Theory of Cryptography Conference, pp. 265-284. Springer (2006).
[ii] Google Differential Privacy Library: