Differential Privacy Explained: The Math That Lets Apple and Google Watch You Less

Differential privacy is the only formal definition of privacy that survives contact with real adversaries. It promises that the output of an analysis would have looked essentially the same whether or not you were in the dataset — and it backs that promise with mathematics. Here is how that works, and why the ε in the fine print matters more than the headline.

Cynthia Dwork, then at Microsoft Research, and her co-authors Frank McSherry, Kobbi Nissim, and Adam Smith introduced differential privacy in 2006. The motivating problem was simple: aggregate statistics released about a population (average income by ZIP code, common search terms, frequency of a medical condition) were repeatedly being shown to leak information about specific individuals. The Netflix Prize de-anonymization (Narayanan and Shmatikov, 2008) and the Massachusetts hospital records case (Sweeney, 1997) both demonstrated that "anonymized" datasets can routinely be re-identified by cross-referencing them with other data.

Earlier attempts at privacy — k-anonymity, l-diversity, t-closeness — were patches. Each had elegant counterexamples within a few years of publication. Differential privacy took a different approach: instead of trying to protect individuals after the fact, it gave a mathematical definition of what "private" means for an algorithm, and built mechanisms that provably satisfy it.

The Definition, in Plain Language

An algorithm is ε-differentially private if the distribution of its output barely changes when you add or remove any single individual's data from the input. More precisely: for any set of possible outputs, the probability that the output lands in that set on a dataset that includes you, divided by the probability of the same event on the same dataset without you, is at most e^ε, and the same bound holds with the two datasets swapped.
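In symbols, the definition is usually written as below. The names are conventional rather than taken from this article: M is the randomized algorithm, D and D' are any two datasets that differ in one individual's record, and S is any set of possible outputs.

```latex
\Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon} \cdot \Pr[\, M(D') \in S \,]
```

Because the inequality must hold for every such pair in either order, the ratio of the two probabilities is pinned between e^{-ε} and e^{ε}.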

The promise

If you are deciding whether to contribute your data to a study with ε-differential privacy, you can be confident that any conclusion an attacker draws from the output is at most e^ε times more likely than it would have been if your data had been excluded entirely. The smaller ε is, the closer to "no additional learning" you get.

The genius of the framing: differential privacy makes no assumption about what an attacker knows. It does not require that the data be anonymized, or that the attacker not have a side dataset, or that the attacker not be exceedingly clever. It says that, regardless of all that, your inclusion in the dataset cannot have caused much change in what the output reveals.

The Mechanism: Calibrated Noise

The most common way to achieve differential privacy is the Laplace mechanism. Suppose you want to release the count of how many people in a survey checked a sensitive box. The exact count, by itself, is not private — if I know everyone's answer except one person's, the published count tells me theirs.

The Laplace mechanism adds random noise to the count. The noise is drawn from a Laplace distribution with scale parameter Δf/ε, where Δf is the sensitivity of the function: the most the output could change if you added or removed one individual's data. For a count, sensitivity is 1: any one person can change the count by at most 1.

The result: the published count is the true count plus a random offset. The average size of that offset is 1/ε, so about 1 at ε = 1 and about 10 at ε = 0.1. That is small next to population-scale statistics but enough to obscure any one person's contribution. The smaller ε is (more privacy), the larger the noise. The larger ε is (less privacy), the smaller the noise and the more accurate the statistic.
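A minimal Python sketch of the Laplace mechanism for a count might look like the following. The function names laplace_noise and private_count are made up for illustration, and the sampler uses plain floating-point arithmetic, which is fine for building intuition but is known to be unsafe in production; vetted DP libraries sample noise more carefully.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace distribution centered at 0.

    The difference of two independent Exponential(1) draws follows
    Laplace(0, 1); multiplying by `scale` gives Laplace(0, scale).
    """
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_count(true_count: int, epsilon: float, rng: random.Random,
                  sensitivity: float = 1.0) -> float:
    """Release a count plus Laplace noise of scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only so the demo is repeatable
    for eps in (0.1, 1.0, 3.0):
        releases = [private_count(1000, eps, rng) for _ in range(5)]
        print(f"eps={eps}: {[round(r, 1) for r in releases]}")
```

Running the demo shows the released values scattering widely around 1000 at ε = 0.1 and hugging it closely at ε = 3.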

The ε-Budget Is the Whole Story

Differential privacy is not binary. The parameter ε is the entire game. A system advertised as "differentially private" without disclosing its ε is making no meaningful claim. Researchers in the field have come to rough informal categories:

ε	Interpretation	Practical reading
0.1	Very strong privacy	Output reveals almost nothing about individuals. Often too noisy to be useful.
1.0	Strong, balanced	Typical academic benchmark. e¹ ≈ 2.7× shift in inference.
3	Moderate	e³ ≈ 20× shift. Common in deployed systems.
10+	Weak	e¹⁰ ≈ 22,000×. Mathematical guarantee survives but its practical force is thin.
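One way to make the "shift in inference" figures concrete is to ask how far an attacker's belief about you can move. The sketch below assumes an attacker who starts with some prior probability that a fact about you is true and who is deciding between two datasets that differ only in your record; ε-differential privacy then caps their posterior odds at e^ε times their prior odds. The function name worst_case_posterior is illustrative.

```python
import math


def worst_case_posterior(prior: float, epsilon: float) -> float:
    """Upper bound on an attacker's posterior belief after seeing the output.

    epsilon-DP caps the likelihood ratio at e^epsilon, so the posterior
    odds are at most e^epsilon times the prior odds.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    odds = math.exp(epsilon) * prior / (1.0 - prior)
    return odds / (1.0 + odds)


if __name__ == "__main__":
    for eps in (0.1, 1.0, 3.0, 10.0):
        bound = worst_case_posterior(0.5, eps)
        print(f"eps={eps:<5} e^eps={math.exp(eps):10.1f} "
              f"posterior from a 50% prior <= {bound:.4f}")
```

Starting from a coin-flip prior, the bound is roughly 52% at ε = 0.1, 73% at ε = 1, and 95% at ε = 3. At ε = 10 it is so close to 100% that it offers the attacker's target almost no reassurance.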

Apple's iOS uses local differential privacy with ε reported in various analyses as somewhere in the 2 to 8 range per query (Apple has not been fully transparent here, and academic measurements have varied). The U.S. Census Bureau set a total ε of about 19.6 for its 2020 redistricting data release. Both are differentially private. They make very different practical promises.

Local vs Central Differential Privacy

There are two deployment models, and which one a system uses matters as much as ε.

Central differential privacy assumes the data collector is trusted with raw data. Individuals send their unmodified records to the collector, who runs a differentially private mechanism over the aggregate before publishing anything. This produces less noise per query but requires trusting the collector not to misuse the raw data it holds.

Local differential privacy assumes the collector is untrusted. Each individual perturbs their own data before sending it. The collector never sees the raw values. Apple's deployments are largely local; Google's RAPPOR (used in early Chrome telemetry) is local. The mathematical penalty is heavy — you need vastly more samples to get useful aggregate statistics — but the trust assumption is much weaker.
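The simplest local mechanism is randomized response on a single yes/no bit, sketched below in Python. This is a textbook illustration, not Apple's or Google's actual algorithm, and the function names are made up. Each user reports their true bit with probability e^ε / (1 + e^ε) and the opposite bit otherwise; the collector then removes the known bias from the aggregate.

```python
import math
import random
from typing import Sequence


def randomize_bit(true_bit: int, epsilon: float, rng: random.Random) -> int:
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return true_bit if rng.random() < p_truth else 1 - true_bit


def estimate_fraction(reports: Sequence[int], epsilon: float) -> float:
    """Debias the noisy reports to estimate the true fraction of 1s."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    # E[observed] = f * p_truth + (1 - f) * (1 - p_truth); solve for f.
    return (observed - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed only so the demo is repeatable
    truth = [1] * 30_000 + [0] * 70_000  # 30% of users hold the sensitive bit
    reports = [randomize_bit(b, 1.0, rng) for b in truth]
    print(f"estimated fraction: {estimate_fraction(reports, 1.0):.3f}")
```

No single report can be trusted, yet the aggregate estimate lands near the true 30%. Shrinking the population or ε makes the estimate noticeably worse, which is the sample-size penalty described above.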

What Apple Actually Does

Apple introduced differential privacy in iOS 10 (2016) and has expanded the deployment since. Specific use cases publicly disclosed include:

Identifying which emoji are most popular without learning what any individual user types
Discovering trending new words in QuickType to add to the autocomplete dictionary
Identifying which websites use disproportionate energy in Safari
Identifying which health data types users most often edit in the Health app

The implementation is local DP with per-event ε budgeted across categories. Apple publishes a technical overview but has been criticized in academic analyses for ε values that are higher than implied in marketing and for combining multiple queries that accumulate privacy loss faster than presented. Their cumulative daily privacy budget is the operative number, not the per-query ε.

What Google Does

Google has deployed differential privacy across more products than any other major company. Notable cases:

COVID-19 Community Mobility Reports published throughout the pandemic — DP applied to Maps data showing aggregate movement trends
Chrome's RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response), deployed in 2014, for collecting browser usage statistics
Google Search trends for sensitive query categories
Federated learning in Gboard, which combines on-device training with DP aggregation

Google has published more detailed analysis than Apple, including open-source DP libraries (the Google DP team's differential-privacy GitHub project) and concrete ε values for specific releases. They have also drawn academic criticism for some deployments, particularly around the cumulative budget when running many queries against the same dataset.

What Differential Privacy Does Not Do

DP is a powerful tool, but it answers a specific question: can an adversary distinguish whether you were in this dataset? It does not address:

Inferences about you from data you did contribute. If you tell a study your blood pressure, DP can hide that you specifically were in the study, but it cannot prevent the study from publishing the average blood pressure of people with your demographic profile, which a smart attacker might still combine with other knowledge.
Re-identification within the dataset. DP protects the analysis output. The underlying dataset, if leaked separately, is no more anonymous than it ever was.
Privacy when the data collector itself is the adversary. Central DP requires trusting the collector. If the collector is the threat, you need local DP.

DP is also vulnerable to misapplication. Composition rules (what happens when you apply multiple DP mechanisms to the same data) are subtle and have caught teams off-guard. Under basic sequential composition, N queries against the same dataset, each run at ε, cost N · ε in total, which means an unbounded number of queries destroys the guarantee.
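A system can enforce this by tracking spending against a fixed total and refusing queries once it runs out. The sketch below is a hypothetical PrivacyBudget class that uses only the simple additive rule described above; real accountants often use tighter composition theorems, which this does not attempt.

```python
class PrivacyBudget:
    """Tracks cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon: float) -> None:
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self) -> float:
        return max(0.0, self.total - self.spent)

    def spend(self, epsilon: float) -> None:
        """Charge one query's epsilon, or refuse if it would overspend."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so floating-point rounding does not reject
        # a query that exactly uses up the budget.
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


if __name__ == "__main__":
    budget = PrivacyBudget(total_epsilon=1.0)
    for query in range(1, 6):
        try:
            budget.spend(0.25)
            print(f"query {query}: ok, remaining {budget.remaining:.2f}")
        except RuntimeError as err:
            print(f"query {query}: refused ({err})")
```

The caller would charge the budget before running each noisy query, so a refused query never touches the data.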

Differential privacy is the only privacy definition that holds up against an unbounded adversary with unlimited side information. It is also the only privacy definition that comes with a numerical knob that directly trades off how much privacy you have for how useful the output is. Both of those properties make it the right tool for aggregate statistics — and the wrong tool for anything else.

Where This Touches Your Life

For most users, DP runs in the background. If you use iOS or Chrome, parts of your interaction telemetry are passed through a DP mechanism before leaving your device. If you contribute to a study at Microsoft Research or Google AI, your record is increasingly likely to be processed with DP guarantees. The U.S. Census 2020 data you can download is differentially private.

What you can do as a user: look for ε values when a company claims "differential privacy." A specific number is a meaningful claim. The phrase alone is marketing.

Where Haven Fits

Haven does not currently use differential privacy for any user-facing claim, because we are not in the aggregate-statistics business. Our privacy model is upstream: we minimize what we collect in the first place, and what we do collect is end-to-end encrypted between you and your contacts. The server holds ciphertext.

DP shines for systems that genuinely need population-level statistics — usage trends, error reporting, language modeling — without learning about individuals. For messaging, the right answer is closer to "the server cannot read the messages at all," which is what end-to-end encryption provides.