Differential Privacy: How Big Tech Studies You Without Studying You

Most privacy-preserving techniques are about restricting access to data: don’t share it, encrypt it, hash it, anonymise it. Differential privacy is a different idea entirely. It is a mathematical framework that lets you compute statistics over a dataset and publish the results, while guaranteeing, with provable bounds, that the published results say almost nothing about any single individual in the dataset.

This is the rare privacy technology that comes with a real theorem rather than just promises. Understanding what the theorem actually says, and what it does not, is essential to evaluating where differential privacy delivers and where it is over-claimed.

The basic idea

Imagine a database of medical records and a query that asks "what fraction of patients in this database have diabetes?" The answer is a single number. Differential privacy adds carefully calibrated random noise to that number before publishing it.

The guarantee, formalised by Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith in their 2006 paper "Calibrating Noise to Sensitivity in Private Data Analysis," is this: the noisy answer would be approximately the same whether or not any specific individual were in the database. Anyone looking at the output cannot tell, with confidence, whether you were a participant. That bound holds regardless of what auxiliary information the adversary has.

The mathematical statement is: a randomised algorithm M is ε-differentially private if, for any two databases D and D’ that differ in a single record, and any set of possible outputs S, the probability that M(D) returns something in S is at most e^ε times the probability that M(D’) returns something in S.
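In symbols, the definition above reads as follows. The notation is the conventional one rather than a quotation from the paper; S ranges over sets of possible outputs.

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,]
\qquad \text{for all neighbouring } D, D' \text{ and all output sets } S.
```

Because D and D’ can be swapped, the ratio of the two probabilities is bounded on both sides, between e^(-ε) and e^ε. For small ε, e^ε is approximately 1 + ε, so ε = 0.1 means no output becomes more than about 10% more or less likely because of any one record.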

The parameter ε (epsilon) is the privacy budget. Smaller ε means stronger privacy and noisier outputs. Larger ε means cleaner outputs and weaker privacy. The choice of ε is the central engineering decision.
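To make the noise calibration concrete, here is a minimal sketch of the Laplace mechanism, the construction introduced in the 2006 paper, applied to the diabetes-fraction query. The function names are illustrative and not from any library. It assumes the database size n is public and that neighbouring databases differ by replacing one record, so one person can move the fraction by at most 1/n.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace distribution with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_fraction(flags, epsilon, rng=None):
    """Release the fraction of True entries in `flags` with epsilon-DP."""
    if rng is None:
        rng = random.Random()
    n = len(flags)
    true_fraction = sum(flags) / n
    # Assumption: n is public and one record changes the fraction by <= 1/n.
    sensitivity = 1.0 / n
    # Laplace mechanism: noise scale = sensitivity / epsilon.
    noisy = true_fraction + laplace_noise(sensitivity / epsilon, rng)
    # Clamping to [0, 1] is post-processing and does not weaken the guarantee.
    return min(1.0, max(0.0, noisy))


if __name__ == "__main__":
    rng = random.Random(0)
    flags = [True] * 120 + [False] * 880  # hypothetical data: 12% have diabetes
    for eps in (0.1, 1.0, 10.0):
        print(eps, round(private_fraction(flags, eps, rng), 4))
```

The noise scale is inversely proportional to ε, which is the trade-off described above in one line of arithmetic. This is a teaching sketch only: naive floating-point Laplace sampling has known weaknesses, and a production system should rely on a vetted library instead.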

Why this is non-obvious

The intuitive privacy idea is "remove names from the data and we are safe." Twenty years of re-identification research has shown that this fails repeatedly. Subscribers in the anonymised Netflix Prize dataset, released in 2006, were re-identified by combining it with public IMDb data. The Massachusetts hospital records de-identified by the GIC were re-identified by Latanya Sweeney using public voter rolls. Users in AOL’s anonymised search logs were re-identified within days of release.

The pattern is: anonymisation that does not change the values of individual records leaks information about those records, and the leakage compounds when the data is combined with auxiliary sources.

Differential privacy makes a stronger guarantee. It does not rely on the adversary having limited auxiliary information. Even an adversary with complete knowledge of every individual in the database except one cannot, after seeing the differentially private output, learn meaningfully more about that one. The randomness is the protection.
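How much is "meaningfully more"? One way to read the guarantee is as a cap on how far an adversary's belief can move. The sketch below assumes the strong adversary described above, who is deciding between two neighbouring databases, and applies Bayes' rule: an ε-DP output can multiply the adversary's prior odds by at most e^ε. The function name is illustrative.

```python
import math


def max_posterior(prior, epsilon):
    """Upper bound on an adversary's belief after one epsilon-DP output.

    `prior` is the adversary's probability (strictly between 0 and 1) that
    the target's record is present before seeing the output.
    """
    prior_odds = prior / (1.0 - prior)
    # epsilon-DP bounds the likelihood ratio of any output by e^epsilon.
    bounded_odds = math.exp(epsilon) * prior_odds
    return bounded_odds / (1.0 + bounded_odds)


if __name__ == "__main__":
    for eps in (0.1, 1.0, 10.0):
        bound = max_posterior(0.5, eps)
        print(f"epsilon={eps}: a 50% prior can rise to at most {bound:.4f}")
```

Starting from a 50% prior, the cap is roughly 0.525 at ε = 0.1 and 0.731 at ε = 1, but about 0.99995 at ε = 10. The theorem holds at every ε; how much protection it buys depends entirely on the number chosen.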

The two main settings

Central differential privacy. The original setting. A trusted curator (the company, the statistics office) holds the raw data and applies noise before publishing aggregates. Google uses it for some internal analyses, and the US Census Bureau uses it for the 2020 Decennial Census public-use data. Apple’s telemetry does not follow this model.

Local differential privacy. The user’s device adds noise before sending data to the company at all. The company never sees clean values. This is a stronger trust model, because the company cannot reconstruct individual data even if compromised, but it comes at the cost of much higher noise. This is what Apple uses for its on-device telemetry.

The trade-off is fundamental. For the same ε, local DP produces far noisier aggregates than central DP, so it needs either a much larger dataset or a much higher ε to reach the same accuracy.
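The local setting can be illustrated with randomised response, the classic mechanism that systems such as RAPPOR build on. The sketch below is not Apple's or Google's actual algorithm; the function names and the 12% figure are made up for illustration. Each device reports its true bit with probability e^ε / (1 + e^ε) and the opposite bit otherwise, and the server corrects for the known flip rate.

```python
import math
import random


def randomize(bit, epsilon, rng):
    """Runs on the device: keep the true bit with prob e^eps / (1 + e^eps)."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < p_truth else not bit


def estimate_fraction(reports, epsilon):
    """Runs on the server: debias the observed fraction of True reports."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    # E[observed] = p_truth * f + (1 - p_truth) * (1 - f); solve for f.
    return (observed - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)


if __name__ == "__main__":
    rng = random.Random(0)
    truth = [i < 12_000 for i in range(100_000)]  # hypothetical: 12% True
    reports = [randomize(b, 1.0, rng) for b in truth]
    print(round(estimate_fraction(reports, 1.0), 4))
```

The server only ever sees the randomised reports. The price is accuracy: the error of this estimator shrinks like 1/√n in the number of users n, whereas central Laplace noise on the same fraction shrinks like 1/n. That gap is why local deployments need very large user bases or generous ε values.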

Real-world deployments

Apple. Local differential privacy in iOS for keyboard usage, emoji frequency, certain Safari telemetry, and a few other features. Apple’s published ε values are higher than academics consider strict: the original deployments were criticised for budgets in the range of 4 to 8 per data type per day, with the cumulative budget over time left unclear. Apple’s Differential Privacy Overview is at apple.com/privacy/docs/Differential_Privacy_Overview.pdf.

Google RAPPOR. Local DP in Chrome for browser-statistics collection, deployed since 2014. Open-sourced, well-documented, used as a teaching example. Replaced over time by other Google deployments using newer techniques.

Google’s COVID-19 Mobility Reports. Central DP applied to aggregated location data, providing a useful public-health dataset without individual-level location disclosure. Documentation explains the noise calibration.

US Census Bureau. The 2020 Decennial Census uses differential privacy at a scale never attempted before. The "Disclosure Avoidance System" applies central DP across the published tables. The deployment has been controversial, because small-population demographics are noisier than under prior swap-based approaches, but it represents the most consequential public-statistics deployment of DP. Methodology documentation is at census.gov/programs-surveys/decennial-census/decade/2020/planning-management/process/disclosure-avoidance.html.

LinkedIn. DP applied to the audience-engagement statistics shown to advertisers and content creators.

Microsoft. DP applied internally for certain telemetry analyses; published academic papers describe the deployments.

Where the guarantee holds, and where it does not

Differential privacy gives a strong guarantee about what an adversary can learn from the published output. It says nothing about:

Data that is collected and held but not published. Apple’s local DP protects what flies over the wire to Apple servers; raw data still on the device is outside the guarantee.

Side channels. Network metadata, timing, account associations.

The composition of multiple queries. A per-query guarantee says nothing about the total. Each query consumes some privacy budget, and under basic composition the ε values of successive queries add up. After enough queries, the cumulative budget exceeds tolerable thresholds and the guarantee weakens. This is why deployments must track cumulative ε across all queries against the same individuals.

Choices about what is private at all. DP tells you that a specific aggregate statistic is private; it does not tell you whether the choice to publish that statistic at all is appropriate. The Census disputes are largely about this: DP correctly applied still permits publication of fine-grained demographic breakdowns that some communities consider sensitive.

The choice of ε. Real deployments have ε ranging from 0.1 (academic strict) to 10+ (some real production systems). The privacy meaning at high ε is much weaker than at low ε; comparing deployments without comparing budgets misses the point.
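The budget-tracking point in the list above can be sketched in a few lines. This is a hypothetical accountant, not the API of any of the libraries mentioned in this article, and it assumes basic sequential composition, where the ε values of successive queries over the same individuals simply add up.

```python
class PrivacyBudget:
    """Tracks cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        """Reserve `epsilon` for one query; refuse if the total would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so floating-point rounding does not reject an exact fit.
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        return self.total - self.spent  # remaining budget


if __name__ == "__main__":
    budget = PrivacyBudget(total_epsilon=1.0)
    budget.charge(0.4)
    budget.charge(0.4)
    try:
        budget.charge(0.4)  # would bring the total to 1.2
    except RuntimeError as err:
        print("refused:", err)
```

The important design choice is that the accountant refuses the query before it runs, rather than logging an overrun afterwards. Simple addition is also the most pessimistic form of accounting; tighter composition results exist, so a real deployment may be able to answer more queries for the same stated total.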

The state in 2026

Differential privacy is no longer a niche academic technique. It is in production at multiple companies, in the largest public statistics release in the United States, and in privacy-preserving machine learning frameworks.

It is also still substantially harder to deploy than to describe. Choosing the right algorithm, calibrating ε to operational utility, accounting for budget across queries and over time, and explaining the trade-offs to non-technical stakeholders are all genuine engineering challenges.

The leading open-source libraries (Google’s differential-privacy library, OpenDP from Harvard and Microsoft, IBM’s diffprivlib, and Tumult Analytics) make the algorithms accessible. The frameworks for applying them at organisational scale are still maturing.

For privacy-aware consumers, differential privacy is a feature to look for in privacy-respecting technologies, not a magic word. When a company says "we use differential privacy," reasonable follow-up questions are: What is ε per query? What is the cumulative ε per user? Is it local or central? How is the budget tracked? Has the deployment been independently evaluated?

Apple’s, Google’s, and Census’s deployments stand up to those questions to varying degrees. Many smaller deployments do not. The math is real; the deployment quality varies.

The deeper significance is that differential privacy serves as an existence proof: privacy-preserving aggregate analysis is possible. The economic and political incentives to apply it have been growing. The open question is whether the next decade sees DP become a standard tool of large-scale data analysis or remain a niche technique used by a handful of organisations with the engineering depth to deploy it correctly.
