31.6 bits is enough to identify you. I measured how much your accounts leak.
- US population: 330 million. log₂(330M) = 28.3 bits. That's how much information uniquely identifies one American.
- Sweeney (2000): {ZIP, DOB, sex} = 31.6 bits. Enough to identify 87% of Americans.
- A typical person with 5 data broker profiles exposes 35-45 bits. Well past the uniqueness threshold.
Latanya Sweeney published a paper in 2000 that ruined any illusion of anonymity in public records. She showed that 87% of Americans can be uniquely identified by three fields: ZIP code, date of birth, and sex.
The maths behind this is straightforward. The US has about 330 million people. To uniquely identify one of them, you need log₂(330,000,000) = 28.3 bits of information. That's the threshold. Any combination of facts that exceeds 28.3 bits picks out one person.
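That arithmetic is easy to sketch in code. A minimal TypeScript helper (the function name `uniquenessThreshold` is mine, not degauss's):

```typescript
// Bits required to single out one individual from a population of N:
// any combination of facts exceeding log2(N) bits can pick out one person.
function uniquenessThreshold(population: number): number {
  return Math.log2(population);
}

const usThreshold = uniquenessThreshold(330_000_000); // ≈ 28.3 bits
```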
How quasi-identifiers contribute bits
Each piece of information contributes self-information: I(x) = -log₂ p(x). The rarer the value, the more bits it contributes.
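As a sketch, the self-information formula in TypeScript (a `selfInfo` helper of the kind degauss's entropy.ts appears to use; the frequencies are taken from the table that follows):

```typescript
// Self-information of a value observed with frequency p: I(x) = -log2 p(x).
// Rarer values contribute more bits toward identifying someone.
function selfInfo(p: number): number {
  return -Math.log2(p);
}

selfInfo(1 / 2);      // sex: 1.0 bits
selfInfo(1 / 43_000); // ZIP code: ≈ 15.4 bits
```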
```
Field            Frequency    Bits
──────────────────────────────────
Sex              1/2           1.0
Birth year       1/80          6.3
Birthday (day)   1/365         8.5
ZIP code         1/43,000     15.4
──────────────────────────────────
Total (if independent):       31.2 bits
```
31.2 bits. Against a threshold of 28.3. That's why Sweeney's result works — three mundane facts are enough to exceed the uniqueness threshold for the entire US population.
The catch: these fields aren't fully independent. ZIP code correlates with income and ethnicity. Birth year correlates weakly with name popularity. When you account for correlation, the real contribution is lower than the naive sum. But not by much — Sweeney's 87% figure holds because the correlation between DOB and ZIP is near zero.
What data brokers actually expose
I scraped my own profiles from five people-search sites. Here's what they had:
```
Quasi-identifier   Frequency    Bits   Source
──────────────────────────────────────────────────────
Full name          1/~50,000    15.6   Spokeo, WhitePages
City               1/~8,000     12.9   BeenVerified
Birth year         1/80          6.3   Spokeo
Phone (last 4)     1/10,000     13.3   TruePeopleSearch
Email domain       1/~5          2.3   breach data
──────────────────────────────────────────────────────
Naive sum:                      50.4 bits
After correlation damping:      ~38 bits
Threshold for US pop:           28.3 bits
```
38 bits against a 28.3-bit threshold. Uniquely identifiable with room to spare. And this is just the people-search sites — social media accounts, breach databases, and employer directories add more.
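The naive sum is just self-information added up. A sketch of that arithmetic (frequencies copied from the table; the field names are mine, and the total lands at ≈50.5 here versus the table's 50.4 because the table rounds each row):

```typescript
const selfInfo = (p: number): number => -Math.log2(p);

// Approximate frequencies from the five broker profiles above.
const profile: Record<string, number> = {
  fullName: 1 / 50_000,
  city: 1 / 8_000,
  birthYear: 1 / 80,
  phoneLast4: 1 / 10_000,
  emailDomain: 1 / 5,
};

const naiveSum = Object.values(profile)
  .map(selfInfo)
  .reduce((a, b) => a + b, 0); // ≈ 50.5 bits, well past the 28.3-bit threshold
```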
The correlation problem
The naive sum overestimates exposure because fields aren't independent. If you know someone's full name, knowing their email domain adds less information (because the email probably contains the name). The real quantity is the conditional mutual information: I(Identity; QI_new | QIs_known).
Computing this properly requires the joint distribution P(Identity, QI_new, QIs_known). We don't have that. Nobody does — it would require knowing the exact identity of every person in the population for every combination of quasi-identifiers.
degauss uses a heuristic: dampen each new QI's contribution by a pairwise correlation factor ρ estimated from population structure. ρ = 0.3 between name and email (moderate overlap), ρ = 0.5 between city and ZIP (high overlap), ρ = 0 between DOB and phone number (independent).
```typescript
// From entropy.ts
export function heuristicExposure(
  newFreq: number,
  correlationFactor: number = 0
): number {
  // Raw self-information of the new quasi-identifier: -log2(newFreq).
  const raw = selfInfo(newFreq);
  // Dampen by the pairwise correlation factor, capped at 0.99 so a new
  // field never contributes exactly zero bits.
  return raw * (1 - Math.min(correlationFactor, 0.99));
}
```

This is conservative: underestimating ρ overestimates exposure, which is safer than the reverse. If the tool says you're at 38 bits and the true value is 34, you're still uniquely identifiable. The other direction (tool says 25, real value is 34) would give false comfort.
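A usage sketch of the damping heuristic, restating `selfInfo` so the snippet runs standalone (the ρ value is the one quoted above for name/email overlap):

```typescript
const selfInfo = (p: number): number => -Math.log2(p);

function heuristicExposure(newFreq: number, correlationFactor = 0): number {
  const raw = selfInfo(newFreq);
  return raw * (1 - Math.min(correlationFactor, 0.99));
}

// Email domain alone: ~2.3 bits. After the name is already known (ρ = 0.3),
// it contributes only ~1.6 bits.
const undamped = heuristicExposure(1 / 5);    // ≈ 2.32
const damped = heuristicExposure(1 / 5, 0.3); // ≈ 1.63
```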
What actually reduces exposure
Removing individual accounts from people-search sites has limited effect. The data reappears within 30-90 days because these sites license data from upstream aggregators (LexisNexis, Acxiom). You need to cut at the source.
degauss models this as a directed graph: public records feed aggregators, aggregators feed people-search. Max-flow/min-cut (Edmonds-Karp, 1972) finds the minimum set of edges to sever. In practice: removing from LexisNexis and Acxiom has more impact than removing from ten downstream sites individually.
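A minimal Edmonds-Karp sketch over a hypothetical pipeline. The node layout and capacities here are invented for illustration (capacity stands in for removal difficulty), not degauss's actual graph:

```typescript
type Graph = number[][]; // adjacency matrix of edge capacities

// Edmonds-Karp: repeatedly find a shortest augmenting path by BFS and
// saturate it. The resulting max-flow value equals the min-cut cost.
function maxFlow(cap: Graph, s: number, t: number): number {
  const n = cap.length;
  const residual = cap.map((row) => row.slice());
  let flow = 0;
  for (;;) {
    // BFS for an augmenting path in the residual graph.
    const parent = new Array<number>(n).fill(-1);
    parent[s] = s;
    const queue = [s];
    let head = 0;
    while (head < queue.length) {
      const u = queue[head++];
      for (let v = 0; v < n; v++) {
        if (parent[v] === -1 && residual[u][v] > 0) {
          parent[v] = u;
          queue.push(v);
        }
      }
    }
    if (parent[t] === -1) return flow; // no augmenting path left
    // Find the bottleneck along the path, then push flow through it.
    let bottleneck = Infinity;
    for (let v = t; v !== s; v = parent[v]) {
      bottleneck = Math.min(bottleneck, residual[parent[v]][v]);
    }
    for (let v = t; v !== s; v = parent[v]) {
      residual[parent[v]][v] -= bottleneck;
      residual[v][parent[v]] += bottleneck;
    }
    flow += bottleneck;
  }
}

// Nodes: 0 = public records, 1 = aggregator, 2-4 = people-search sites,
// 5 = sink (your exposed profile). Capacity = removal difficulty.
const cap: Graph = [
  [0, 4, 0, 0, 0, 0],
  [0, 0, 3, 3, 3, 0],
  [0, 0, 0, 0, 0, 3],
  [0, 0, 0, 0, 0, 3],
  [0, 0, 0, 0, 0, 3],
  [0, 0, 0, 0, 0, 0],
];
// Min cut is the single records→aggregator edge (cost 4), cheaper than
// severing all three downstream site edges (cost 9).
const cut = maxFlow(cap, 0, 5); // 4
```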
The greedy removal algorithm sorts attributes by efficiency: bits of exposure reduced per unit of removal difficulty. If the exposure function is submodular (diminishing returns from each removal), greedy gives at least 1 − 1/e ≈ 63% of the optimal solution. Whether the function is actually submodular for identity exposure isn't proven; the guarantee is aspirational.
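The greedy ordering can be sketched as follows. The `Removal` shape and the example numbers are assumptions for illustration, not degauss's actual data model:

```typescript
interface Removal {
  name: string;
  bits: number;       // estimated exposure reduced by this removal
  difficulty: number; // relative effort to get it removed
}

// Greedy: take removals in order of bits-per-difficulty until the
// estimated exposure falls below the uniqueness threshold.
function greedyPlan(
  removals: Removal[],
  currentBits: number,
  threshold: number
): string[] {
  const pool = [...removals].sort(
    (a, b) => b.bits / b.difficulty - a.bits / a.difficulty
  );
  const plan: string[] = [];
  let bits = currentBits;
  for (const r of pool) {
    if (bits <= threshold) break;
    plan.push(r.name);
    bits -= r.bits;
  }
  return plan;
}

// One high-efficiency removal already drops 38 bits below the 28.3 threshold.
greedyPlan(
  [
    { name: "zip", bits: 15.4, difficulty: 2 },
    { name: "phoneLast4", bits: 13.3, difficulty: 1 },
    { name: "birthYear", bits: 6.3, difficulty: 1 },
  ],
  38,
  28.3
); // ["phoneLast4"]
```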
Limitations
The correlation damping is the weakest part. Without the true joint distribution, every exposure estimate is approximate. The heuristic \u03C1 values come from population structure (US Census) and domain knowledge, not from a measured dataset of re-identification attacks.
Census frequency data covers only the top 50 surnames and 60 first names out of the 162,000 available. For common names the accuracy is good. For rare names (which contribute the most bits), the tool falls back to heuristic frequency estimates based on the Census distribution tail.
The uniqueness probability model — P(unique) ≈ 1 - e^(-2^(B - log₂N)) — has the right asymptotic behaviour but isn't derived from the birthday problem or any established model. It's a sigmoid-like heuristic that transitions near the threshold. Good enough for ranking, not for quoting precise probabilities.
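For concreteness, that heuristic as code (a sketch of the formula quoted above, not degauss's exact implementation):

```typescript
// P(unique) ≈ 1 - e^(-2^(B - log2 N)): near 0 well below the threshold,
// 1 - 1/e ≈ 0.63 exactly at it, and saturating toward 1 above it.
function uniquenessProbability(bits: number, population: number): number {
  return 1 - Math.exp(-Math.pow(2, bits - Math.log2(population)));
}

uniquenessProbability(38, 330_000_000); // ≈ 1 (unique with near-certainty)
uniquenessProbability(25, 330_000_000); // ≈ 0.10 (likely not unique)
```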
- Sweeney, L. (2000). "Simple Demographics Often Identify People Uniquely." Carnegie Mellon.
- Shannon, C.E. (1948). "A Mathematical Theory of Communication." Bell System Technical Journal.
- Díaz, C. et al. (2002). "Towards Measuring Anonymity." PET 2002.
- Fellegi, I.P. & Sunter, A.B. (1969). "A Theory for Record Linkage." JASA 64(328).
- Krause, A. & Golovin, D. (2014). "Submodular Function Maximization." In: Tractability.