Writing

31.6 bits is enough to identify you. I measured how much your accounts leak.

Giuseppe Giona
Key findings
  • US population: 330 million. log₂(330M) = 28.3 bits. That's how much information uniquely identifies one American.
  • Sweeney (2000): {ZIP, DOB, sex} = 31.6 bits. Enough to identify 87% of Americans.
  • A typical person with 5 data broker profiles exposes 35-45 bits. Well past the uniqueness threshold.

Latanya Sweeney published a paper in 2000 that ruined any illusion of anonymity in public records. She showed that 87% of Americans can be uniquely identified by three fields: ZIP code, date of birth, and sex.

The maths behind this is straightforward. The US has about 330 million people. To uniquely identify one of them, you need log₂(330,000,000) = 28.3 bits of information. That's the threshold. Any combination of facts that exceeds 28.3 bits picks out one person.

How quasi-identifiers contribute bits

Each piece of information contributes self-information: I(x) = -log₂ p(x). The rarer the value, the more bits it contributes.

Field               Frequency       Bits
──────────────────────────────────────────────
Sex                 1/2             1.0
Birth year          1/80            6.3
Birthday (day)      1/365           8.5
ZIP code            1/43,000        15.4
──────────────────────────────────────────────
Total (if independent):             31.2 bits

31.2 bits. Against a threshold of 28.3. That's why Sweeney's result works: three mundane facts are enough to exceed the uniqueness threshold for the entire US population. (The headline's 31.6 bits comes out if you treat the full date of birth as a single field over a ~100-year range, log₂(36,525) ≈ 15.2 bits, plus sex and ZIP; the table's 80-year birth-year range gives 31.2.)
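The table's arithmetic is easy to reproduce. A minimal TypeScript sketch using the same frequencies (the field names are just labels, not tool output):

```typescript
// Self-information in bits: I(x) = -log₂ p(x)
const selfInfo = (p: number): number => -Math.log2(p);

// Frequencies from the table above
const fields: Record<string, number> = {
  sex: 1 / 2,
  birthYear: 1 / 80,
  birthday: 1 / 365,
  zip: 1 / 43_000,
};

// Naive total, assuming the fields are independent
const totalBits = Object.values(fields)
  .map(selfInfo)
  .reduce((a, b) => a + b, 0);

// Uniqueness threshold for the US population
const threshold = Math.log2(330_000_000);

console.log(totalBits.toFixed(1), threshold.toFixed(1)); // prints "31.2 28.3"
```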

The catch: these fields aren't fully independent. ZIP code correlates with income and ethnicity. Birth year correlates weakly with name popularity. When you account for correlation, the real contribution is lower than the naive sum. But not by much — Sweeney's 87% figure holds because the correlation between DOB and ZIP is near zero.

What data brokers actually expose

I scraped my own profiles from five people-search sites. Here's what they had:

Quasi-identifier    Frequency       Bits    Source
─────────────────────────────────────────────────────────────
Full name           1/~50,000       15.6    Spokeo, WhitePages
City                1/~8,000        12.9    BeenVerified
Birth year          1/80             6.3    Spokeo
Phone (last 4)      1/10,000        13.3    TruePeopleSearch
Email domain        1/~5             2.3    breach data
─────────────────────────────────────────────────────────────
Naive sum:                          50.4 bits
After correlation damping:          ~38 bits
Threshold for US pop:               28.3 bits

38 bits against a 28.3-bit threshold. Uniquely identifiable with room to spare. And this is just the people-search sites — social media accounts, breach databases, and employer directories add more.

The correlation problem

The naive sum overestimates exposure because fields aren't independent. If you know someone's full name, knowing their email domain adds less information (because the email probably contains the name). The real quantity is the conditional mutual information: I(Identity; QI_new | QIs_known).

Computing this properly requires the joint distribution P(Identity, QI_new, QIs_known). We don't have that. Nobody does — it would require knowing the exact identity of every person in the population for every combination of quasi-identifiers.

degauss uses a heuristic: dampen each new QI's contribution by a pairwise correlation factor ρ estimated from population structure. ρ = 0.3 between name and email (moderate overlap), ρ = 0.5 between city and ZIP (high overlap), ρ = 0 between DOB and phone number (independent).

// From entropy.ts
// Self-information I(x) = -log₂ p(x), as defined above
function selfInfo(p: number): number {
  return -Math.log2(p);
}

export function heuristicExposure(
  newFreq: number,
  correlationFactor: number = 0
): number {
  const raw = selfInfo(newFreq);
  // Damp by (1 - ρ); cap ρ at 0.99 so a QI never contributes zero bits
  return raw * (1 - Math.min(correlationFactor, 0.99));
}

This is conservative: underestimating ρ overestimates exposure, which is safer than the reverse. If the tool says you're at 38 bits and the true value is 34, you're still uniquely identifiable. The other direction (tool says 25, real value is 34) would give false comfort.
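Plugging the broker table into this damping rule gives a feel for how it behaves. The ρ values below are illustrative guesses, not degauss's actual per-pair estimates:

```typescript
// Same damping rule as heuristicExposure: bits * (1 - ρ)
const selfInfo = (p: number): number => -Math.log2(p);

function dampedBits(freq: number, rho: number): number {
  return selfInfo(freq) * (1 - Math.min(rho, 0.99));
}

// [field, frequency, illustrative ρ vs. everything already known]
const profile: Array<[string, number, number]> = [
  ["full name", 1 / 50_000, 0.0],   // first field: nothing to correlate with
  ["city", 1 / 8_000, 0.2],         // weak overlap with name via region
  ["birth year", 1 / 80, 0.0],
  ["phone last-4", 1 / 10_000, 0.0],
  ["email domain", 1 / 5, 0.3],     // email often echoes the name
];

const exposure = profile.reduce(
  (sum, [, freq, rho]) => sum + dampedBits(freq, rho),
  0
);

console.log(exposure.toFixed(1)); // still well above the 28.3-bit threshold
```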

What actually reduces exposure

Removing individual accounts from people-search sites has limited effect. The data reappears within 30-90 days because these sites license data from upstream aggregators (LexisNexis, Acxiom). You need to cut at the source.

degauss models this as a directed graph: public records feed aggregators, aggregators feed people-search. Max-flow/min-cut (Edmonds-Karp, 1972) finds the minimum set of edges to sever. In practice: removing from LexisNexis and Acxiom has more impact than removing from ten downstream sites individually.
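The min-cut idea can be sketched on a toy graph: edge capacity stands for how hard that data flow is to sever, and the max flow from public records to your profile equals the capacity of the cheapest cut. The node names and capacities below are illustrative, not degauss's actual model:

```typescript
// Toy data-flow graph: records -> aggregators -> people-search sites -> you
const nodes = ["records", "lexisnexis", "acxiom", "spokeo", "whitepages", "you"];
const idx = new Map(nodes.map((name, i) => [name, i]));
const n = nodes.length;
const cap: number[][] = Array.from({ length: n }, () => Array(n).fill(0));

function edge(u: string, v: string, c: number): void {
  cap[idx.get(u)!][idx.get(v)!] = c;
}

// Upstream: public records feed the two big aggregators (cheap to cut)
edge("records", "lexisnexis", 4);
edge("records", "acxiom", 4);
// Aggregators license data to people-search sites
for (const agg of ["lexisnexis", "acxiom"]) {
  edge(agg, "spokeo", 3);
  edge(agg, "whitepages", 3);
}
// Sites expose the profile (expensive to cut one by one)
edge("spokeo", "you", 10);
edge("whitepages", "you", 10);

// Edmonds-Karp: repeatedly augment along a shortest path found by BFS
function maxFlow(s: number, t: number): number {
  const r = cap.map((row) => row.slice()); // residual capacities
  let flow = 0;
  for (;;) {
    const parent = Array(n).fill(-1);
    parent[s] = s;
    const queue = [s];
    while (queue.length > 0) {
      const u = queue.shift()!;
      for (let v = 0; v < n; v++) {
        if (parent[v] === -1 && r[u][v] > 0) {
          parent[v] = u;
          queue.push(v);
        }
      }
    }
    if (parent[t] === -1) return flow; // no augmenting path left
    // Bottleneck along the path, then push that much flow through
    let bottleneck = Infinity;
    for (let v = t; v !== s; v = parent[v]) {
      bottleneck = Math.min(bottleneck, r[parent[v]][v]);
    }
    for (let v = t; v !== s; v = parent[v]) {
      r[parent[v]][v] -= bottleneck;
      r[v][parent[v]] += bottleneck;
    }
    flow += bottleneck;
  }
}

console.log(maxFlow(idx.get("records")!, idx.get("you")!)); // 8: the two upstream edges are the min cut
```

Cutting the two upstream edges costs 8; severing every downstream edge individually would cost far more, which is the whole argument for going after the aggregators.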

The greedy removal algorithm sorts attributes by efficiency: bits of exposure reduced per unit of removal difficulty. If the exposure function is submodular (diminishing returns from each removal), greedy gives at least 63% of the optimal solution. Whether the function is actually submodular for identity exposure isn't proven — the guarantee is aspirational.
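The greedy ordering itself is a one-line sort. Attribute names, bit counts, and difficulty scores here are made up for illustration:

```typescript
// Greedy removal: sort by efficiency = bits reduced per unit of difficulty
interface Removal {
  name: string;
  bits: number;       // exposure reduced if this attribute is removed
  difficulty: number; // effort: opt-out forms, fees, waiting periods
}

const options: Removal[] = [
  { name: "Acxiom opt-out", bits: 9, difficulty: 3 },     // efficiency 3.0
  { name: "Spokeo listing", bits: 4, difficulty: 1 },     // efficiency 4.0
  { name: "old forum account", bits: 2, difficulty: 4 },  // efficiency 0.5
];

// Highest efficiency first
const plan = [...options].sort(
  (a, b) => b.bits / b.difficulty - a.bits / a.difficulty
);

console.log(plan.map((r) => r.name));
```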

Limitations

The correlation damping is the weakest part. Without the true joint distribution, every exposure estimate is approximate. The heuristic ρ values come from population structure (US Census) and domain knowledge, not from a measured dataset of re-identification attacks.

Census frequency data covers the top 50 surnames and 60 first names. Out of 162,000 available. For common names the accuracy is good. For rare names (which contribute the most bits), the tool falls back to heuristic frequency estimates based on the Census distribution tail.

The uniqueness probability model — P(unique) ≈ 1 - e^(-2^(B - log₂N)) — has the right asymptotic behaviour but isn't derived from the birthday problem or any established model. It's a sigmoid-like heuristic that transitions near the threshold. Good enough for ranking, not for quoting precise probabilities.
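The formula's behaviour near the threshold is easy to verify. This is a direct transcription of the heuristic above, not code from degauss:

```typescript
// P(unique) ≈ 1 - e^(-2^(B - log₂ N))
function uniquenessProb(bits: number, population: number): number {
  return 1 - Math.exp(-Math.pow(2, bits - Math.log2(population)));
}

const N = 330_000_000;
console.log(uniquenessProb(28.3, N).toFixed(2)); // "0.63": exactly at the threshold, 1 - 1/e
console.log(uniquenessProb(38, N).toFixed(2));   // "1.00": well past it
```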

References
  • Sweeney, L. (2000). “Simple Demographics Often Identify People Uniquely.” Carnegie Mellon.
  • Shannon, C.E. (1948). “A Mathematical Theory of Communication.” Bell System Technical Journal.
  • Díaz, C. et al. (2002). “Towards Measuring Anonymity.” PET 2002.
  • Fellegi, I.P. & Sunter, A.B. (1969). “A Theory for Record Linkage.” JASA 64(328).
  • Krause, A. & Golovin, D. (2014). “Submodular Function Maximization.” In: Tractability.