degauss

Measures your identity exposure in bits, maps the data broker supply chain, and computes the optimal order to remove yourself from the internet. The first version just counted accounts. Then I read Sweeney's 2000 paper and the count became meaningless.

TypeScript · Shannon entropy · Fellegi-Sunter · Jaro-Winkler · Edmonds-Karp max-flow · Tor SOCKS5 · UK GDPR Art 17 · CCPA §1798.105
Source: 5,659 lines · 303 tests · 11 attack scenarios

Exposure in bits, not scores

Each exposed quasi-identifier contributes I(x) = -log₂ p(x) bits. A name shared by 1 in 10,000 contributes ~13.3 bits. A name shared by 1 in 10 contributes ~3.3 bits. Sweeney (2000) showed that 31.6 bits from {ZIP, DOB, sex} uniquely identifies 87% of Americans.

degauss computes this per-person, adjusted for correlation between fields. Name correlates with ethnicity, ZIP with income. The correlation damping is heuristic — we don't have the full joint distribution. The pairwise correlation factors are estimated from population structure. Conservative: overestimates exposure, which is safer than underestimating it.
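One way to picture pairwise damping is the sketch below. It is not the degauss scoring code: the rule (shrink each new field by its strongest correlation with a field already counted), the `PAIR_CORRELATION` table, and every number in it are assumptions made up for illustration.

```typescript
// Hypothetical pairwise damping. The factors and the rule are illustrative only.
type Field = "name" | "ethnicity" | "zip" | "income";

// Assumed symmetric correlation factors in [0, 1]; unlisted pairs count as 0.
const PAIR_CORRELATION: Record<string, number> = {
  "ethnicity|name": 0.3,
  "income|zip": 0.4,
};

function pairCorrelation(a: Field, b: Field): number {
  const key = [a, b].sort().join("|");
  return PAIR_CORRELATION[key] ?? 0;
}

/** Sum bits field by field, shrinking each new field by its strongest
 *  correlation with a field that has already been counted. */
function dampedBits(exposed: Array<{ field: Field; bits: number }>): number {
  let total = 0;
  const seen: Field[] = [];
  for (const { field, bits } of exposed) {
    const strongest = seen.reduce(
      (max, prev) => Math.max(max, pairCorrelation(field, prev)),
      0
    );
    total += bits * (1 - strongest);
    seen.push(field);
  }
  return total;
}

// Ethnicity adds only 70% of its 3 bits once name is already known.
const dampedTotal = dampedBits([
  { field: "name", bits: 8 },
  { field: "ethnicity", bits: 3 },
  { field: "zip", bits: 13.3 },
]);
console.log(dampedTotal.toFixed(1));
```

Under this rule the damped total can never exceed the independent sum. Keeping the factors small errs on the side of overestimating exposure.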

Three accounts leaking your full name, DOB, and ZIP code are worse than twenty accounts that only show a username. The entropy framework turned a vague "you're exposed" into "you're exposed by 34.2 bits, which uniquely identifies you in a population of 330 million (28.3 bits needed)."
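The comparison against the population threshold is simple arithmetic. A minimal sketch, assuming independent fields: the frequencies are illustrative placeholders rather than census values, and `surprisalBits`, `bitsNeeded`, and `anonymitySet` are hypothetical helper names, not degauss exports.

```typescript
// Illustrative frequencies only; real values would come from census tables.
const leakedFields = [
  { field: "sex", frequency: 1 / 2 },
  { field: "year of birth", frequency: 1 / 100 },
  { field: "month + day", frequency: 1 / 365 },
  { field: "ZIP", frequency: 1 / 10000 },
];

/** Self-information in bits of a value with the given relative frequency. */
function surprisalBits(frequency: number): number {
  return frequency > 0 && frequency < 1 ? -Math.log2(frequency) : 0;
}

/** Bits needed to single out one person in a population. */
function bitsNeeded(population: number): number {
  return Math.log2(population);
}

/** Expected number of people who match every leaked field. */
function anonymitySet(population: number, exposedBits: number): number {
  return population / Math.pow(2, exposedBits);
}

const US_POPULATION = 330e6;
const exposedBits = leakedFields.reduce(
  (sum, f) => sum + surprisalBits(f.frequency),
  0
);

console.log(`exposed: ${exposedBits.toFixed(1)} bits`);
console.log(`needed:  ${bitsNeeded(US_POPULATION).toFixed(1)} bits`);
console.log(`expected matches: ${anonymitySet(US_POPULATION, exposedBits).toFixed(2)}`);
```

Once the exposed bits exceed log₂(population), the expected number of matching people falls below one, which is what "uniquely identifies" means here.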

Figure 1: Cumulative exposure in bits as quasi-identifiers leak (Sweeney 2000)
[Chart. x-axis: quasi-identifiers leaked (cumulative): + sex, + year of birth, + month + day, + postcode / ZIP, + surname, + first name. y-axis: cumulative bits exposed, 0 to 40. Threshold lines: UK identified at log₂(67M) = 26.0 bits, US identified at log₂(330M) = 28.3 bits.]
Per-QI bit values, I(x) = −log₂ p(x):
  sex             1.0 b   ≈ 0.5 each
  year of birth   7.0 b   ~100-year range
  month + day     8.5 b   365 days, near-uniform
  postcode / ZIP  13.3 b  ~10⁴ areas in dense regions
  surname         4.0 b   census frequency
  first name      4.0 b   given-name frequency
Each step adds the self-information of one quasi-identifier under independence. Sweeney (2000) found that {ZIP, DOB, sex} alone uniquely identifies 87% of Americans — ≈31.6 bits when ZIP-density variance is accounted for. The horizontal lines are the unconditional thresholds log₂(population); crossing them means the anonymity set drops below one. Bits compose assuming independence; degauss applies pairwise correlation damping at scoring time (name ↔ ethnicity, ZIP ↔ income, etc.), so the production score is slightly below this curve at the same QI count.

Probabilistic record linkage

Two broker profiles list "G. Giona, Bradford" and "Giuseppe Giona, BD18". Are these the same person? The Fellegi-Sunter model gives a principled answer. For each shared field: w(agree) = log₂(m/u), where m = P(agree | true match) and u = P(agree | non-match), the chance that two different people agree by coincidence. Email agreement gives ~16 bits. Name agreement gives ~6 bits. The field weights sum to a composite weight, which converts to a probability via a sigmoid.
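A minimal sketch of that last step, assuming the sigmoid is the base-2 logistic applied to the summed weights. The `priorLog2Odds` term, the candidate-pool size, and the rounded per-field weights are assumptions for illustration; the real linkage module may handle the prior differently.

```typescript
/** Agreement weight in bits: log2(m / u). */
function agreeWeight(m: number, u: number): number {
  return Math.log2(m / u);
}

/** Disagreement weight in bits: log2((1 - m) / (1 - u)). */
function disagreeWeight(m: number, u: number): number {
  return Math.log2((1 - m) / (1 - u));
}

/** Base-2 logistic: turns log2 odds into a probability. */
function matchProbability(compositeWeight: number, priorLog2Odds = 0): number {
  return 1 / (1 + Math.pow(2, -(compositeWeight + priorLog2Odds)));
}

// Assumed prior: roughly one true match among a million candidate pairs.
const priorLog2Odds = Math.log2(1 / 1e6);

// Rounded weights taken from the text: name ~6 bits, email ~16 bits.
const nameOnly = matchProbability(6, priorLog2Odds);
const nameAndEmail = matchProbability(6 + 16, priorLog2Odds);

console.log(nameOnly.toExponential(1), nameAndEmail.toFixed(2));
```

With that prior, name agreement alone leaves the match very unlikely, while name plus email pushes the probability to roughly 0.8. Working in bits keeps the field weights additive until the final conversion.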

I spent a week getting the m-probabilities right. The initial values (0.99 for everything) produced too many false matches. The real values are lower: 0.85 for address (people move), 0.75 for job title (same role, different wording), 0.50 for IP address (changes constantly). Name comparison uses Jaro-Winkler similarity (Jaro 1989, Winkler extension) with a 0.85 threshold.
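For reference, a compact Jaro-Winkler sketch. It follows the textbook definition (matching window of half the longer length minus one, prefix bonus capped at four characters, scaling 0.1) and is not copied from the degauss source. The `namesAgree` wrapper, the lowercase normalisation, and applying the 0.85 threshold with `>=` are assumptions.

```typescript
/** Jaro similarity in [0, 1]. */
function jaro(a: string, b: string): number {
  if (a === b) return 1;
  if (a.length === 0 || b.length === 0) return 0;
  // Characters match only if they are equal and within this distance.
  const range = Math.max(0, Math.floor(Math.max(a.length, b.length) / 2) - 1);
  const aMatched = new Array<boolean>(a.length).fill(false);
  const bMatched = new Array<boolean>(b.length).fill(false);
  let matches = 0;
  for (let i = 0; i < a.length; i++) {
    const lo = Math.max(0, i - range);
    const hi = Math.min(b.length - 1, i + range);
    for (let j = lo; j <= hi; j++) {
      if (!bMatched[j] && a[i] === b[j]) {
        aMatched[i] = true;
        bMatched[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;
  // Count matched characters that appear in a different order.
  let outOfOrder = 0;
  let k = 0;
  for (let i = 0; i < a.length; i++) {
    if (!aMatched[i]) continue;
    while (!bMatched[k]) k++;
    if (a[i] !== b[k]) outOfOrder++;
    k++;
  }
  const transpositions = outOfOrder / 2;
  return (
    (matches / a.length +
      matches / b.length +
      (matches - transpositions) / matches) / 3
  );
}

/** Jaro-Winkler: Jaro plus a bonus for a shared prefix of up to 4 characters. */
function jaroWinkler(a: string, b: string, scaling = 0.1): number {
  const j = jaro(a, b);
  const limit = Math.min(4, a.length, b.length);
  let prefix = 0;
  while (prefix < limit && a[prefix] === b[prefix]) prefix++;
  return j + prefix * scaling * (1 - j);
}

/** Hypothetical wrapper applying the 0.85 threshold from the text. */
function namesAgree(a: string, b: string, threshold = 0.85): boolean {
  return jaroWinkler(a.toLowerCase(), b.toLowerCase()) >= threshold;
}

console.log(jaroWinkler("MARTHA", "MARHTA").toFixed(4)); // classic test pair, ~0.9611
```

The prefix bonus is why this suits names: typos are more common late in a word than in its first few letters.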

The broker supply chain

I modelled the broker ecosystem as a directed graph and immediately saw why individual removals don't stick. 21 broker nodes, 26 directed edges. Voter records flow to LexisNexis, LexisNexis feeds Spokeo. Remove from Spokeo, the data reappears in 30 days when they refresh from upstream.

The graph shows you need to cut at the aggregator level. Max-flow/min-cut (Edmonds-Karp) finds the minimum set of edges to sever. Start with LexisNexis and Acxiom, not Spokeo and WhitePages. The graph is manually curated from Senate JEC reports, CPPA enforcement actions, and reverse engineering. Real broker agreements are bilateral and not public.
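To make the min-cut idea concrete, here is a small Edmonds-Karp sketch on a toy six-node graph. The nodes `records` and `search`, the edges, and the capacities are invented for illustration; they are not the curated 21-node graph, and the real module's types and API may differ.

```typescript
type BrokerEdge = { from: string; to: string; capacity: number };

/** Edmonds-Karp max-flow. Returns the flow value and the edges of a minimum cut. */
function minCut(edges: BrokerEdge[], source: string, sink: string) {
  const nodes: string[] = [];
  const indexOf = (label: string): number => {
    let i = nodes.indexOf(label);
    if (i < 0) {
      i = nodes.length;
      nodes.push(label);
    }
    return i;
  };
  const pairs = edges.map((e) => ({ u: indexOf(e.from), v: indexOf(e.to), edge: e }));
  const s = indexOf(source);
  const t = indexOf(sink);
  const n = nodes.length;
  // Residual capacity matrix; reverse entries start at 0.
  const residual: number[][] = nodes.map(() => new Array<number>(n).fill(0));
  for (const p of pairs) residual[p.u][p.v] += p.edge.capacity;

  // Breadth-first search over edges with spare capacity.
  // Returns each reached node's parent (-1 if unreached).
  const bfs = (): number[] => {
    const parent = new Array<number>(n).fill(-1);
    parent[s] = s;
    const queue = [s];
    for (let head = 0; head < queue.length; head++) {
      const u = queue[head];
      for (let v = 0; v < n; v++) {
        if (parent[v] === -1 && residual[u][v] > 0) {
          parent[v] = u;
          queue.push(v);
        }
      }
    }
    return parent;
  };

  let flow = 0;
  let parent = bfs();
  while (parent[t] !== -1) {
    // Bottleneck along the shortest augmenting path.
    let bottleneck = Infinity;
    for (let v = t; v !== s; v = parent[v]) {
      bottleneck = Math.min(bottleneck, residual[parent[v]][v]);
    }
    for (let v = t; v !== s; v = parent[v]) {
      residual[parent[v]][v] -= bottleneck;
      residual[v][parent[v]] += bottleneck;
    }
    flow += bottleneck;
    parent = bfs();
  }
  // The final search marks the source side; edges leaving it form the cut.
  const cut = pairs
    .filter((p) => parent[p.u] !== -1 && parent[p.v] === -1)
    .map((p) => p.edge);
  return { flow, cut };
}

// Toy graph: two aggregators feed two people-search sites.
const toyGraph: BrokerEdge[] = [
  { from: "records", to: "LexisNexis", capacity: 1 },
  { from: "records", to: "Acxiom", capacity: 1 },
  { from: "LexisNexis", to: "Spokeo", capacity: 1 },
  { from: "LexisNexis", to: "WhitePages", capacity: 1 },
  { from: "Acxiom", to: "Spokeo", capacity: 1 },
  { from: "Acxiom", to: "WhitePages", capacity: 1 },
  { from: "Spokeo", to: "search", capacity: 2 },
  { from: "WhitePages", to: "search", capacity: 2 },
];
const toyResult = minCut(toyGraph, "records", "search");
console.log(toyResult.flow, toyResult.cut.map((e) => `${e.from} -> ${e.to}`));
```

In this toy graph the cheapest cut is the two edges feeding the aggregators, not the four edges between aggregators and people-search sites. What the capacities should represent (removal cost, refresh rate, data volume) is a modelling choice the sketch leaves open.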

From the source

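Shannon entropy: H(X) = -Σ p(x) log₂ p(x)
packages/core/src/quantify/entropy.ts
// LN2 is defined outside this excerpt; assumed here to be the natural log of 2.
const LN2 = Math.LN2;
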
/** Shannon entropy of a probability distribution (bits).
 *  H(X) = -Σ p(x) log₂ p(x)
 *  Returns 0 for empty or degenerate distributions. */
export function shannonEntropy(probs: number[]): number {
  let h = 0;
  for (const p of probs) {
    if (p > 0 && p <= 1) h -= p * Math.log(p) / LN2;
  }
  return h;
}

/** Self-information (surprisal) of a specific value.
 *  I(x) = -log₂ p(x)
 *  A name shared by 1 in 10,000: ~13.3 bits.
 *  A name shared by 1 in 10: ~3.3 bits. */
export function selfInfo(frequency: number): number {
  if (frequency <= 0 || frequency >= 1) return 0;
  return -Math.log(frequency) / LN2;
}

Fellegi-Sunter field weight: log-likelihood ratio per field
packages/core/src/strategy/linkage.ts
/** Compute the Fellegi-Sunter linkage weight for a single field. */
export function fieldWeight(
  field: QIField,
  agrees: boolean,
  uOverride?: number
): FieldComparison {
  const m = M_PROB[field] ?? M_PROB.other;
  const u = uOverride ?? estimateFrequency(field);

  let weight: number;
  if (agrees) {
    // w(agree) = log₂(m / u)
    // agreement on a rare field gives a high positive weight
    weight = Math.log2(m / Math.max(u, 1e-15));
  } else {
    // w(disagree) = log₂((1-m) / (1-u))
    weight = Math.log2((1 - m) / Math.max(1 - u, 1e-15));
  }

  return { field, agrees, weight, mProb: m, uProb: u };
}

What it doesn't do

  • Automated scanning mostly doesn't work. Data brokers use Cloudflare to block Tor exits and automated requests. The scan returns 0 results for most targets. The scoring and planning still work if you feed JSON manually.
  • Census frequency data covers only the top 50 US surnames and 60 first names. The US Census publishes 162,000 surnames; I'm using a subset. Non-US populations fall back to heuristic estimates.
  • Correlation damping between quasi-identifiers is a heuristic. The real computation (conditional mutual information) requires the joint distribution P(Identity, QI_new, QIs_known), which we don't have.
  • Greedy removal ordering achieves at least 1 − 1/e ≈ 63% of optimal only if the exposure function is monotone submodular. Submodularity is not proven for this problem. Krause & Golovin (2014) give the guarantee in general, but our exposure function has correlation terms that may break it.
  • Social engineering feasibility scores are computed from QI coverage, not from empirical attack success rates. Directionally correct but not calibrated.
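
The greedy ordering in the list above can be sketched as follows. The `Exposure` callback, the broker labels, and the toy bit values are all hypothetical; the sketch only shows the pick-the-largest-drop loop and makes no claim about submodularity or about how degauss scores exposure.

```typescript
/** Hypothetical exposure function: bits still exposed given the brokers left. */
type Exposure = (remaining: ReadonlySet<string>) => number;

/** Repeatedly remove the broker whose removal lowers exposure the most. */
function greedyRemovalOrder(brokers: string[], exposure: Exposure): string[] {
  const remaining = new Set<string>(brokers);
  const order: string[] = [];
  while (remaining.size > 0) {
    const before = exposure(remaining);
    let best: string | undefined;
    let bestDrop = -Infinity;
    for (const b of brokers) {
      if (!remaining.has(b)) continue;
      // Try the removal, measure the marginal drop, then undo it.
      remaining.delete(b);
      const drop = before - exposure(remaining);
      remaining.add(b);
      if (drop > bestDrop) {
        bestDrop = drop;
        best = b;
      }
    }
    if (best === undefined) break;
    remaining.delete(best);
    order.push(best);
  }
  return order;
}

// Toy data: which fields each broker lists, and illustrative bits per field.
const listings: Record<string, string[]> = {
  A: ["name", "dob"],
  B: ["name"],
  C: ["zip"],
};
const fieldBits: Record<string, number> = { name: 8, dob: 15, zip: 13 };

// Toy exposure: sum of bits over the union of fields still listed somewhere.
const toyExposure: Exposure = (remaining) => {
  const fields = new Set<string>();
  remaining.forEach((b) => (listings[b] ?? []).forEach((f) => fields.add(f)));
  let total = 0;
  fields.forEach((f) => {
    total += fieldBits[f] ?? 0;
  });
  return total;
};

const removalOrder = greedyRemovalOrder(["A", "B", "C"], toyExposure);
console.log(removalOrder);
```

In this toy, removing B first gains nothing because A still lists the same name. That is the kind of interaction between removals that makes the optimality guarantee hard to establish.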

Stack

TypeScript. Zero runtime dependencies in core. Shannon entropy, min-entropy, Fellegi-Sunter record linkage, Jaro-Winkler string similarity, Edmonds-Karp max-flow, topological sort. CLI adds Tor routing and HTTP. GDPR Art 17, CCPA §1798.105, and UK DPA removal request generation.

Authorised use

Intended for measuring your own exposure, or another person’s with their consent. Aggregating personal data carries duties under UK GDPR even when each source is publicly available. Methodology at /scope.