How much does a WHOIS match prove? Measuring attribution in bits

Giuseppe Giona
Summary
  • Attribution signals have measurable information content: I(x) = -log₂ p(x), where p(x) is the probability of a coincidental match.
  • Two domains sharing Cloudflare nameservers: 2.3 bits (20% of domains use Cloudflare, per DNIB and 6sense). Two domains sharing a Google Analytics ID: ~24 bits (near-definitive).
  • Starting anonymity for the UK population: log₂(67M) ≈ 26 bits. Each signal subtracts from that total, weighted by confidence in the measurement.

Threat intelligence reports describe attribution confidence as “high”, “medium”, “low”. Two domains share a nameserver — medium confidence. Two domains share a registrant email — high confidence. These labels describe a feeling about the evidence, not a measurement of it.

I wanted actual numbers. Shannon's self-information gives them: I(x) = -log₂ p(x), where p(x) is the probability the match happened by coincidence. A 1-in-1000 coincidence carries ~10 bits. A coin flip carries 1.

26 bits of anonymity

Before any evidence, the suspect could be anyone in the UK. 67 million people. log₂(67,000,000) ≈ 26.0 bits. Each signal subtracts from that. Zero bits = identified.

The population changes everything. Same evidence against all 5.4 billion internet users? 32.3 bits — much harder. Against 2,000 UK immigration agencies? 11 bits. Three good signals and you're done.
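The three priors above can be reproduced with a one-line helper. This is a minimal sketch; the name priorBits is mine and is not taken from the trace codebase.

```typescript
// Prior anonymity in bits for a candidate population of a given size.
// priorBits is an illustrative name, not a function from the trace repo.
function priorBits(populationSize: number): number {
  if (!isFinite(populationSize) || populationSize < 1) {
    throw new RangeError('populationSize must be a finite number >= 1')
  }
  return Math.log(populationSize) / Math.LN2
}

// Population figures quoted in the text.
const ukPrior = priorBits(67_000_000)           // ≈ 26.0 bits
const internetPrior = priorBits(5_400_000_000)  // ≈ 32.3 bits
const agencyPrior = priorBits(2_000)            // ≈ 11.0 bits
```

Narrowing the candidate population before collecting any evidence is the cheapest way to remove bits: going from the whole UK to 2,000 agencies is worth about 15 bits on its own.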

What each signal is worth

I computed these from population base rates while building trace. The initial version had hardcoded constants — “WHOIS = 20 bits” — which turned out to be wrong in most cases. The table below shows the actual values.

| Signal | p(coincidence) | Bits | Source for base rate |
| --- | --- | --- | --- |
| Shared Cloudflare NS | 0.20 | 2.3 | 6sense, 2024: 20.11% DNS market share |
| Shared GoDaddy NS | 0.33 | 1.6 | 6sense, 2024: 33.13% DNS market share |
| Same registrar (GoDaddy) | 0.139 | 2.8 | DNIB Q3 2025: 52.5M of 378M domains |
| Same registrar (Namecheap) | 0.032 | 5.0 | DNIB Q3 2025: 11.9M of 378M domains |
| Shared dedicated IP | ≈1.3×10⁻⁶ | ~19.5 | ~500 domains/IP avg (arXiv:2111.00142); p = 500/378M |
| Shared CDN IP (Cloudflare) | | ~1.5 | Hardcoded estimate; 42M sites on Cloudflare (w3techs) |
| IP geolocation: London | 9M/67M | 2.9 | ONS population estimate |
| IP geolocation: Bradford | 540K/67M | 7.0 | ONS population estimate |
| Matching registrant email | 1/N | ~26 | Email is unique; equivalent to full identification in UK pop. |
| Shared Google Analytics ID | ≈1/N | ~24 | Intentional config; ~prior minus 2 bits (could be a company with multiple domains) |

Cloudflare nameservers: 2.3 bits. One in five domains would produce the same match. Barely worth recording.

A shared Google Analytics property ID, on the other hand, is ~24 bits. Someone deliberately configured both domains under the same GA account. That's not a coincidence.
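The computed rows of the table can be re-derived from the cited base rates. The sketch below does that for the rows where a count or share is given; the helper name selfInfoBits and the object layout are mine.

```typescript
// Self-information in bits: I(x) = -log2 p(x).
// selfInfoBits is an illustrative helper name.
function selfInfoBits(p: number): number {
  if (!(p > 0 && p <= 1)) throw new RangeError('p must be in (0, 1]')
  return -Math.log(p) / Math.LN2
}

const TOTAL_DOMAINS = 378e6   // DNIB Q3 2025
const UK_POPULATION = 67e6    // ONS estimate used in the text

// Bits per signal, computed from the base rates cited in the table.
const tableBits = {
  cloudflareNs: selfInfoBits(0.2011),                        // ≈ 2.3
  godaddyNs: selfInfoBits(0.3313),                           // ≈ 1.6
  godaddyRegistrar: selfInfoBits(52.5e6 / TOTAL_DOMAINS),    // ≈ 2.8
  namecheapRegistrar: selfInfoBits(11.9e6 / TOTAL_DOMAINS),  // ≈ 5.0
  dedicatedIp: selfInfoBits(500 / TOTAL_DOMAINS),            // ≈ 19.5
  london: selfInfoBits(9e6 / UK_POPULATION),                 // ≈ 2.9
  bradford: selfInfoBits(540e3 / UK_POPULATION),             // ≈ 7.0
}
```

The CDN IP, registrant email and Google Analytics rows are left out because the table gives them as estimates relative to the prior, not as base rates.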

The computation

The code computes these at runtime from lookup tables, not hardcoded constants. The registrar function:

// Self-information in bits: I(x) = -log₂ p(x)
function selfInfo(probability: number): number {
  // Guard: a probability outside (0, 1) is treated as carrying no information
  if (probability <= 0 || probability >= 1) return 0
  return -Math.log(probability) / Math.LN2
}

// Registrar market shares from DNIB Q3 2025
// Typed as Record<string, number> so lookups by arbitrary string compile in strict mode
const REGISTRAR_SHARE: Record<string, number> = {
  'godaddy': 0.139,    // 52.5M of 378M domains
  'namecheap': 0.032,  // 11.9M of 378M
  'cloudflare': 0.005,
  // ...
}

function registrarInfoGain(registrar: string): number {
  const share = REGISTRAR_SHARE[registrar.toLowerCase()]
  if (share) return selfInfo(share)
  return selfInfo(0.001)  // unknown registrar: assume a small share (~10 bits)
}

Nameservers work the same way. IPs are messier: a Cloudflare anycast IP means almost nothing (~1.5 bits), while a dedicated IP is a strong signal (~19.5 bits, using the table's p = 500/378M). The code checks the ASN first. If it's a known CDN (AS13335, AS20940, AS54113), the gain is negligible. Otherwise it branches on shared vs dedicated hosting.
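The ASN-first check could look roughly like the sketch below. It is not the code from the trace repo: the names ipInfoGain and Hosting are mine, and the shared-hosting branch (p = co-hosted domains / total domains) is an assumption, since the text does not say how that branch is scored. The CDN value and the dedicated-IP base rate come from the table.

```typescript
// ASNs the text names as known CDNs.
const CDN_ASNS: number[] = [13335, 20940, 54113]
const CDN_IP_BITS = 1.5               // hardcoded estimate from the table
const TOTAL_DOMAINS = 378e6           // DNIB Q3 2025
const AVG_DOMAINS_PER_IP = 500        // arXiv:2111.00142, as cited in the table

// Hypothetical input shape: how the caller classifies the hosting.
type Hosting =
  | { kind: 'dedicated' }
  | { kind: 'shared'; coHostedDomains: number }

function ipInfoGain(asn: number, hosting: Hosting): number {
  // 1. Known CDN: a shared anycast IP says almost nothing.
  if (CDN_ASNS.indexOf(asn) !== -1) return CDN_IP_BITS

  // 2. Otherwise branch on hosting type.
  //    Dedicated uses the table's p = 500 / 378M.
  //    Shared is an ASSUMPTION: p = co-hosted domains / total domains.
  const domains = hosting.kind === 'dedicated'
    ? AVG_DOMAINS_PER_IP
    : Math.max(1, hosting.coHostedDomains)
  const p = Math.min(1, domains / TOTAL_DOMAINS)
  return -Math.log(p) / Math.LN2
}

const cdnGain = ipInfoGain(13335, { kind: 'dedicated' })        // 1.5
const dedicatedGain = ipInfoGain(64500, { kind: 'dedicated' })  // ≈ 19.5
const sharedGain = ipInfoGain(64500, { kind: 'shared', coHostedDomains: 100000 })
```

AS64500 is a documentation-range ASN used here only as a stand-in for a non-CDN network. Under the stated assumption, a crowded shared host scores lower than a dedicated IP because more domains would match by coincidence.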

Adding it up

Evidence subtracts bits from the prior. UK investigation, 26 bits. The suspect is in Bradford: that's 7 bits gone, 19 remaining. They registered through a small registrar — another 10. Down to 9. Then a matching email shows up and the rest evaporates.
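The running subtraction in that example can be written out as a short sketch. The Evidence shape and the function name are illustrative, and clamping at zero is my choice so that surplus bits do not produce negative anonymity.

```typescript
// Illustrative shape for one piece of evidence.
interface Evidence {
  label: string
  bits: number
}

// Returns the anonymity remaining after each piece of evidence, in order.
// Clamped at zero: evidence beyond full identification adds nothing.
function remainingAnonymity(priorBits: number, evidence: Evidence[]): number[] {
  const trail: number[] = []
  let remaining = priorBits
  for (const item of evidence) {
    remaining = Math.max(0, remaining - item.bits)
    trail.push(remaining)
  }
  return trail
}

// The UK example from the text, using its rounded values.
const ukTrail = remainingAnonymity(26, [
  { label: 'IP geolocation: Bradford', bits: 7 },
  { label: 'small registrar (unknown share)', bits: 10 },
  { label: 'matching registrant email', bits: 26 },
])
// ukTrail is [19, 9, 0]
```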

Figure 1: Bits of anonymity remaining as evidence accumulates (ordering matters).
[Line chart. x-axis: evidence sources collected (0 to 4). y-axis: remaining anonymity in bits (0 to 25). Reference lines: prior, log₂(UK pop) ≈ 26 bits; identification threshold, 1 bit.]
Shannon-optimal path: GA ID match (24 bits), then WHOIS email (20 bits). Identified in 2 steps; total bits = 44.
Naive ordering: (1) CF nameservers, 2.3 bits, 20% of domains; (2) GoDaddy registrar, 2.8 bits, 13.9% of .com; (3) CF-shared IP, 1.5 bits, co-location, weak; (4) WHOIS email, 20 bits, the same big hit, arriving late. Identified in 4 steps; total bits = 26.6.
Anonymity set size is 2^bits. The prior is the UK population: log₂(67M) ≈ 26 bits. Both paths consume real evidence pools whose information content is set by the population base rates cited on the project page (Cloudflare share 20% → 2.3 bits, GoDaddy 13.9% → 2.8 bits, GA ID near-unique → ≈24 bits, registrant email near-unique → ≈20 bits). Order matters: the Shannon-optimal path reaches the identification threshold in 2 steps; the naive path takes 4, because its first three signals remove only 6.6 bits between them. trace orders evidence by information gain descending; that’s the algorithmic reason it produces tighter reports than weighted-average resolvers.
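The effect of ordering can be checked with a small sketch. The function names are mine, and this is not the ordering code from trace; it only counts how many signals are consumed before remaining anonymity drops to the 1-bit threshold shown in the figure.

```typescript
// Number of signals consumed before remaining anonymity reaches the threshold,
// or null if the evidence runs out first.
function stepsToIdentify(
  priorBits: number,
  gains: number[],
  thresholdBits: number = 1,
): number | null {
  let remaining = priorBits
  let steps = 0
  for (const gain of gains) {
    remaining -= gain
    steps += 1
    if (remaining <= thresholdBits) return steps
  }
  return null
}

// Sort a copy of the gains, largest first.
function byGainDescending(gains: number[]): number[] {
  return gains.slice().sort((a, b) => b - a)
}

const naivePool = [2.3, 2.8, 1.5, 20]  // figure: CF NS, GoDaddy, CF IP, WHOIS email
const optimalPool = [24, 20]           // figure: GA ID, WHOIS email

const naiveSteps = stepsToIdentify(26, naivePool)                       // 4
const optimalSteps = stepsToIdentify(26, optimalPool)                   // 2
const sortedNaiveSteps = stepsToIdentify(26, byGainDescending(naivePool))  // 3
```

Sorting the naive pool by gain saves one step on the same evidence; the two-step path in the figure additionally relies on having the GA ID match available.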

But this is the naive sum, and it has a problem. City and ASN are correlated: a Bradford IP usually resolves to a Bradford ISP, so counting both double-counts. I don't have the joint distribution to correct for this (nobody does, publicly), so the implementation treats the sum as an upper bound on information gained. It overestimates how much anonymity has been removed, which means it underestimates remaining anonymity. For an investigator, that's the riskier direction to be wrong in, so the total should be read as a ceiling rather than an estimate.
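A toy calculation shows the size of the overcount. The Bradford base rate comes from the table; the two ASN probabilities are invented purely for illustration, since no public joint distribution exists.

```typescript
// Self-information in bits: I(x) = -log2 p(x).
function infoBits(p: number): number {
  if (!(p > 0 && p <= 1)) throw new RangeError('p must be in (0, 1]')
  return -Math.log(p) / Math.LN2
}

const pCity = 540e3 / 67e6  // Bradford share of the UK population (from the table)
const pAsn = 0.01           // HYPOTHETICAL: marginal share of one regional ISP
const pAsnGivenCity = 0.5   // HYPOTHETICAL: share of Bradford IPs on that ISP

// Naive sum treats the two signals as independent.
const naiveSum = infoBits(pCity) + infoBits(pAsn)

// Joint information uses the conditional: I(city, asn) = I(city) + I(asn | city).
const jointBits = infoBits(pCity * pAsnGivenCity)

// Bits the naive sum claims but the evidence does not actually carry.
const overcount = naiveSum - jointBits
```

With these made-up inputs the naive sum is about 13.6 bits while the joint figure is about 8.0, so more than five of the claimed bits are double-counted. Whenever the second signal is more likely given the first, the naive sum can only err upward.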

Not every signal is equally trustworthy, either. WHOIS registrant data is accurate 92% of the time when it's visible (ICANN Accuracy Reporting System Phase 2, January 2018, the last report before GDPR paused the programme). But 73% of gTLD domains now have redacted registrant data (WhoisXML API analysis, published 2023). So the effective contribution depends on whether the domain is among the 27% whose registrant data is still visible.

IP geolocation is similar. MaxMind's GeoIP2 comparison tool reports 99.8% accuracy at country level, ~66% at city level within 50km (US addresses). I use ip-api.com, which doesn't publish accuracy numbers, so I set city confidence to 0.60 — a guess, but a conservative one. Each signal's contribution is bits × confidence. A 7-bit city at 0.60 confidence contributes 4.2 effective bits.
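The weighting rule is a single multiplication. A minimal sketch, with an illustrative function name:

```typescript
// Effective contribution of a signal: bits × confidence, with confidence in [0, 1].
// effectiveBits is an illustrative name, not taken from the trace repo.
function effectiveBits(bits: number, confidence: number): number {
  if (!(confidence >= 0 && confidence <= 1)) {
    throw new RangeError('confidence must be in [0, 1]')
  }
  return bits * confidence
}

// City-level geolocation from the text: 7 bits at 0.60 confidence.
const bradfordEffective = effectiveBits(7, 0.60)  // ≈ 4.2
```

A confidence of 1 leaves the signal untouched and a confidence of 0 discards it, so a poorly sourced signal can never contribute more than its raw bit value.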

What this does not account for

I already mentioned the independence problem. City and ASN correlate. Registrar and nameserver provider correlate. The naive sum overcounts. You'd need the full joint distribution to correct for this, and that data doesn't exist publicly. So the tool treats the sum as an upper bound and says so in the output.

These base rates also assume an unsophisticated target. Someone using a VPN, a privacy registrar, and throwaway infrastructure won't match GoDaddy's 13.9% market share. The tool reports what the evidence says. Against a careful adversary, the evidence might say very little.

One more: WHOIS historical snapshots from before May 2018 often contain registrant data that's since been redacted. I use a reliability of 0.85 for historical records — the data was probably accurate when collected, but people change registrars.

Information gain computation and anonymity quantification: trace repo under packages/collectors/src/information-gain.ts and packages/core/src/entropy/anonymity.ts. Population base rates sourced from DNIB, 6sense, ICANN ARS, MaxMind, and WhoisXML API. Full source list in research/information-gain.md.