How much does a WHOIS match prove? Measuring attribution in bits
- Attribution signals have measurable information content: I(x) = -log₂ p(x), where p(x) is the probability of a coincidental match.
- Two domains sharing Cloudflare nameservers: 2.3 bits (20% of domains use Cloudflare; DNIB, 6sense). Two domains sharing a Google Analytics ID: ~24 bits (near-definitive).
- Starting anonymity for the UK population: log₂(67M) ≈ 26 bits. Each signal subtracts from that total, weighted by confidence in the measurement.
Threat intelligence reports describe attribution confidence as “high”, “medium”, or “low”. Two domains share a nameserver: medium confidence. Two domains share a registrant email: high confidence. These labels describe a feeling about the evidence, not a measurement of it.
I wanted actual numbers. Shannon's self-information gives them: I(x) = -log₂ p(x), where p(x) is the probability the match happened by coincidence. A 1-in-1000 coincidence carries ~10 bits. A coin flip carries 1.
26 bits of anonymity
Before any evidence, the suspect could be anyone in the UK. 67 million people. log₂(67,000,000) ≈ 26.0 bits. Each signal subtracts from that. Zero bits = identified.
The population changes everything. Same evidence against all 5.4 billion internet users? 32.3 bits — much harder. Against 2,000 UK immigration agencies? 11 bits. Three good signals and you're done.
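The three priors above are each just log₂ of the candidate pool. A minimal sketch of that arithmetic follows; `priorBits` and the `populations` list are hypothetical names used for illustration, not taken from the trace source.

```typescript
// Prior anonymity in bits: log2 of the number of candidates.
// `priorBits` is a hypothetical helper name, not part of trace.
function priorBits(populationSize: number): number {
  if (!Number.isFinite(populationSize) || populationSize < 1) {
    throw new RangeError('population size must be a finite number >= 1')
  }
  return Math.log2(populationSize)
}

// Population sizes as quoted in the text.
const populations = [
  { label: 'UK population', size: 67_000_000 },
  { label: 'Internet users', size: 5_400_000_000 },
  { label: 'UK immigration agencies', size: 2_000 },
]

for (const { label, size } of populations) {
  console.log(`${label}: ${priorBits(size).toFixed(1)} bits`)
}
// UK population: 26.0 bits
// Internet users: 32.3 bits
// UK immigration agencies: 11.0 bits
```

Shrinking the candidate pool from the whole UK to 2,000 agencies removes about 15 bits before any evidence is collected, which is why the choice of population matters as much as the signals.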
What each signal is worth
I computed these from population base rates while building trace. The initial version had hardcoded constants — “WHOIS = 20 bits” — which turned out to be wrong in most cases. The table below shows the actual values.
| Signal | p(coincidence) | Bits | Source for base rate |
|---|---|---|---|
| Shared Cloudflare NS | 0.20 | 2.3 | 6sense, 2024: 20.11% DNS market share |
| Shared GoDaddy NS | 0.33 | 1.6 | 6sense, 2024: 33.13% DNS market share |
| Same registrar (GoDaddy) | 0.139 | 2.8 | DNIB Q3 2025: 52.5M of 378M domains |
| Same registrar (Namecheap) | 0.032 | 5.0 | DNIB Q3 2025: 11.9M of 378M domains |
| Shared dedicated IP | ≈1.3×10⁻⁶ | ~19.5 | ~500 domains/IP avg (arXiv:2111.00142); p = 500/378M |
| Shared CDN IP (Cloudflare) | — | ~1.5 | Hardcoded estimate; 42M sites on Cloudflare (w3techs) |
| IP geolocation: London | 9M/67M | 2.9 | ONS population estimate |
| IP geolocation: Bradford | 540K/67M | 7.0 | ONS population estimate |
| Matching registrant email | 1/N | ~26 | Email is unique; equivalent to full identification in UK pop. |
| Shared Google Analytics ID | ≈1/N | ~24 | Intentional config; ~prior minus 2 bits (could be a company with multiple domains) |
Cloudflare nameservers: 2.3 bits. One in five domains would produce the same match. Barely worth recording.
A shared Google Analytics property ID, on the other hand, is ~24 bits. Someone deliberately configured both domains under the same GA account. That's not a coincidence.
The computation
The code computes these at runtime from lookup tables, not hardcoded constants. The registrar function:
```typescript
// I(x) = -log₂ p(x)
function selfInfo(probability: number): number {
  // Out-of-range probabilities are treated as carrying no information
  if (probability <= 0 || probability >= 1) return 0
  return -Math.log(probability) / Math.LN2
}

// Registrar market shares from DNIB Q3 2025
const REGISTRAR_SHARE: Record<string, number> = {
  'godaddy': 0.139, // 52.5M of 378M domains
  'namecheap': 0.032, // 11.9M of 378M
  'cloudflare': 0.005,
  // ...
}

function registrarInfoGain(registrar: string): number {
  const share = REGISTRAR_SHARE[registrar.toLowerCase()]
  if (share) return selfInfo(share)
  return selfInfo(0.001) // unknown = assume small
}
```

Nameservers work the same way. IPs are messier: a Cloudflare anycast IP means nothing (~1.5 bits), but a dedicated IP is almost as strong as an email match (~25 bits). The code checks the ASN first. If it's a known CDN (AS13335, AS20940, AS54113), the gain is negligible. Otherwise it branches on shared vs dedicated hosting.
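The ASN-first check could look roughly like the sketch below. It is an illustration under stated assumptions, not the trace implementation: `ipInfoGain` and its `domainsOnIp` parameter are hypothetical, and instead of a hard shared/dedicated branch it uses the table's formula p = domains on the IP / 378M, so the 500-domain average from the table gives about 19.5 bits. `selfInfo` is repeated so the block stands alone.

```typescript
// Total registered domains, from the DNIB Q3 2025 figure quoted in the table.
const TOTAL_DOMAINS = 378_000_000

// CDN ASNs named in the text; an IP inside one of these says almost nothing.
const CDN_ASNS = new Set<number>([13335, 20940, 54113])

// Hardcoded estimate for a shared CDN IP, as listed in the table.
const CDN_IP_BITS = 1.5

// I(x) = -log₂ p(x), same guard as the registrar code.
function selfInfo(probability: number): number {
  if (probability <= 0 || probability >= 1) return 0
  return -Math.log(probability) / Math.LN2
}

// Hypothetical signature: `domainsOnIp` is how many domains resolve to the IP.
function ipInfoGain(asn: number, domainsOnIp: number): number {
  // Check the ASN first: known CDNs get the flat, negligible value.
  if (CDN_ASNS.has(asn)) return CDN_IP_BITS
  // Assumption: p(coincidence) is the share of all domains hosted on this IP.
  return selfInfo(Math.max(domainsOnIp, 1) / TOTAL_DOMAINS)
}

// AS64500 is from the range reserved for documentation examples.
console.log(ipInfoGain(13335, 1_000_000)) // 1.5
console.log(ipInfoGain(64500, 500).toFixed(1)) // "19.5"
```

Under this assumption the gain falls smoothly as more domains share the address, so a crowded shared host ends up close to the CDN case without needing a separate branch.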
Adding it up
Evidence subtracts bits from the prior. UK investigation, 26 bits. The suspect is in Bradford: that's 7 bits gone, 19 remaining. They registered through a small registrar — another 10. Down to 9. Then a matching email shows up and the rest evaporates.
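The running tally in that example is plain subtraction with a floor at zero. A minimal sketch, using the rounded figures from the paragraph above; `remainingAnonymity` and the `Signal` shape are hypothetical names, not the trace API.

```typescript
// Hypothetical shape for one piece of evidence.
interface Signal {
  label: string
  bits: number
}

// Returns the remaining anonymity after each signal, clamped at zero
// (zero bits = identified).
function remainingAnonymity(priorBits: number, signals: Signal[]): number[] {
  const remaining: number[] = []
  let current = priorBits
  for (const signal of signals) {
    current = Math.max(0, current - signal.bits)
    remaining.push(current)
  }
  return remaining
}

const trail = remainingAnonymity(26, [
  { label: 'IP geolocation: Bradford', bits: 7 },
  { label: 'Small registrar', bits: 10 },
  { label: 'Matching registrant email', bits: 26 },
])
console.log(trail) // [ 19, 9, 0 ]
```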
Figure: two evidence orderings compared against the same prior. The prior is the UK population: log₂(67M) ≈ 26 bits. Both paths consume real evidence pools whose information content is set by the population base rates cited on the project page (Cloudflare share 20% → 2.3 bits, GoDaddy 13.9% → 2.8 bits, GA ID near-unique → ≈24 bits, registrant email near-unique → ≈20 bits). Order matters: the Shannon-optimal path reaches the identification threshold in 2 steps; the naive path takes 4, even though the cumulative bits collected on the way there are larger. trace orders evidence by information gain descending; that’s the algorithmic reason it produces tighter reports than weighted-average resolvers.

But this is the naive sum, and it has a problem. City and ASN are correlated: a Bradford IP usually resolves to a Bradford ISP. Counting both double-counts. I don't have the joint distribution to correct for this (nobody does, publicly), so the implementation treats the sum as an upper bound on information gained. It overestimates how much anonymity has been removed, which means it underestimates remaining anonymity. For an investigator, that's the safer direction to be wrong in.
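The 2-steps-versus-4 claim about ordering can be checked with a few lines. This is a sketch under assumptions: the bit values are the ones quoted for the four evidence pools, the naive order (Cloudflare, GoDaddy, email, GA ID) is a guess at what the comparison uses, the threshold is taken to be zero remaining bits, and `stepsToIdentify` is a hypothetical name. It uses the same naive sum, so the correlation caveat applies here too.

```typescript
// Counts how many signals are consumed before remaining anonymity hits zero.
// Returns -1 if the evidence runs out first.
function stepsToIdentify(priorBits: number, gains: number[]): number {
  let remaining = priorBits
  let steps = 0
  for (const gain of gains) {
    steps += 1
    remaining -= gain
    if (remaining <= 0) return steps
  }
  return -1
}

const PRIOR = 26 // log₂(67M), rounded

// Assumed naive order: Cloudflare NS, GoDaddy registrar, email, GA ID.
const asCollected = [2.3, 2.8, 20, 24]

// Information gain descending, the order the text says trace uses.
const byGain = asCollected.slice().sort((a, b) => b - a)

console.log(stepsToIdentify(PRIOR, asCollected)) // 4
console.log(stepsToIdentify(PRIOR, byGain)) // 2
```

Sorting does not change the total available evidence, only how soon the threshold is crossed, which is what keeps the report short.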
Not every signal is equally trustworthy, either. WHOIS registrant data is accurate 92% of the time when it's visible (ICANN Accuracy Reporting System Phase 2, January 2018; the last report before GDPR paused the programme). But 73% of gTLD domains now have redacted registrant data (WhoisXML API analysis, published 2023). So the signal only contributes when the domain falls in the 27% whose registrant data is still visible.
IP geolocation is similar. MaxMind's GeoIP2 comparison tool reports 99.8% accuracy at country level, ~66% at city level within 50km (US addresses). I use ip-api.com, which doesn't publish accuracy numbers, so I set city confidence to 0.60 — a guess, but a conservative one. Each signal's contribution is bits × confidence. A 7-bit city at 0.60 confidence contributes 4.2 effective bits.
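The confidence weighting is a single multiplication. A minimal sketch using the city figures from the paragraph above; `effectiveBits` and the `WeightedSignal` shape are hypothetical names, and clamping confidence to [0, 1] is an added safeguard rather than something the text describes.

```typescript
// Hypothetical shape: raw information content plus measurement confidence.
interface WeightedSignal {
  label: string
  bits: number
  confidence: number // 0 = worthless measurement, 1 = fully trusted
}

// Effective contribution = bits × confidence.
function effectiveBits(signal: WeightedSignal): number {
  const confidence = Math.min(1, Math.max(0, signal.confidence))
  return signal.bits * confidence
}

const city: WeightedSignal = {
  label: 'IP geolocation: Bradford',
  bits: 7,
  confidence: 0.6, // the conservative guess for ip-api.com city accuracy
}

console.log(effectiveBits(city).toFixed(1)) // "4.2"
```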
What this does not account for
I already mentioned the independence problem. City and ASN correlate. Registrar and nameserver provider correlate. The naive sum overcounts. You'd need the full joint distribution to correct for this, and that data doesn't exist publicly. So the tool treats the sum as an upper bound and says so in the output.
These base rates also assume an unsophisticated target. Someone using a VPN, a privacy registrar, and throwaway infrastructure won't match GoDaddy's 13.9% market share. The tool reports what the evidence says. Against a careful adversary, the evidence might say very little.
One more: WHOIS historical snapshots from before May 2018 often contain registrant data that's since been redacted. I use a reliability of 0.85 for historical records — the data was probably accurate when collected, but people change registrars.
Information gain computation and anonymity quantification: trace repo under packages/collectors/src/information-gain.ts and packages/core/src/entropy/anonymity.ts. Population base rates sourced from DNIB, 6sense, ICANN ARS, MaxMind, and WhoisXML API. Full source list in research/information-gain.md.