trace

Attribution investigation tool. Given a target domain, email, or set of reviews, trace collects signals from public data sources and quantifies how much anonymity the subject retains. I built it because I run businesses in a sector where fake reviews are common and I wanted the capability ready before I needed it — evidence that holds up in court, not a vague suspicion.

TypeScript · Dempster-Shafer · Shannon entropy · Fellegi-Sunter · Jaro-Winkler · RDAP (RFC 9082) · crt.sh · SHA-256 chain · RFC 3161

Source · 421 tests · 12 papers cited · 12 UK statutes

The problem

Someone leaves three one-star reviews on your business in 24 hours. Different accounts, different names. You suspect a competitor but you can't prove it. Google won't tell you who posted them. The CMA can investigate fake reviews under the DMCC Act 2024 but they need evidence. Your solicitor can apply for a Norwich Pharmacal order but the court needs to see that the underlying claim has substance.

trace produces that evidence. It collects signals from 8+ public data sources, fuses them using Dempster-Shafer theory, quantifies anonymity reduction in bits, and generates a forensic report that cites the applicable legislation and states the error rate for each module.

How attribution works

Each data source contributes information measured in bits. A matching registrant email contributes ~20 bits (near-unique). Shared Cloudflare nameservers contribute 2.3 bits (20% of all domains use Cloudflare, so the match is nearly meaningless). A matching Google Analytics ID contributes ~24 bits (unique per account). The anonymity set starts at log₂(67M) ≈ 26 bits for the UK population and shrinks with each observation.

When evidence conflicts (WHOIS says Company X but stylometry points to a different person), a weighted average can still report a misleading confidence such as 0.7. Dempster-Shafer detects the conflict (mass K) and reports high uncertainty instead of a false consensus. The report says "sources disagree, investigate further" rather than a number that hides the disagreement.
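The contrast can be sketched in a few lines. This is an illustration only: the `Mass` shape, the 0.75 weight, and the two example mass assignments are made-up values chosen to show the effect, not figures taken from trace.

```typescript
// Hypothetical three-way mass function; field names are illustrative.
interface Mass {
  attributed: number;
  notAttributed: number;
  uncertain: number;
}

// Conflict mass K: belief that one source puts on "attributed" while the
// other puts it on "not attributed", in both directions.
function conflictMass(a: Mass, b: Mass): number {
  return a.attributed * b.notAttributed + a.notAttributed * b.attributed;
}

// Naive alternative: weighted average of the "attributed" scores only.
function weightedAverage(a: Mass, b: Mass, weightA: number): number {
  return a.attributed * weightA + b.attributed * (1 - weightA);
}

// Made-up values: WHOIS points at Company X, stylometry points away from it.
const whois: Mass = { attributed: 0.9, notAttributed: 0.05, uncertain: 0.05 };
const stylometry: Mass = { attributed: 0.1, notAttributed: 0.8, uncertain: 0.1 };

const averaged = weightedAverage(whois, stylometry, 0.75);
const K = conflictMass(whois, stylometry);

console.log(averaged.toFixed(2)); // 0.70
console.log(K.toFixed(3)); // 0.725
```

The average looks like moderate confidence, while K shows that almost three quarters of the combined belief mass is in direct contradiction. That contradiction is the signal a report would surface.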

Figure 1: Bits of anonymity remaining as evidence accumulates (ordering matters)

Anonymity set size is 2^bits. The prior is the UK population: log₂(67M) ≈ 26 bits. Both paths consume real evidence pools whose information content is set by the population base rates cited on the project page (Cloudflare share 20% → 2.3 bits, GoDaddy 13.9% → 2.8 bits, GA ID near-unique → ≈24 bits, registrant email near-unique → ≈20 bits). Order matters: the Shannon-optimal path reaches the identification threshold in 2 steps; the naive path takes 4, even though the cumulative bits collected on the way there are larger. trace orders evidence by information gain descending; that’s the algorithmic reason it produces tighter reports than weighted-average resolvers.
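A minimal sketch of the ordering effect, assuming a threshold of less than 1 bit remaining and ignoring confidence weighting. The `Evidence` type and `stepsToThreshold` helper are hypothetical names, and the pool simply reuses the four bit values quoted above, so it does not reproduce the exact paths plotted in the figure.

```typescript
// Hypothetical evidence item: a label and its information content in bits.
interface Evidence {
  label: string;
  bits: number;
}

// Count how many observations are consumed before fewer than `thresholdBits`
// of anonymity remain. Returns null if the evidence runs out first.
function stepsToThreshold(
  priorBits: number,
  ordered: Evidence[],
  thresholdBits = 1,
): number | null {
  let remaining = priorBits;
  let steps = 0;
  for (const item of ordered) {
    steps += 1;
    remaining = Math.max(0, remaining - item.bits);
    if (remaining < thresholdBits) return steps;
  }
  return null;
}

const priorBits = Math.log2(67e6); // about 26 bits for the UK population

// Pool listed in an arbitrary "as collected" order, weakest signals first.
const pool: Evidence[] = [
  { label: "Cloudflare nameservers", bits: 2.3 },
  { label: "GoDaddy registrar", bits: 2.8 },
  { label: "registrant email", bits: 20 },
  { label: "Google Analytics ID", bits: 24 },
];

// Information gain descending, as described above.
const byGain = [...pool].sort((a, b) => b.bits - a.bits);

const optimalSteps = stepsToThreshold(priorBits, byGain);
const arrivalSteps = stepsToThreshold(priorBits, pool);

console.log(optimalSteps); // 2
```

With this pool the arrival-order path needs more steps than the sorted one. The exact count for a naive path depends on the threshold and on the confidence weights, which this sketch leaves out.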

Seven attribution layers

Domain intelligence: WHOIS via RDAP (RFC 9082), reverse WHOIS, historical records.
Certificate transparency: crt.sh, 14 billion certificates.
DNS: all record types, nameserver correlation, verification token extraction.
HTTP headers: platform fingerprinting, tracking ID extraction.
Email forensics: RFC 5322 header parsing, routing chain, authentication results.
IP geolocation: country, city, ASN, hosting/VPN classification.
Writing analysis: stylometric features, AI text detection, authorship comparison.

Cross-domain correlation links domains through shared infrastructure. The strength of each link is weighted by the inverse frequency of the shared attribute: the rarer the attribute, the stronger the link. Two domains sharing a Cloudflare IP score 1.5 bits. Two domains sharing a dedicated IP score 19.5 bits. Two domains sharing a Google Analytics ID score 24 bits.
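One way to picture inverse-frequency weighting is the sketch below. It rests on assumptions: `linkBits` is a hypothetical helper, summing bits across shared attributes treats them as independent, and the frequencies are back-derived from the bit values quoted above rather than measured.

```typescript
// Link strength in bits: sum of -log2(frequency) over the attribute values
// both domains share. Attributes with no known frequency are skipped.
function linkBits(
  a: string[],
  b: string[],
  frequency: Record<string, number>,
): number {
  let bits = 0;
  for (const value of a) {
    const f = frequency[value];
    if (b.indexOf(value) !== -1 && f !== undefined && f > 0) {
      bits += -Math.log2(f);
    }
  }
  return bits;
}

// Illustrative frequencies, back-derived from 1.5 bits and 24 bits.
const frequency: Record<string, number> = {
  "ip:shared-cloudflare": Math.pow(2, -1.5),
  "ga:EXAMPLE-ID": Math.pow(2, -24),
};

// Placeholder domains and attribute labels.
const domainA = ["ip:shared-cloudflare", "ga:EXAMPLE-ID"];
const domainB = ["ip:shared-cloudflare", "ga:EXAMPLE-ID"];
const domainC = ["ip:shared-cloudflare"];

const strongLink = linkBits(domainA, domainB, frequency); // about 25.5 bits
const weakLink = linkBits(domainA, domainC, frequency); // about 1.5 bits
```

A shared analytics ID dominates the score, while the shared Cloudflare address adds almost nothing. That is the behaviour the weighting is meant to produce.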

Calibration from published data

Every reliability parameter cites the study it was derived from. WHOIS reliability (0.92) comes from ICANN's Accuracy Reporting System Phase 2, 2018. CT log completeness (0.87) comes from Li et al., CCS 2019. IP geolocation accuracy (0.95 country, 0.60 city) comes from MaxMind's published comparison tool. Stylometry confidence scales with text length: 0.75 at 200+ words (Abbasi & Chen 2008), down to 0.15 below 50 words (literature consensus).

Information gain values are computed from population base rates. 378.5 million registered domains (DNIB Q3 2025). GoDaddy holds 13.9% of .com domains — knowing the registrar is GoDaddy gives 2.8 bits. Namecheap holds 3.2% — 5.0 bits. A small registrar at 0.1% gives 10.0 bits.
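The registrar figures follow from the self-information formula, bits = -log₂(share). A small sketch, where `bitsFromShare` is a hypothetical helper name rather than a function taken from the repository:

```typescript
// Self-information, in bits, of observing an attribute that is held by
// `share` of the population (0 < share <= 1).
function bitsFromShare(share: number): number {
  if (!(share > 0 && share <= 1)) {
    throw new RangeError("share must be in (0, 1]");
  }
  return -Math.log2(share);
}

// Shares quoted in the paragraph above.
const registrarShares: Array<[string, number]> = [
  ["GoDaddy", 0.139],
  ["Namecheap", 0.032],
  ["small registrar", 0.001],
];

for (const [name, share] of registrarShares) {
  console.log(`${name}: ${bitsFromShare(share).toFixed(1)} bits`);
}
// GoDaddy: 2.8 bits
// Namecheap: 5.0 bits
// small registrar: 10.0 bits
```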

Evidence that holds up

Every investigation produces a SHA-256 hash chain following the Berkeley Protocol on Digital Open Source Investigations (OHCHR, 2020). Each entry records: timestamp, content hash, source, description, and a chain hash linking to the previous entry. Altering any entry invalidates every subsequent hash.
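A hash chain of this shape can be sketched as follows. This is not the repository's implementation: the field names, the all-zero genesis value, and the JSON serialisation of the hashed fields are assumptions made for illustration. It uses Node's built-in `crypto` module.

```typescript
import { createHash } from "node:crypto";

// Hypothetical entry layout mirroring the fields listed above.
interface ChainEntry {
  timestamp: string;
  contentHash: string;
  source: string;
  description: string;
  chainHash: string;
}

const GENESIS = "0".repeat(64); // assumed placeholder for "no previous entry"

function sha256(input: string): string {
  return createHash("sha256").update(input, "utf8").digest("hex");
}

// The chain hash covers the previous chain hash plus every field of the entry.
function chainHashFor(prev: string, e: Omit<ChainEntry, "chainHash">): string {
  return sha256(
    JSON.stringify([prev, e.timestamp, e.contentHash, e.source, e.description]),
  );
}

function append(
  chain: ChainEntry[],
  content: string,
  source: string,
  description: string,
  timestamp: string = new Date().toISOString(),
): ChainEntry[] {
  const prev = chain.length > 0 ? chain[chain.length - 1].chainHash : GENESIS;
  const partial = { timestamp, contentHash: sha256(content), source, description };
  return [...chain, { ...partial, chainHash: chainHashFor(prev, partial) }];
}

// Recompute every link; the first mismatch means the chain was altered.
function verify(chain: ChainEntry[]): boolean {
  let prev = GENESIS;
  for (const entry of chain) {
    const { chainHash, ...rest } = entry;
    if (chainHashFor(prev, rest) !== chainHash) return false;
    prev = chainHash;
  }
  return true;
}

let chain: ChainEntry[] = [];
chain = append(chain, "<rdap response body>", "rdap", "RDAP lookup", "2025-01-01T00:00:00.000Z");
chain = append(chain, "<crt.sh response body>", "crt.sh", "Certificate search", "2025-01-01T00:00:05.000Z");

// Editing an earlier entry without recomputing hashes breaks verification.
const tampered = chain.map((e, i) => (i === 0 ? { ...e, description: "edited" } : e));

console.log(verify(chain)); // true
console.log(verify(tampered)); // false
```

Recomputing the tampered entry's own hash does not help an attacker, because the next entry's chain hash was derived from the original value.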

Independent verification uses three methods: dual-source DNS (same query to Cloudflare 1.1.1.1 and Google 8.8.8.8 — if both agree, fabrication is implausible), RFC 3161 trusted timestamps from FreeTSA.org, and Wayback Machine archival. The forensic report includes ACPO alignment assessment and error rates for each module, per Criminal Practice Direction 19A (2014).
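The dual-source DNS check could look roughly like the sketch below, which uses Node's `dns/promises` `Resolver` pinned to one upstream server per query. The function names and the choice to compare only A records are assumptions for illustration.

```typescript
import { Resolver } from "node:dns/promises";

// Order-insensitive comparison of two answer sets.
function recordsAgree(a: string[], b: string[]): boolean {
  if (a.length !== b.length) return false;
  const sortedA = [...a].sort();
  const sortedB = [...b].sort();
  return sortedA.every((value, i) => value === sortedB[i]);
}

// Ask one specific upstream resolver for the A records of a hostname.
async function resolveVia(server: string, hostname: string): Promise<string[]> {
  const resolver = new Resolver();
  resolver.setServers([server]);
  return resolver.resolve4(hostname);
}

// Send the same query to Cloudflare and Google and record whether they agree.
async function dualSourceLookup(hostname: string) {
  const [cloudflare, google] = await Promise.all([
    resolveVia("1.1.1.1", hostname),
    resolveVia("8.8.8.8", hostname),
  ]);
  return { cloudflare, google, agree: recordsAgree(cloudflare, google) };
}
```

Calling `dualSourceLookup("example.com")` needs network access. Hosts behind geo-aware or load-balanced DNS can legitimately return different answers to the two resolvers, so a disagreement is a prompt to re-check rather than proof of tampering.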

From the source

Dempster's combination rule: conflict mass K normalises the result
packages/core/src/fusion/dempster-shafer.ts

export function combine(m1: MassFunction, m2: MassFunction): MassFunction {
  // Pairwise products of the two sources' masses (a = attributed,
  // n = not attributed, u = uncertain).
  const aa = m1.attributed * m2.attributed
  const au = m1.attributed * m2.uncertain
  const ua = m1.uncertain * m2.attributed
  const nn = m1.not_attributed * m2.not_attributed
  const nu = m1.not_attributed * m2.uncertain
  const un = m1.uncertain * m2.not_attributed
  const uu = m1.uncertain * m2.uncertain

  // Conflict mass K: one source says attributed, the other says not.
  const K = m1.attributed * m2.not_attributed
    + m1.not_attributed * m2.attributed
  const norm = 1 - K

  // Total conflict: nothing left to normalise, so report full uncertainty.
  if (norm <= 0) {
    return { attributed: 0, not_attributed: 0, uncertain: 1,
      source: `${m1.source}+${m2.source}` }
  }

  // Redistribute the non-conflicting mass so the three values sum to 1.
  return {
    attributed: (aa + au + ua) / norm,
    not_attributed: (nn + nu + un) / norm,
    uncertain: uu / norm,
    source: `${m1.source}+${m2.source}`,
  }
}

Anonymity quantification: information gain reduces the suspect set
packages/core/src/entropy/anonymity.ts

export function computeAnonymity(
  population: number,
  evidence: EvidenceItem[],
  failedCollectors: FailedCollector[] = [],
): AnonymityAssessment {
  // Prior anonymity in bits for the starting population.
  const priorBits = priorAnonymity(population)

  // Each item's information gain is discounted by its confidence.
  const totalGainBits = evidence.reduce(
    (sum, e) => sum + e.informationGain * e.confidence, 0,
  )

  // Remaining anonymity cannot go below zero bits.
  const remainingBits = Math.max(0, priorBits - totalGainBits)
  const anonymitySet = anonymitySetSize(remainingBits)

  return {
    priorBits, totalGainBits, remainingBits,
    // In this excerpt the upper bounds equal the point estimates.
    remainingUpper: remainingBits,
    anonymitySet,
    anonymitySetUpper: anonymitySetSize(remainingBits),
    population,
    // Breakdown sorted by confidence-weighted gain, largest first.
    breakdown: [...evidence].sort(
      (a, b) => b.informationGain * b.confidence
        - a.informationGain * a.confidence,
    ),
    failedCollectors,
    // Identified when less than 1 bit of anonymity remains.
    identified: remainingBits < 1,
    complete: failedCollectors.length === 0,
  }
}

What it does not do

▲ Stylometry has not been benchmarked against a labeled dataset. Confidence intervals widen with shorter text. Below 50 words, results are unreliable and marked as such.
▲ AI text detection is statistical only, with no neural model. Industry tools (GPTZero, Originality.ai) achieve 88-92%. This detector scores lower. It flags indicators for further investigation, not determinations.
▲ Review suspicion heuristics are keyword-based. A competent attacker writes around them. The tool catches unsophisticated attacks, which are the majority.
▲ The evidence chain proves integrity (not altered after capture). Proving authenticity (data was real when captured) requires the independent verification modules: dual-source DNS and RFC 3161 timestamps.
▲ 73% of gTLD domains now have redacted WHOIS data (post-GDPR). When the registrant is hidden, the strongest attribution layer returns almost nothing. Historical WHOIS (pre-2018 snapshots) partially compensates.
▲ A sophisticated attacker using VPN, AI-generated text, purchased aged accounts, and no shared infrastructure between their real identity and the attack will not be attributed by this tool. The tool is honest about this in every report.

Stack

TypeScript. Zero runtime dependencies in core. Dempster-Shafer evidence fusion, Shannon entropy, Fellegi-Sunter record linkage, Jaro-Winkler similarity, spectral graph clustering, Kolmogorov-Smirnov timing analysis, Writeprints stylometric features, Jensen-Shannon divergence. RDAP (RFC 9082), raw WHOIS, crt.sh, ip-api.com, archive.org. SHA-256 evidence chain. Dual-source DNS verification. RFC 3161 timestamps. 12 UK statutes cited in every forensic report.

Authorised use

Intended for investigations where you have a lawful basis to handle the underlying data. Outputs are bounded uncertainty estimates; on their own they are not an attribution to a named individual. Distribution under the LICENSE in the repository is not an invitation to commit an offence under section 3A of the Computer Misuse Act 1990. See /scope.