Writing

Same person, five data sources, zero ground truth

Giuseppe Giona·
What this covers
  • • Why comparing every pair of entities is too slow, and how a blocking index fixes it.
  • • Jaro-Winkler over Levenshtein — transpositions matter more than insertions for names.
  • • Dempster-Shafer evidence fusion — weighted averages can't say "I don't know."
  • • Real code from the threadr repo. Not pseudocode.

The problem

threadr runs 17 plugins against a seed (an email, a domain, a username). Each plugin returns nodes and edges — GitHub finds a user profile, HIBP finds breach records, DNS finds mail servers, Shodan finds open ports. After a scan, you have a graph with hundreds of entities from independent sources that know nothing about each other.

Some of those entities are the same real person. GitHub says "johndoe". A git commit says "[email protected]". Gravatar returns the same avatar hash for both. Are they the same person? Probably. But "probably" isn't an answer you can write to a graph database.

You need a number. And you need it to mean something.

Naive approach: compare everything

The obvious solution is to compare every person entity against every other. Five fields per entity (emails, usernames, names, phone numbers, avatar hashes), string similarity on each, weighted average. Simple.

1,000 person entities means 499,500 pairs. Each pair checks up to 5 fields, each field does string comparisons across lists. It's O(n²) and it's already slow at the scale of a single scan with a high-connectivity seed.

Blocking index

The insight: most pairs aren't worth comparing. Two people who share no emails, no usernames, no avatar hash, and no name fragments are almost certainly different. Only pairs that share at least one token are candidates.

Build an inverted index. For each person, extract blocking keys: exact email addresses, exact usernames, exact avatar hashes, and character bigrams from names. If two people share a key, they're candidates for full comparison. If they share nothing, skip them.

// from packages/shared/src/scoring.ts
export function blockingKeys(fields: EntityFields): string[] {
  const keys: string[] = []

  for (const email of fields.emails)
    keys.push(`e:${email.toLowerCase()}`)

  for (const username of fields.usernames)
    keys.push(`u:${username.toLowerCase()}`)

  if (fields.avatarHash)
    keys.push(`a:${fields.avatarHash}`)

  // Name bigrams catch fuzzy matches (typos, abbreviations)
  // "John Doe" → ["jo", "oh", "hn", "n ", " d", "do", "oe"]
  for (const name of fields.names) {
    const lower = name.toLowerCase()
    for (let i = 0; i < lower.length - 1; i++)
      keys.push(`b:${lower[i]}${lower[i + 1]}`)
  }
  return keys
}

Name bigrams are the interesting part. Exact email matches catch the easy cases. Bigrams catch "Jon Doe" vs "John Doe"— they share 5 of 7 bigrams, so they end up in the same candidate buckets.

One guard: buckets with more than 50 entities are noise. The bigram "th" matches half the English-speaking dataset. Those get dropped.

Result: 1,000 persons typically produces around 5,000 candidate pairs instead of 499,500. Two orders of magnitude.

Figure 1 — Candidate-pair count vs entity count (log-log)O(N²) vs O(N)
1001k10k100k1M10M100200500100020005000entity count N (log scale)candidate pairs (log scale)brute force · O(N²)blocking · O(N)≈ 100× gap at N=1000499,500 pairs5,000 pairs
Brute-force comparison is N(N − 1)/2 — quadratic, slope 2 on a log-log axis. Inverted-index blocking caps each bucket at ~50 entities and discards over-frequent bigrams (“th”, “er”); the remaining work scales empirically like 5N for the post’s sample — linear, slope 1. The cross-section at N=1000 is the post’s anchoring claim: 499,500 candidate pairs reduced to 5,000. Code in packages/shared/src/scoring.ts.

Why Jaro-Winkler

Once you have candidate pairs, you need to score the similarity of each field. Levenshtein distance is the default choice but it's wrong for this problem. Levenshtein counts insertions, deletions, and substitutions equally. But in OSINT data, the dominant error mode is transposition — "marhta" for "martha", "johndoe" for "jondoe".

Jaro similarity handles transpositions natively. Jaro-Winkler adds a prefix bonus because strings that start the same way are more likely to be variants of the same name. The maths:

// Jaro similarity:
// J = (1/3) × (m/|s1| + m/|s2| + (m - t/2)/m)
// where m = matching chars, t = transpositions
//
// Jaro-Winkler boost:
// JW = J + p × (1 - J)
// where p = common prefix length (max 4) × 0.1

export function jaroWinkler(s1: string, s2: string): number {
  const j = jaro(s1, s2)
  let prefix = 0
  for (let i = 0; i < Math.min(s1.length, s2.length, 4); i++) {
    if (s1[i] === s2[i]) prefix++
    else break
  }
  return j + prefix * 0.1 * (1 - j)
}

"martha" vs "marhta": Levenshtein gives distance 2 (swap t and h). Jaro-Winkler gives 0.961. The Winkler part pushes it higher because they share the prefix "mar".

Field weights and the single-field cap

Not all fields are equally informative. A matching email is near-certain evidence of the same person. A matching name is barely evidence at all — there are a lot of John Smiths.

const WEIGHTS: Record<string, number> = {
  email: 0.95,
  phone: 0.90,
  avatar: 0.80,
  username: 0.70,
  name: 0.40,
}

One thing I added after testing: if the comparison only has a single field in common (say two people share an email but we have nothing else to compare), the score is capped at 0.6 regardless of the weighted result. A single data point isn't enough for a confident merge. You need corroboration.

Where weighted averages break down

The weighted average works when every field has an opinion. It breaks when plugins disagree.

Say GitHub matches with 0.95 confidence on username. HIBP matches with 0.90 on email. But Gravatar says the avatars are completely different — 0.0 similarity. The weighted average blends these into some middle number. But the right answer isn't a middle number. The right answer is: "strong evidence for, strong evidence against, here's how much they conflict."

Weighted averages can't express uncertainty. They can't express conflict. They just produce a number that looks confident whether or not it should be.

Dempster-Shafer evidence fusion

Dempster-Shafer theory lets you work with three states instead of two. For each field comparison, the mass function distributes belief across:

  • {SAME} — evidence they're the same person
  • {DIFFERENT} — evidence they're different
  • {SAME, DIFFERENT} — the field can't tell (uncertainty)

The field's reliability controls how much mass is informative vs uncertain. A name match (reliability 0.35) puts 65% of the mass into uncertainty. An email match (reliability 0.92) puts only 8% into uncertainty. This is what the old weights were trying to express, but now it has proper probabilistic semantics:

export function createMass(
  similarity: number,
  fieldReliability: number,
  source: string,
): MassFunction {
  const informative = fieldReliability
  const same = informative * similarity
  const different = informative * (1 - similarity)
  const uncertain = 1 - informative
  return { same, different, uncertain, source }
}

export const FIELD_RELIABILITY: Record<string, number> = {
  email: 0.92,     // near-unique but can be shared (work inboxes)
  phone: 0.88,     // can be recycled or shared
  avatar: 0.75,    // same hash unlikely but same avatar ≠ same person
  username: 0.65,  // common usernames reduce reliability
  name: 0.35,      // "John Smith" is not evidence of anything
}

Dempster's combination rule fuses multiple mass functions. When two fields agree, belief accumulates. When they conflict — one says SAME, the other says DIFFERENT — the conflict mass K gets normalised out. High conflict means the evidence is contradictory and the result carries a warning:

// Dempster's combination rule
// For Θ = {SAME, DIFFERENT}:
// K = m1(SAME) × m2(DIFFERENT) + m1(DIFFERENT) × m2(SAME)
const K = m1.same * m2.different + m1.different * m2.same
const norm = 1 - K

// Total conflict → return max uncertainty
if (norm <= 0)
  return { same: 0, different: 0, uncertain: 1, source }

const same = (sameSame + sameUncertain + uncertainSame) / norm
const different = (diffDiff + diffUncertain + uncertainDiff) / norm
const uncertain = uncertainUncertain / norm

The output is a belief interval: Bel(SAME) to Pl(SAME). If belief is 0.87 and plausibility is 0.92, you have strong evidence. If belief is 0.3 and plausibility is 0.8, you have a wide interval — the data isn't conclusive.

The thresholds

The resolver uses two thresholds. Above 0.85: auto-merge, write a PROBABLY_IS edge with auto: true. Between 0.6 and 0.85: suggest the merge but don't auto-apply, flag it for human review. Below 0.3: skip entirely. The gap between 0.3 and 0.6 is intentional — that's the zone where the data is ambiguous enough that even flagging it would be noise.

// from apps/worker/src/resolver.ts
for (const [i, j] of pairs) {
  const breakdown = compareEntities(indexed[i].fields, indexed[j].fields)
  const score = computeScore(breakdown)

  if (score < 0.3) continue

  if (score >= 0.85) {
    await writeProbablyIs(persons[i].id, persons[j].id, score, true)
    merged++
  } else if (score >= 0.6) {
    await writeProbablyIs(persons[i].id, persons[j].id, score, false)
    suggested++
  }
}

The output is visible in the graph: auto-merged edges are solid, suggestions are dashed. The analyst can accept or reject them in the UI.

What I'd change

The 0.85 threshold is a guess. I picked it from testing against a few dozen manually-labelled pairs from real scans. It's not calibrated against a proper benchmark dataset because there isn't one for this specific problem — OSINT entity resolution across heterogeneous sources. The closest is the Febrl synthetic dataset for record linkage, but that's medical records, not internet personas.

The D-S fusion layer is implemented but not yet wired into the resolver's actual scoring path. The resolver still uses the weighted average with the single-field cap. The D-S code lives in packages/shared/src/analytics/dempsterShafer.ts and passes all its mathematical property tests (commutativity, associativity, vacuous mass identity, conflict normalisation), but integrating it into the live resolver changes the scoring behaviour and I haven't validated the new thresholds yet. That's next.

The blocking index also misses some cases. Two people who share no tokens at all (different email providers, different usernames, different name spellings) but are actually the same person won't become candidates. The blocking step is a precision/recall tradeoff — it heavily favours precision (few false candidates) over recall (catching every match). For OSINT work that's usually the right call, but it's a known gap.

Source: github.com/Giuseppe552/threadr