PDF Changer

21 browser-only PDF tools. Nothing leaves the browser. I built it because every online PDF tool I tested sends your files to a server, and most of them pipe your download tokens through Google Analytics.

TypeScript · React · pdf-lib · PDF.js · Tesseract.js · Web Crypto · Hono · Cloudflare Workers · D1 · WebAuthn
pdfchanger.org · Source · 21 tools · 299 tests · 29k lines

The core problem

I ran network captures on iLovePDF and Smallpdf. iLovePDF sends your download token to Google Analytics — so Google knows you merged a tax return. Smallpdf made 215 requests during a single merge operation. Both upload your files to their servers for processing.

PDF Changer processes everything client-side. The CSP header blocks all network access from the processing sandbox. If the code tries to phone home, the browser kills the request and the VPE monitor logs the attempt.
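As a rough illustration of the policy involved, the sketch below assembles a CSP header value from a directive map. Only `connect-src 'none'` is stated in this write-up; the other directives, the `SANDBOX_DIRECTIVES` name, and the `buildCsp` helper are assumptions made for the example, not the project's actual configuration.

```typescript
// Hypothetical directive set. Only connect-src 'none' comes from the write-up;
// default-src and script-src are illustrative assumptions.
const SANDBOX_DIRECTIVES: Record<string, string[]> = {
  "default-src": ["'none'"],
  "script-src": ["'self'"],
  "connect-src": ["'none'"],
};

// Serialise a directive map into a Content-Security-Policy header value.
function buildCsp(directives: Record<string, string[]>): string {
  return Object.entries(directives)
    .map(([name, values]) => `${name} ${values.join(" ")}`)
    .join("; ");
}

const sandboxCsp = buildCsp(SANDBOX_DIRECTIVES);
```

`connect-src` governs fetch, XMLHttpRequest, WebSocket, EventSource and sendBeacon, but not WebRTC, which is why WebRTC needs separate handling.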

Verified Processing Environment

Three monitors run concurrently during every operation: PerformanceObserver watches for network activity, a CSP violation listener catches blocked requests, and a MutationObserver detects DOM injection. WebRTC is monkey-patched to block ICE candidate leaks. All events hash into a tamper-evident HMAC chain — alter one event and every subsequent hash breaks.
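One of the three monitors, the CSP violation listener, could look roughly like the sketch below. It relies on the standard `securitypolicyviolation` event and its `blockedURI` and `violatedDirective` fields. The `VpeLogEntry` shape and the `watchCspViolations` name are assumptions, and the event source is passed in as a parameter (in a browser it would be `document`) so the sketch does not depend on DOM typings.

```typescript
// Minimal structural types so the sketch runs outside a browser too.
interface ViolationLike {
  blockedURI: string;
  violatedDirective: string;
}
interface ViolationSource {
  addEventListener(type: string, listener: (e: ViolationLike) => void): void;
}

// Hypothetical log entry shape; the real VPE event format is not shown here.
interface VpeLogEntry {
  kind: "csp-violation";
  blockedURI: string;
  directive: string;
  ts: number;
}

// Record every request the browser blocked under the CSP.
function watchCspViolations(source: ViolationSource, log: VpeLogEntry[]): void {
  source.addEventListener("securitypolicyviolation", (e) => {
    log.push({
      kind: "csp-violation",
      blockedURI: e.blockedURI,
      directive: e.violatedDirective,
      ts: Date.now(),
    });
  });
}
```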

The output is a downloadable audit report: timestamped event log, HMAC chain integrity proof, and a summary of what happened during processing. The user can verify independently that nothing leaked.

Byte-level metadata stripping

pdf-lib can't reach embedded image streams. So the scrubber scans raw bytes for JPEG SOI markers (FF D8) and PNG magic (89 50 4E 47), then excises APP1/APP2/APP13/APP14 segments and tEXt/iTXt/eXIf/iCCP chunks at the byte level. Overlapping regions are sorted by offset and merged before excision.
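A minimal sketch of the PNG half of this scan, assuming the standard PNG chunk layout (4-byte big-endian length, 4-byte type, data, 4-byte CRC). The `Region` type and the `findPngMetadataChunks` name are illustrative, not taken from src/scrub/png.ts, and the sketch matches the full 8-byte PNG signature rather than only its first four bytes.

```typescript
interface Region {
  offset: number; // byte offset of the chunk's length field
  length: number; // whole chunk: length + type + data + CRC
}

const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];
const STRIP_TYPES = new Set(["tEXt", "iTXt", "eXIf", "iCCP"]);

function findPngMetadataChunks(buf: Uint8Array): Region[] {
  const regions: Region[] = [];
  for (let i = 0; i + PNG_SIGNATURE.length <= buf.length; i++) {
    if (!PNG_SIGNATURE.every((b, k) => buf[i + k] === b)) continue;
    let pos = i + PNG_SIGNATURE.length;
    // Walk chunks until IEND or until the data stops looking like a PNG.
    while (pos + 12 <= buf.length) {
      const len =
        ((buf[pos] << 24) | (buf[pos + 1] << 16) | (buf[pos + 2] << 8) | buf[pos + 3]) >>> 0;
      const type = String.fromCharCode(buf[pos + 4], buf[pos + 5], buf[pos + 6], buf[pos + 7]);
      const total = 12 + len;
      if (pos + total > buf.length) break; // truncated or not a real chunk
      if (STRIP_TYPES.has(type)) regions.push({ offset: pos, length: total });
      if (type === "IEND") break;
      pos += total;
    }
  }
  return regions;
}
```

The sketch does not verify chunk CRCs; it only reports where the metadata chunks sit.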

Most "metadata removal" tools don't touch embedded images. I only found out by hex-dumping a pdf-lib output.

Printer tracking dots

Many colour laser printers embed a grid of yellow dots on every page that encodes the date, time, and printer serial number. The Xerox DocuColor variant is a 15×8 grid whose layout was documented by the EFF and TU Dresden. PDF Changer decodes that pattern and shows the user what's embedded before they share the document.

Only covers Xerox DocuColor patterns. Other manufacturers use different encodings that aren't publicly documented.
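To make the decoding step concrete, here is a rough sketch of reading one column of an already-extracted dot grid. It follows the layout as I understand it from the EFF's published DocuColor guide: the top row holds a parity dot per column, the seven rows below are a binary value with the most significant bit on top, and every column has odd parity. The column assignments in `decodeTimestamp`, the `DotGrid` type, and all function names are assumptions for illustration, not the project's decoder.

```typescript
// 8 rows × 15 columns; grid[row][col] is true where a dot is present.
// Row 0 is assumed to be the column-parity row.
type DotGrid = boolean[][];

// Read rows 1..7 of a column as a 7-bit number, most significant bit first.
function columnValue(grid: DotGrid, col: number): number {
  let v = 0;
  for (let row = 1; row < 8; row++) v = (v << 1) | (grid[row][col] ? 1 : 0);
  return v;
}

// Odd parity: a valid column contains an odd number of dots, parity dot included.
function columnParityOk(grid: DotGrid, col: number): boolean {
  let dots = 0;
  for (let row = 0; row < 8; row++) if (grid[row][col]) dots++;
  return dots % 2 === 1;
}

// Assumed zero-based column indices (the EFF guide numbers columns from 1):
// minute = 1, hour = 4, day = 5, month = 6, year without century = 7.
function decodeTimestamp(grid: DotGrid) {
  return {
    minute: columnValue(grid, 1),
    hour: columnValue(grid, 4),
    day: columnValue(grid, 5),
    month: columnValue(grid, 6),
    yearOfCentury: columnValue(grid, 7),
  };
}
```

Locating the dots in a scanned page and aligning the grid is the harder part, and this sketch leaves it out.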

Figure 1: Metadata-removal coverage by category (3 tools × 6 categories)
Coverage: pdf-lib alone 1/6 · Most online tools 1/6 · PDF Changer (byte-level scanner) 5/6. No tool covers other manufacturers' printer dots (honest limit).
Coverage of the six metadata categories an attacker can read out of a “cleaned” PDF. The categories are PDF properties (Info dict: Title / Author / Creator), JPEG EXIF (APP1 / APP2 segments inside embedded JPEGs), JPEG IPTC (APP13), PNG metadata (tEXt / iTXt / eXIf / iCCP chunks), Xerox printer dots (the 15×8 DocuColor pattern), and other manufacturers’ printer dots (undocumented). pdf-lib alone clears PDF-level fields and stops there — the embedded image streams hold separate metadata that the library cannot reach. PDF Changer’s differentiator is the byte-level scanner in src/scrub/jpeg.ts and src/scrub/png.ts; Xerox dots are decoded, other manufacturers’ patterns aren’t publicly documented and no tool clears that column.

From the source

EXIF scanner: finds JPEG markers in raw PDF buffers (src/scrub/jpeg.ts)
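function findJpegMarkers(buf: Uint8Array): Marker[] {
  const markers: Marker[] = [];
  for (let i = 0; i < buf.length - 1; i++) {
    if (buf[i] !== 0xff || buf[i + 1] !== 0xd8) continue;
    // Found SOI: walk forward to find APP segments
    let pos = i + 2;
    while (pos < buf.length - 3) {
      if (buf[pos] !== 0xff) break;
      const type = buf[pos + 1];
      // SOS (0xDA) starts entropy-coded data and EOI (0xD9) ends the image;
      // neither is followed by another length-prefixed header segment.
      if (type === 0xda || type === 0xd9) break;
      // Big-endian segment length; it counts its own two bytes but not the marker.
      const len = (buf[pos + 2] << 8) | buf[pos + 3];
      if (len < 2) break; // malformed length: stop rather than misparse
      if (type >= 0xe0 && type <= 0xef) {
        markers.push({ offset: pos, type, length: len + 2 });
      }
      pos += len + 2;
    }
  }
  return markers;
}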
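The scrubber description says overlapping regions are combined before excision, which matters here because a thumbnail inside an EXIF segment can yield markers nested inside an outer one. A minimal sketch of that step, assuming each region is an `{ offset, length }` pair within the buffer; `Region`, `mergeRegions` and `excise` are illustrative names, not the project's.

```typescript
interface Region {
  offset: number;
  length: number;
}

// Sort by offset, then fold overlapping or touching regions together.
function mergeRegions(regions: Region[]): Region[] {
  const sorted = [...regions].sort((a, b) => a.offset - b.offset);
  const merged: Region[] = [];
  for (const r of sorted) {
    const last = merged[merged.length - 1];
    if (last && r.offset <= last.offset + last.length) {
      const end = Math.max(last.offset + last.length, r.offset + r.length);
      last.length = end - last.offset;
    } else {
      merged.push({ ...r });
    }
  }
  return merged;
}

// Copy everything outside the merged regions into a new, shorter buffer.
function excise(buf: Uint8Array, regions: Region[]): Uint8Array {
  const merged = mergeRegions(regions);
  const removed = merged.reduce((n, r) => n + r.length, 0);
  const out = new Uint8Array(buf.length - removed);
  let src = 0;
  let dst = 0;
  for (const r of merged) {
    out.set(buf.subarray(src, r.offset), dst);
    dst += r.offset - src;
    src = r.offset + r.length;
  }
  out.set(buf.subarray(src), dst);
  return out;
}
```

Removing bytes shortens the image stream and shifts every later byte offset, so a real pipeline also has to fix up the stream's /Length and the PDF cross-reference table. This sketch does not attempt that.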
HMAC chain: each event hashes the previous (src/vpe/chain.ts)
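async function appendEvent(chain: Chain, event: VpeEvent): Promise<Chain> {
  // Each payload embeds the previous entry's hash, so editing any earlier
  // event invalidates every hash that follows it.
  const payload = JSON.stringify({
    seq: chain.length,
    prev: chain.at(-1)?.hash ?? GENESIS,
    event,
    ts: Date.now(),
  });
  const hash = await hmacSha256(chain.key, payload);
  // Array spread copies only the entries, so carry the key over explicitly.
  return Object.assign([...chain, { payload, hash }], { key: chain.key });
}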
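Checking such a chain means recomputing each HMAC and confirming that every payload points at the hash before it. The sketch below is one way to do that with Web Crypto, assuming entries of the form `{ payload, hash }` with hex-encoded hashes. The `GENESIS` value, the `ChainEntry` type and both function names are assumptions, since the project's own `hmacSha256` and verifier are not shown.

```typescript
interface ChainEntry {
  payload: string; // JSON string containing at least { seq, prev }
  hash: string;    // hex-encoded HMAC-SHA-256 of payload
}

const GENESIS = "genesis"; // hypothetical sentinel for the first entry

async function hmacSha256Hex(key: CryptoKey, data: string): Promise<string> {
  const sig = await crypto.subtle.sign("HMAC", key, new TextEncoder().encode(data));
  return Array.from(new Uint8Array(sig), (b) => b.toString(16).padStart(2, "0")).join("");
}

// Returns the index of the first entry that fails, or -1 if the chain is intact.
async function verifyChain(key: CryptoKey, entries: ChainEntry[]): Promise<number> {
  let prev = GENESIS;
  for (let i = 0; i < entries.length; i++) {
    const { payload, hash } = entries[i];
    const parsed = JSON.parse(payload) as { seq: number; prev: string };
    if (parsed.seq !== i || parsed.prev !== prev) return i; // broken link
    if ((await hmacSha256Hex(key, payload)) !== hash) return i; // altered payload
    prev = hash;
  }
  return -1;
}
```

Verification needs the same HMAC key that produced the chain, so this check shows the log was not edited after the fact by someone without the key. It does not address the code delivery problem.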

What it doesn't do

  • No E2E browser tests. Unit and integration tests cover the processing pipeline but not the full UI flow.
  • Printer dot decoding only covers Xerox DocuColor. Other manufacturers use undocumented patterns.
  • The VPE audit is self-attested. The site delivers the JavaScript that creates the audit — a compromised server could serve modified code. This is the code delivery problem (same limitation as ProtonMail, Bitwarden, MEGA).
  • OCR (Tesseract.js) runs in the browser. Accuracy on scanned documents varies significantly with image quality.

Stack

React SPA + pdf-lib + PDF.js + Tesseract.js + Web Crypto (browser layer). Iframe sandbox with CSP connect-src 'none'. Service Worker for PWA/offline. Hono + Cloudflare Workers + D1 (edge API). WebAuthn passkeys + ECDSA P-256 offline entitlements.