PDF metadata removal misses embedded images. Here's what's still in the file.
- PDF files embed JPEG and PNG images as raw byte streams. These streams contain their own metadata: EXIF (APP1), ICC profiles (APP2), IPTC (APP13), and PNG tEXt/eXIf chunks.
- PDF-level metadata stripping (setting Title, Author, etc. to empty) does not touch the image streams. The EXIF data survives.
- pdf-lib operates on PDF objects. It cannot access the raw bytes inside image streams. I tested this by hex-dumping pdf-lib output.
A PDF is a container. When you insert a photograph, the PDF embeds the raw JPEG byte stream. The EXIF data — GPS coordinates, camera model, timestamps — goes along for the ride.
Tools that “remove PDF metadata” typically clear the PDF Info dictionary: Title, Author, Creator, Producer, dates. Those are PDF-level fields. The EXIF sitting inside the embedded image stream is a completely different structure, and pdf-lib — the library I was using — doesn't touch it.
I found out by hex-dumping the output of my own scrub function. Info dictionary: clean. Embedded JPEG: still had the full APP1 segment with EXIF headers intact.
What's inside an embedded JPEG
JPEG files start with an SOI marker: FF D8. After the SOI come APP segments, identified by markers FF E0 through FF EF. The relevant ones for metadata:
| Marker | Hex | Contents |
|---|---|---|
| APP1 | FF E1 | EXIF (camera, GPS, timestamps) and XMP (editing history) |
| APP2 | FF E2 | ICC colour profile (can identify the device) |
| APP13 | FF ED | IPTC (captions, copyright, location, creator) |
| APP14 | FF EE | Adobe colour transform data |
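To make the segment layout concrete, here is a minimal sketch that walks the header of a standalone JPEG byte stream and lists its APP segments. The `listAppSegments` name and the `JpegSegment` shape are hypothetical, made up for illustration; the sketch assumes the input starts at SOI and that every header segment before the scan data carries a two-byte big-endian length.

```typescript
// Hypothetical helper (not from the PDF Changer source): list the APPn
// segments that follow SOI in a standalone JPEG byte stream.
interface JpegSegment {
  marker: number // second marker byte, e.g. 0xe1 for APP1
  start: number  // offset of the 0xFF byte that opens the segment
  length: number // total bytes, including the 2-byte marker
}

function listAppSegments(jpeg: Uint8Array): JpegSegment[] {
  const out: JpegSegment[] = []
  if (jpeg.length < 4) return out
  const view = new DataView(jpeg.buffer, jpeg.byteOffset, jpeg.byteLength)
  // A JPEG stream must open with SOI (FF D8)
  if (view.getUint8(0) !== 0xff || view.getUint8(1) !== 0xd8) return out

  let pos = 2
  while (pos + 4 <= jpeg.length) {
    if (view.getUint8(pos) !== 0xff) break
    const marker = view.getUint8(pos + 1)
    // SOS (FF DA) starts the entropy-coded scan and EOI (FF D9) ends the
    // file; neither can be length-walked, so the header walk stops here.
    if (marker === 0xda || marker === 0xd9) break
    // The length field is big-endian and counts its own two bytes
    const segLen = view.getUint16(pos + 2)
    if (segLen < 2 || pos + 2 + segLen > jpeg.length) break
    if (marker >= 0xe0 && marker <= 0xef) {
      out.push({ marker, start: pos, length: 2 + segLen })
    }
    pos += 2 + segLen
  }
  return out
}
```

Filtering the result for markers `0xe1`, `0xe2`, `0xed` and `0xee` gives the four rows of the table above.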
PNG is the same story but different structure. Metadata lives in ancillary chunks: tEXt, iTXt, zTXt for text metadata, eXIf for EXIF (registered as a PNG extension, adopted in practice since ~2017), and iCCP for ICC profiles. The PDF container ignores all of them.
Why pdf-lib can't help
pdf-lib works at the object level. It sees pages, fonts, metadata dictionaries. When there's an embedded image, pdf-lib sees an image XObject — dimensions, colour space, a reference to the stream. The raw stream bytes? Opaque. It doesn't go inside.
I checked the spec to see if this was intentional. ISO 32000-2:2020 section 8.9 defines image streams as data to be decoded by the declared filter (DCTDecode for JPEG, FlateDecode for deflate-compressed data such as PNG-derived images). With DCTDecode the stream is the JPEG file itself, so the APP segments ride along inside it. A PDF-level tool has no reason to parse them: they're not PDF structures.
pdf-lib alone clears PDF-level fields and stops there; the embedded image streams hold separate metadata that the library cannot reach. PDF Changer’s differentiator is the byte-level scanner in src/scrub/jpeg.ts and src/scrub/png.ts. Xerox dots are decoded; other manufacturers’ patterns aren’t publicly documented, and no tool clears that column.
Byte-level scanning
The alternative is to scan the raw PDF bytes for image stream markers and strip the metadata segments directly:
```ts
// JPEG APP markers to strip: EXIF/XMP (APP1), ICC (APP2), IPTC (APP13), Adobe (APP14)
const JPEG_STRIP_MARKERS = new Set([0xe1, 0xe2, 0xed, 0xee])
// `bytes` is the raw PDF file as a Uint8Array
const segments: { start: number; length: number }[] = []

for (let i = 0; i < bytes.length - 3; i++) {
  if (bytes[i] !== 0xff) continue
  const marker = bytes[i + 1]
  if (!JPEG_STRIP_MARKERS.has(marker)) continue

  // Verify we're inside a JPEG by looking back for SOI (FF D8)
  let foundSoi = false
  for (let j = i - 1; j >= Math.max(0, i - 4096); j--) {
    if (bytes[j] === 0xff && bytes[j + 1] === 0xd8) {
      foundSoi = true
      break
    }
  }
  if (!foundSoi) continue

  // Read segment length (2 bytes, big-endian); the value counts its own two bytes
  const segLen = (bytes[i + 2] << 8) | bytes[i + 3]
  // Reject lengths that cannot be valid or that run past the end of the file
  if (segLen < 2 || i + 2 + segLen > bytes.length) continue
  segments.push({ start: i, length: 2 + segLen })
}
```

The backward scan for SOI matters. FF E1 appears by coincidence in binary data all the time. Checking for an SOI within 4KB before the marker reduces false positives, though it doesn't eliminate them entirely: FF D8 is only two bytes and can also appear by chance in other compressed streams in the PDF. (JPEG scan data itself stuffs every FF byte as FF 00, so a well-formed scan is not the source.) In practice, the combination of SOI proximity plus valid segment length has been reliable enough. I haven't hit a false positive in testing, but I wouldn't claim it's impossible.
For PNG: find the 8-byte magic (89 50 4E 47 0D 0A 1A 0A), then walk the chunk structure. Each chunk is 4 bytes length, 4 bytes type, data, 4 bytes CRC. Strip any chunk whose type is in the metadata set.
```ts
const PNG_STRIP_TYPES = new Set([
  "tEXt", // uncompressed text (Comment, Author, etc.)
  "iTXt", // international text (UTF-8)
  "zTXt", // compressed text
  "eXIf", // EXIF data (PNG extension, ~2017)
  "iCCP", // ICC colour profile
])
```
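The chunk walk described above could look roughly like the following. This is a sketch under assumptions, not the code from the repo: `findPngMetadataChunks`, `Region` and the `magicAt` parameter are hypothetical names, and the caller is assumed to have already located the offset of the 8-byte magic.

```typescript
// Hypothetical sketch of the PNG chunk walk; names are illustrative only.
const PNG_MAGIC = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]
// Mirrors the strip set listed earlier in the post
const METADATA_CHUNK_TYPES = new Set(["tEXt", "iTXt", "zTXt", "eXIf", "iCCP"])

interface Region {
  start: number  // offset of the chunk's length field
  length: number // 4 (length) + 4 (type) + data + 4 (CRC)
}

function findPngMetadataChunks(png: Uint8Array, magicAt = 0): Region[] {
  const regions: Region[] = []
  if (png.length < magicAt + 8) return regions
  if (!PNG_MAGIC.every((b, k) => png[magicAt + k] === b)) return regions

  const view = new DataView(png.buffer, png.byteOffset, png.byteLength)
  let pos = magicAt + 8
  while (pos + 12 <= png.length) {
    const dataLen = view.getUint32(pos) // big-endian by default
    const total = 12 + dataLen
    // Bounds check: a declared length that overruns the buffer ends the walk
    if (pos + total > png.length) break
    const type = String.fromCharCode(
      view.getUint8(pos + 4),
      view.getUint8(pos + 5),
      view.getUint8(pos + 6),
      view.getUint8(pos + 7),
    )
    if (METADATA_CHUNK_TYPES.has(type)) regions.push({ start: pos, length: total })
    if (type === "IEND") break // IEND is always the last chunk
    pos += total
  }
  return regions
}
```

The returned regions use the same `{ start, length }` shape as the JPEG `segments` array, so both can feed one excision step.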
When EXIF is followed immediately by an ICC profile (common), the excision regions are adjacent or overlapping. The code sorts by offset and merges before cutting to keep the output structurally valid.
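The sort-and-merge step can be sketched as follows. `mergeRegions` and `excise` are hypothetical names, and the sketch assumes every region lies within the buffer. It only shows the byte surgery: it does not update a PDF stream's /Length entry or any cross-reference offsets, which this post does not describe.

```typescript
// Hypothetical sketch: merge adjacent or overlapping regions, then cut them out.
interface Region {
  start: number
  length: number
}

function mergeRegions(regions: Region[]): Region[] {
  const sorted = [...regions].sort((a, b) => a.start - b.start)
  const merged: Region[] = []
  for (const r of sorted) {
    const last = merged[merged.length - 1]
    if (last !== undefined && r.start <= last.start + last.length) {
      // Touching or overlapping: extend the previous region instead of adding one
      const end = Math.max(last.start + last.length, r.start + r.length)
      last.length = end - last.start
    } else {
      merged.push({ start: r.start, length: r.length })
    }
  }
  return merged
}

function excise(bytes: Uint8Array, regions: Region[]): Uint8Array {
  const merged = mergeRegions(regions)
  const removed = merged.reduce((n, r) => n + r.length, 0)
  const out = new Uint8Array(bytes.length - removed)
  let src = 0
  let dst = 0
  for (const r of merged) {
    // Copy the bytes between the previous cut and this one
    out.set(bytes.subarray(src, r.start), dst)
    dst += r.start - src
    src = r.start + r.length
  }
  out.set(bytes.subarray(src), dst) // tail after the last cut
  return out
}
```

Merging first means each byte is removed once; cutting overlapping regions one at a time would shift offsets and delete the wrong bytes.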
What this doesn't cover
The SOI scan window is 4KB. If a JPEG stream somehow has more than 4096 bytes between SOI and the first APP marker, the scanner misses it. APP segments appear right after SOI in every file I've tested, but I can't rule out edge cases.
The PNG chunk parser trusts declared lengths. A malformed PNG with wrong length fields could misalign the walk. Bounds-checking prevents buffer overruns, but CRC isn't validated.
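If CRC validation were added, it might look like the sketch below. This is not something the current scanner does; `crc32` and `chunkCrcOk` are hypothetical names. PNG computes a standard CRC-32 (reflected polynomial 0xEDB88320) over the chunk type and data, excluding the length field.

```typescript
// Hypothetical sketch: bitwise CRC-32 as used by PNG (no lookup table, for brevity).
function crc32(data: Uint8Array): number {
  let crc = 0xffffffff
  for (let i = 0; i < data.length; i++) {
    crc ^= data[i]
    for (let k = 0; k < 8; k++) {
      // Shift right; XOR in the polynomial when the low bit was set
      crc = (crc >>> 1) ^ (0xedb88320 & -(crc & 1))
    }
  }
  return (crc ^ 0xffffffff) >>> 0
}

// Returns true when the stored CRC of the chunk at `chunkStart` matches
// the CRC computed over its type and data bytes.
function chunkCrcOk(bytes: Uint8Array, chunkStart: number): boolean {
  if (chunkStart + 12 > bytes.length) return false
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)
  const dataLen = view.getUint32(chunkStart)
  const crcAt = chunkStart + 8 + dataLen
  if (crcAt + 4 > bytes.length) return false
  return crc32(bytes.subarray(chunkStart + 4, crcAt)) === view.getUint32(crcAt)
}
```

A walk that stops at the first failing CRC would catch a misaligned parse before it strips the wrong bytes, at the cost of rejecting PNGs that are damaged but still viewable.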
The whole approach operates on raw PDF bytes, not decompressed streams. If a PDF applies FlateDecode on top of a JPEG stream (double-encoding), the JPEG markers are hidden inside the compressed data. I haven't encountered this in the PDFs I've processed, but I don't know how common it is across all PDF generators.
EXIF strip implementation: PDF Changer repo under apps/web/src/utils/pdf/exifStrip.ts and exifDetect.ts. Tests in exifStrip.test.ts build synthetic JPEG and PNG payloads with known metadata segments and verify they're removed while preserving image structure (SOI/EOI for JPEG, magic/IEND for PNG).