Metadata is data about data. Every file format that anyone bothered to make professionally useful includes fields for it: who created the file, what software, when, on which device. The fields exist because they're genuinely useful — file managers can sort photos by date, content management systems can attribute documents to authors, forensic tools can verify chains of custody. The same fields, in the wrong context, are a beacon pointing back to you.
What's Actually in Common File Types
The metadata surface varies dramatically by format.
JPEG and HEIC photos carry EXIF, IPTC, and XMP metadata. EXIF alone typically includes camera make and model, lens, aperture, shutter speed, ISO, date and time the photo was taken, and — if location services were enabled — exact GPS coordinates accurate to a few meters. Many cameras also embed a serial number that uniquely identifies the physical device. iPhone photos carry HEIC variants with the same kinds of fields plus Apple-specific items.
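To make the JPEG case concrete, here is a minimal sketch that walks the segment headers at the start of a JPEG and reports which metadata containers are present. It uses only the Python standard library, the function name `list_metadata_segments` is made up for this illustration, and it assumes a well-formed file with no padding bytes between segments. It does not decode individual EXIF tags; ExifTool does that far more thoroughly.

```python
import struct


def list_metadata_segments(jpeg: bytes) -> list[tuple[str, int]]:
    """Return (label, payload_length) for each APPn/COM segment before the image data."""
    if jpeg[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    found = []
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = jpeg[pos + 1]
        if marker == 0xDA:  # SOS: compressed image data follows, stop scanning
            break
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        payload = jpeg[pos + 4:pos + 2 + length]
        if 0xE0 <= marker <= 0xEF:
            label = f"APP{marker - 0xE0}"
            if payload.startswith(b"Exif\x00\x00"):
                label += " (EXIF)"
            elif payload.startswith(b"http://ns.adobe.com/xap/1.0/\x00"):
                label += " (XMP)"
            elif payload.startswith(b"Photoshop 3.0\x00"):
                label += " (IPTC/Photoshop)"
            found.append((label, len(payload)))
        elif marker == 0xFE:
            found.append(("COM", len(payload)))
        pos += 2 + length
    return found
```

Calling `list_metadata_segments(open("photo.jpg", "rb").read())` on a typical phone photo would be expected to show at least one EXIF segment; GPS coordinates and serial numbers, when present, live inside it.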
PDFs carry author, title, subject, keywords, creator software, producer software, creation date, and modification date, all in the document info dictionary and often duplicated in an XMP metadata stream. If the file was saved with incremental updates (the default in many editors), it also carries earlier revisions: each save appends new objects to the end of the file instead of rewriting it. Deleted pages, rewritten paragraphs, and old metadata values all stay in the file until it is rewritten from scratch and the unreferenced objects are garbage-collected.
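A rough way to check whether a PDF has been saved incrementally is to count its end-of-file markers and look for more than one author string. The sketch below is a heuristic under stated assumptions: the name `pdf_revision_hints` is hypothetical, it only sees uncompressed objects, and linearized PDFs legitimately contain two `%%EOF` markers, so a count above one is a hint and not proof.

```python
import re


def pdf_revision_hints(pdf: bytes) -> dict:
    """Collect byte-level hints that earlier revisions are still in the file."""
    return {
        # Each incremental save normally appends its own %%EOF marker.
        "eof_markers": len(re.findall(rb"%%EOF", pdf)),
        # Literal-string /Author values; older ones may sit in superseded objects.
        "authors": re.findall(rb"/Author\s*\(([^)]*)\)", pdf),
    }
```

If `authors` returns two different values, the older one is most likely sitting in a superseded object that a viewer no longer displays but that anyone can read from the raw bytes.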
Word documents (.docx) are ZIP archives containing XML. The docProps/core.xml and docProps/app.xml parts record the author name, the last-modified-by name, total editing time in minutes, the template the document was created from, and the company name Office was registered with. The document body can carry tracked changes and review comments until they are explicitly removed: accepting all changes resolves the tracked edits but leaves comments in place, and revision identifiers (rsid attributes) stay scattered through the XML.
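Because a .docx is just a ZIP archive, the property parts can be read with the Python standard library alone. The following is a minimal sketch; `read_docx_properties` is a name invented here, and it simply flattens every non-empty element in `docProps/core.xml` and `docProps/app.xml` into a dictionary keyed by the element's local name.

```python
import zipfile
import xml.etree.ElementTree as ET


def read_docx_properties(docx) -> dict[str, str]:
    """Return the document properties stored in an OOXML file (path or file object)."""
    props = {}
    with zipfile.ZipFile(docx) as archive:
        names = set(archive.namelist())
        for part in ("docProps/core.xml", "docProps/app.xml"):
            if part not in names:
                continue
            root = ET.fromstring(archive.read(part))
            for element in root.iter():
                if element.text and element.text.strip():
                    # Drop the "{namespace}" prefix ElementTree adds to tags.
                    local_name = element.tag.rsplit("}", 1)[-1]
                    props[local_name] = element.text.strip()
    return props
```

The same function should work on .xlsx and .pptx files, since they keep their properties in the same two parts. Keys such as `creator` and `lastModifiedBy` are the ones most likely to identify a person.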
LibreOffice and OpenDocument files are built the same way as DOCX (a ZIP archive of XML parts) and carry the same risks. ODT files record author names and edit times in `meta.xml`.
Excel and PowerPoint are no different — and PowerPoint additionally stores cropped image regions in full, so an "image cropped to the safe portion" can be uncropped by an attacker who unzips the .pptx and inspects the original asset.
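Checking which original image assets ship inside an Office file takes only a few lines. This sketch assumes the usual OOXML layout, where embedded pictures sit in a `media` folder (`ppt/media/`, `word/media/`, `xl/media/`); the function name `list_embedded_media` is made up for the example.

```python
import zipfile


def list_embedded_media(office_file) -> list[tuple[str, int]]:
    """Return (archive_path, uncompressed_size) for every embedded media asset."""
    with zipfile.ZipFile(office_file) as archive:
        return sorted(
            (info.filename, info.file_size)
            for info in archive.infolist()
            if "/media/" in info.filename
        )
```

The listing does not say whether an image is cropped on the slide. It only shows that the full asset is in the archive, which is enough to know it can be extracted and viewed whole.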
Real Cases Where Metadata Mattered
Vice Magazine and John McAfee, 2012. A Vice reporter posted a photo of himself with the fugitive John McAfee, captioned with location-suggestive language but no explicit address. The JPEG carried unstripped EXIF GPS coordinates pointing to a resort in Guatemala. Readers extracted and publicized the coordinates within hours, and Guatemalan authorities detained McAfee days later. The reporter and Vice publicly apologized for the operational error.
TSA operating manual, 2009. The Transportation Security Administration published a redacted version of its screening procedures manual as a PDF. The "redaction" was black rectangles drawn over the text in an overlay layer; the underlying text was still selectable and could be copied out. The procedures, which were marked Sensitive Security Information (not classified) and included how to handle diplomatic and CIA personnel, were widely republished within days.
Microsoft Word author leaks, ongoing. Documents leaked to journalists routinely identify their authors via the last-modified-by property (`cp:lastModifiedBy` in `docProps/core.xml`), which records the Office user name of whoever most recently saved the file. That name is usually filled in from the account created when the operating system or Office was set up, and it is rarely changed. Anonymous documents are often not anonymous.
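Reality Winner and the printer tracking dots, 2017. NSA contractor Reality Winner mailed a printed document about Russian election interference to The Intercept. The Intercept's editorial process included sharing a copy with NSA officials for verification, and the scan it later published showed the printer's nearly invisible yellow tracking dots, which encode the printer's serial number and the date and time of the print job. According to the FBI affidavit, investigators noticed that the copy appeared to have been printed and folded, checked the audit logs, and found that six people had printed the document, one of whom, Winner, had been in email contact with the outlet. She was arrested within days and later sentenced to five years and three months. The dots were not cited as the deciding evidence, but they would have narrowed the search on their own.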
Metadata burns sources because it's deterministic, easy to forget, and present in every file by default. The decision to leak is conscious; the decision to leave EXIF on usually isn't.
How to Strip It
The single most useful tool is ExifTool by Phil Harvey. It is the standard for reading metadata in nearly every file format that has any, it can write or delete metadata in most image formats and in PDFs, and it's free and open source.
| File type | Command |
|---|---|
| Any image | exiftool -all= -overwrite_original photo.jpg |
| PDF (basic metadata only, recoverable until the file is rewritten) | exiftool -all= -overwrite_original doc.pdf |
| PDF (also flatten edit history) | gs -o clean.pdf -sDEVICE=pdfwrite doc.pdf |
| DOCX / XLSX / PPTX | mat2 doc.docx (writes doc.cleaned.docx; ExifTool can read these formats but not write them) |
| Verify what's left | exiftool -a -G1 -s file.ext |
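For Word documents, ExifTool can show you the properties but cannot write OOXML files, so the stripping has to happen elsewhere. Tracked changes and comments live inside the document XML and are best removed via the application itself: File → Info → Check for Issues → Inspect Document → Remove All. For PDFs, note that ExifTool applies its changes as an incremental update, so the old values stay recoverable until the file is rewritten by another tool. For PDFs created by a long edit pipeline, the safest approach is to print to PDF from a fresh tool, which produces a new document with no inherited history.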
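To show what a stripping pass does to a JPEG at the byte level, here is a minimal sketch that copies the file while dropping the segments that usually hold metadata. It is an illustration, not a substitute for ExifTool: the name `strip_jpeg_metadata` and the choice of which markers to drop are assumptions made for this example, and it deliberately keeps APP0 (JFIF), APP2 (often the ICC color profile), and APP14 (Adobe) so that the image still renders correctly.

```python
import struct

# Assumed drop list: APP1 (EXIF, XMP), APP13 (IPTC/Photoshop), COM (comments).
DROP_MARKERS = {0xE1, 0xED, 0xFE}


def strip_jpeg_metadata(jpeg: bytes) -> bytes:
    """Return a copy of the JPEG without APP1, APP13, and COM segments."""
    if jpeg[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    out = bytearray(b"\xff\xd8")
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = jpeg[pos + 1]
        if marker == 0xDA:
            # SOS: everything from here on is image data, copy it untouched.
            out += jpeg[pos:]
            return bytes(out)
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        if marker not in DROP_MARKERS:
            out += jpeg[pos:pos + 2 + length]
        pos += 2 + length
    raise ValueError("no SOS marker found; file may be truncated")
```

Because the embedded thumbnail is stored inside the EXIF segment, it is removed along with the rest. Whatever tool does the stripping, the verification command in the table is still the step that tells you what is left.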
The Operating System Pass
For high-stakes leaks, the workflow used by experienced journalists looks roughly like this:
- Open the file on an offline workstation, ideally Tails or an equivalent amnesic OS.
- If the file is a scan, re-scan or re-render through a tool that does not preserve the original bytes (export through a flat raster image, then re-OCR if needed).
- Strip all metadata with exiftool.
- Compare a hex dump of the cleaned file to the original — look for any embedded strings that include usernames, paths, or device identifiers.
- Publish from a different network than the one used to source the file.
This is overkill for most uses. It is the appropriate baseline if a source's safety depends on the published file not being traced back to them.
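The hex-dump comparison in the workflow above can be partly automated with a `strings`-style scan. The sketch below pulls printable runs out of a file, in both ASCII and UTF-16LE (which Windows software often uses), and keeps the ones that contain any term you care about. The function name `find_identifying_strings` and the default minimum run length of 6 are arbitrary choices for this illustration.

```python
import re


def find_identifying_strings(data: bytes, needles: list[str], min_length: int = 6) -> list[str]:
    """Return printable runs in `data` that contain any needle (case-insensitive)."""
    # Runs of printable ASCII bytes.
    ascii_runs = re.findall(rb"[\x20-\x7e]{%d,}" % min_length, data)
    # Runs of printable characters stored as UTF-16LE (each followed by a zero byte).
    utf16_runs = re.findall(rb"(?:[\x20-\x7e]\x00){%d,}" % min_length, data)
    candidates = [run.decode("ascii") for run in ascii_runs]
    candidates += [run.decode("utf-16-le") for run in utf16_runs]
    lowered = [needle.lower() for needle in needles]
    return [text for text in candidates if any(n in text.lower() for n in lowered)]
```

A reasonable needle list is the source's username, hostname, home directory path, and device model. The scan only sees bytes that are stored uncompressed, so ZIP-based formats such as DOCX need to be unpacked first and each part scanned separately, and compressed PDF streams will hide strings from it.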
Metadata in Messaging
Sending a file through a messaging app adds another layer of metadata management: some apps strip EXIF before forwarding, some don't. Signal and WhatsApp strip image metadata by default. iMessage strips most EXIF on send. Telegram strips EXIF from photos sent the regular, compressed way but not from photos sent uncompressed as files. Most email clients pass attachments through untouched. These behaviors change between app versions, so test with a sample file before relying on any of them.
The conservative rule: assume any file you attach to anything is delivered with its metadata intact unless you explicitly stripped it.
The Underlying Principle
Metadata is the difference between what you meant to share and what you actually shared. The gap is closed by tools, not by intent. Every file format you'll ever use was designed to record context, because context is useful — until it isn't. The cost of a stray field is paid by the person on the other end of the leak, and the cost of forgetting is asymmetric: nothing happens 999 times out of 1,000, and on the thousandth time someone goes to prison.
For most files, none of this matters. For the ones that do, treat metadata as a class of redaction failure — a thing that's easy to miss, easy to fix, and catastrophic when it slips past you.