Cryptographic tools protect data in transit and at rest. They do not protect against inference — the extraction of sensitive conclusions from data that was intentionally collected or incidentally exposed. As machine learning models have grown more capable, the gap between "what was recorded" and "what can be derived" has widened considerably.
This piece covers specific inference capabilities that are technically established, what data sources enable them, and what practical defenses exist. We're deliberately not covering capabilities that are speculative or where published evidence is thin — there's enough in the documented record to warrant concern without extrapolation.
Gait Recognition and Physical Identity
Facial recognition is the most discussed biometric surveillance technology, but it has an obvious counter: cover your face. Gait recognition, which identifies individuals by their walking pattern, has no such counter. Published academic work has demonstrated gait recognition from low-resolution CCTV footage at distances up to 50 meters, where facial details are not resolvable. The input data is silhouette and motion pattern only.
Police in major Chinese cities have reportedly deployed commercial gait recognition systems developed by companies including Watrix. This is documented by journalism and by the companies' own marketing. The systems are designed specifically for cases where faces cannot be captured.
The privacy implication: physical anonymity in public spaces requires more than concealing your face. It requires fundamentally different movement, which is not practically achievable for most people over time.
Location Data and Sensitive Inference
Raw location data — latitude/longitude with timestamps — appears mundane until you consider what can be inferred from it. Published research has demonstrated that regular location patterns can reveal:
- Home and work addresses (most common overnight and daytime locations)
- Medical conditions (regular visits to specialist clinics, treatment centers, or pharmacies)
- Religious practice (weekly attendance at a place of worship)
- Political activity (presence at protests, campaign events, or party offices)
- Relationship status and social graph (co-location patterns with other devices)
None of these inferences require access to messages, search history, or any explicitly sensitive data. Location traces alone, correlated with points of interest, yield them.
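As a rough illustration of the first inference in the list above, the sketch below guesses home and work locations from timestamped pings. It is a minimal sketch under stated assumptions, not a description of any particular broker's or researcher's method: the function name `infer_home_work`, the 22:00 to 06:00 night window, the weekday 09:00 to 17:00 work window, and the three-decimal coordinate rounding (cells of roughly 100 m) are all illustrative choices, and the sample data is synthetic.

```python
from collections import Counter
from datetime import datetime


def infer_home_work(pings, precision=3):
    """Guess (home_cell, work_cell) from (datetime, lat, lon) pings.

    Assumption: home is the most frequent cell at night, work is the
    most frequent cell during weekday office hours.
    """
    night, day = Counter(), Counter()
    for ts, lat, lon in pings:
        # Snap coordinates to a coarse grid so nearby pings share a cell.
        cell = (round(lat, precision), round(lon, precision))
        if ts.hour >= 22 or ts.hour < 6:
            night[cell] += 1
        elif ts.weekday() < 5 and 9 <= ts.hour < 17:
            day[cell] += 1
    home = night.most_common(1)[0][0] if night else None
    work = day.most_common(1)[0][0] if day else None
    return home, work


# Synthetic sample: two night pings, two weekday daytime pings, one evening ping.
sample_pings = [
    (datetime(2024, 3, 4, 23, 15), 52.5200, 13.4050),
    (datetime(2024, 3, 5, 2, 40), 52.5201, 13.4049),
    (datetime(2024, 3, 5, 10, 5), 52.5000, 13.4500),
    (datetime(2024, 3, 5, 14, 30), 52.5001, 13.4502),
    (datetime(2024, 3, 5, 19, 0), 52.5100, 13.4200),
]

home_cell, work_cell = infer_home_work(sample_pings)
print("home:", home_cell, "work:", work_cell)
```

The other inferences in the list follow the same pattern: find the cells a device visits regularly, then look those cells up against a points-of-interest database.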
Precise location data is commercially available via the data broker industry. Mobile apps that request location permission (weather apps, games, retail apps) frequently sell this data to aggregators, who resell it to advertisers, insurance companies, hedge funds, and government agencies. In the United States, agencies that buy this data have generally treated the purchase as falling outside Fourth Amendment warrant requirements, on the theory that it is a commercial transaction for data users agreed to share, not a government search. See our piece on data broker opt-out for the practical picture.
Writing Style and Authorship Attribution
Stylometric analysis — identifying authors by writing patterns — has a long history in literary scholarship and forensic linguistics. Large language models have made it dramatically more accessible and accurate. The attributes that stylometric classifiers use include: sentence length distribution, punctuation habits, vocabulary breadth, word frequency patterns, spelling tendencies, and syntactic structure.
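To make the attribute list above concrete, here is a minimal sketch of a feature extractor covering a few of them (sentence length, vocabulary breadth, punctuation habits, and word frequencies). The function name `style_features`, the short `FUNCTION_WORDS` list, and the specific features are assumptions chosen for illustration; real classifiers use far larger feature sets.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny list of function words to track.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "but", "however")


def style_features(text):
    """Return a dict of simple stylometric features for one text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    n_words = len(words) or 1  # avoid division by zero on empty input
    features = {
        "avg_sentence_len": len(words) / (len(sentences) or 1),
        "type_token_ratio": len(counts) / n_words,  # vocabulary breadth
        "comma_rate": text.count(",") / n_words,
        "semicolon_rate": text.count(";") / n_words,
    }
    # Relative frequency of each tracked function word.
    for word in FUNCTION_WORDS:
        features["fw_" + word] = counts[word] / n_words
    return features


example = style_features("The cat sat. The dog ran, and the bird flew.")
for name, value in sorted(example.items()):
    print(f"{name}: {value:.3f}")
```

Features like these are largely independent of topic, which is why they can persist across contexts where the subject matter differs.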
Research published in academic venues on authorship attribution has demonstrated identification of authors across anonymized writing samples with high accuracy given sufficient training data. The practical implication: if you write publicly under your real name in one context and anonymously in another, the two corpora can often be linked by style alone.
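The linking step can be sketched with one common, simple technique: comparing character trigram profiles by cosine similarity. This is an illustrative stand-in, not the method of any specific study. The names `trigram_profile`, `cosine_similarity`, and `closest_author` are hypothetical, and a real attribution system would need much longer samples and a proper classifier.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Count overlapping character trigrams after normalizing whitespace and case."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine_similarity(a, b):
    """Cosine similarity between two trigram count profiles (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_author(anonymous_text, known_samples):
    """Return (best_author, scores) for an anonymous text against named samples."""
    target = trigram_profile(anonymous_text)
    scores = {
        author: cosine_similarity(target, trigram_profile(sample))
        for author, sample in known_samples.items()
    }
    return max(scores, key=scores.get), scores
```

Calling `closest_author(anonymous_post, {"alice": alice_corpus, "bob": bob_corpus})` with hypothetical corpora returns the candidate whose profile is nearest. A high score is a lead for an investigator, not proof of authorship.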
This matters for anyone using pseudonymous accounts, leaking documents, or writing in an anonymous publication. The defense — deliberately altering writing style — is cognitively demanding and difficult to maintain consistently. Machine translation to another language and back introduces noise but also artifacts. There is no robust, easy countermeasure.
Metadata Patterns in Communications
The NSA's metadata collection programs, revealed by Edward Snowden in 2013, established that communications metadata (who talks to whom, when, and for how long) is treated as legally distinct from content and collected at scale. Former NSA and CIA Director Michael Hayden stated publicly that "we kill people based on metadata."
Machine learning applied to communications metadata can reveal organizational structure, identify key nodes in social networks, track relationship formation and dissolution, and infer topics of discussion from timing patterns. Message lengths and inter-message gaps are metadata, not content, and email subject lines typically travel unencrypted in message headers even when the body is encrypted. All of these are informative.
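Even without machine learning, the "key nodes" idea can be shown with a plain degree count over who-contacted-whom records. The sketch below is a simplified illustration: the record format, the names `contact_graph` and `key_nodes`, and the sample data are all assumptions, and real analyses would also use timing, volume, and more robust centrality measures.

```python
from collections import defaultdict


def contact_graph(records):
    """Build an undirected contact graph from (sender, recipient) pairs."""
    graph = defaultdict(set)
    for sender, recipient in records:
        graph[sender].add(recipient)
        graph[recipient].add(sender)
    return graph


def key_nodes(graph, top=1):
    """Rank participants by number of distinct contacts (ties broken by name)."""
    ranked = sorted(graph, key=lambda node: (-len(graph[node]), node))
    return ranked[:top]


# Synthetic metadata: no message content is present anywhere in these records.
sample_records = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("alice", "dave"),
    ("bob", "carol"),
]

graph = contact_graph(sample_records)
print(key_nodes(graph, top=2))
```

Nothing in the input reveals what anyone said, yet the output already singles out the most connected participant.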
This is why end-to-end encryption's limits matter: E2E protects message content, not metadata. The metadata — who communicates with whom, when, how frequently — remains visible to service providers and potentially compelled from them.
What This Means for Threat Modeling
The practical consequence of inference capabilities is that traditional anonymization, meaning the removal of names and obvious identifiers from data, is no longer sufficient protection. Re-identification of "anonymized" datasets by cross-referencing with other data sources has been demonstrated repeatedly in academic literature. Subscribers in a dataset of anonymized Netflix ratings were re-identified by correlating their ratings with public IMDb ratings. In a study of anonymized mobile phone mobility data, four spatio-temporal points were enough to uniquely identify about 95% of individuals.
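The mechanics of this kind of re-identification are simple enough to sketch: given pseudonymous traces and a handful of points known about a target from some other source, keep only the traces that contain all of them. The function name `matching_traces`, the `(place, hour)` point format, and the sample data are illustrative assumptions, not the procedure of any cited study.

```python
def matching_traces(traces, known_points):
    """Return pseudonyms whose trace contains every known (place, hour) point."""
    known = set(known_points)
    return sorted(pid for pid, points in traces.items() if known <= points)


# Synthetic "anonymized" dataset: pseudonym -> set of (place, hour) observations.
sample_traces = {
    "u1": {("A", 8), ("B", 12), ("C", 20)},
    "u2": {("A", 8), ("B", 13), ("C", 20)},
    "u3": {("A", 8), ("D", 12)},
}

# One known point leaves three candidates; a second point leaves one.
print(matching_traces(sample_traces, [("A", 8)]))
print(matching_traces(sample_traces, [("A", 8), ("B", 12)]))
```

Each additional known point shrinks the candidate set, which is why a few observations from a public post or a receipt can be enough to single out one trace.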
| Data Type | What AI Can Infer | Practical Defense |
|---|---|---|
| Location history | Home, work, health, politics, relationships | Minimize app location permissions; use a phone without a SIM when attending sensitive events (Wi-Fi and Bluetooth can still expose location) |
| Writing samples | Link pseudonymous and real-name writing | No easy one; segment identities rigorously, consider machine translation noise |
| Communications metadata | Social graph, organizational structure, topic inference | E2E messaging with metadata-minimal design; avoid centralized platforms for sensitive comms |
| Video surveillance | Identity via face, gait, clothing patterns | Face covering (partial); no reliable gait countermeasure exists |
| Browsing patterns | Interests, ideology, health, finances | DNS-over-HTTPS, browser compartmentalization, Tor for sensitive research |
What Encrypted Communication Protects Against
End-to-end encrypted messaging protects message content from surveillance by the service provider and anyone who intercepts traffic in transit. It does not protect the metadata envelope — who communicates with whom and when. It does not protect against inference from separately collected data sources.
This is not an argument against encrypted communication; it is an argument for understanding what each layer of protection actually covers. Modern messaging protocols like MLS provide strong cryptographic guarantees for message content, including forward secrecy. They operate at the content layer. The metadata layer and the inference layer require separate, complementary approaches.
The honest framing: encryption remains the most effective tool for protecting message content. It does not solve the inference problem. Content interception and inference are both real threats; both deserve attention.
Toward Practical Responses
AI-enhanced inference has no clean technical solution of the kind that encryption provides against interception in transit. But practical hygiene matters:
- Minimize data generation where possible — use encrypted DNS, block trackers, limit location permissions to apps that genuinely need them.
- Use communication tools where metadata exposure is minimized, not just content-encrypted.
- Separate identities genuinely — different devices, different email addresses, not just different usernames on the same platform.
- Treat anonymized data skeptically — assume it can be re-identified with sufficient corroborating data.
None of this provides absolute protection. It raises the cost and complexity of surveillance. For most threat models — data brokers, commercial tracking, targeted advertising, opportunistic surveillance — that's sufficient. For the highest-stakes adversaries, there is no complete defense, and being honest about that matters.