What AI Can Infer About You: Machine Learning and the Surveillance Problem

April 30, 2026 · Haven Team · 10 min read

The surveillance problem used to be about collection: who has your data. Machine learning has shifted it toward inference: what patterns in that data reveal. Even sparse, anonymized datasets can now yield conclusions their collectors never explicitly measured. That changes the calculus of privacy significantly.


Cryptographic tools protect data in transit and at rest. They do not protect against inference — the extraction of sensitive conclusions from data that was intentionally collected or incidentally exposed. As machine learning models have grown more capable, the gap between "what was recorded" and "what can be derived" has widened considerably.

This piece covers specific inference capabilities that are technically established, what data sources enable them, and what practical defenses exist. We're deliberately not covering capabilities that are speculative or where published evidence is thin — there's enough in the documented record to warrant concern without extrapolation.

Gait Recognition and Physical Identity

Facial recognition is the most discussed biometric surveillance technology, but it has an obvious counter: cover your face. Gait recognition — identifying individuals by their walking pattern — does not. Published academic work has demonstrated gait recognition from low-resolution CCTV footage at distances up to 50 meters, where facial details are not resolvable. The input data is silhouette and motion pattern only.
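The core signal can be illustrated with a toy sketch. The data, period lengths, and feature choice below are illustrative assumptions, not how any deployed system works: treat a clip as a time series of silhouette widths, fold it into one averaged stride cycle, and compare normalized cycles. Real systems use far richer features, but the periodicity of gait — independent of clothing and camera distance — is the point.

```python
import math
import random
from statistics import mean, pstdev

def gait_signature(widths, period):
    """Fold a silhouette-width time series into one averaged stride cycle."""
    n_cycles = len(widths) // period
    cycle = [mean(widths[c * period + i] for c in range(n_cycles)) for i in range(period)]
    mu, sigma = mean(cycle), pstdev(cycle)
    # Normalizing makes the signature independent of camera distance / silhouette scale
    return [(x - mu) / sigma for x in cycle]

def similarity(a, b):
    """Cosine similarity between two equal-length signatures."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Synthetic clips: walker A has a 30-frame stride, walker B a 24-frame stride
rng = random.Random(0)
clip_a1 = [math.sin(2 * math.pi * t / 30) + 0.05 * rng.gauss(0, 1) for t in range(300)]
clip_a2 = [math.sin(2 * math.pi * t / 30) + 0.05 * rng.gauss(0, 1) for t in range(300)]
clip_b = [math.sin(2 * math.pi * t / 24) + 0.05 * rng.gauss(0, 1) for t in range(300)]

sig_a1 = gait_signature(clip_a1, period=30)
sig_a2 = gait_signature(clip_a2, period=30)
sig_b = gait_signature(clip_b, period=30)

print(similarity(sig_a1, sig_a2))  # close to 1: same stride period
print(similarity(sig_a1, sig_b))   # much lower: different cadence
```

Even this crude comparison separates the two walkers; production systems extract signatures robust to viewing angle, footwear, and load.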

China's Ministry of Public Security has deployed commercial gait recognition systems in major cities, developed by companies including Watrix. This is documented by journalism and the companies' own marketing. The systems are designed specifically for cases where faces cannot be captured.

The privacy implication: physical anonymity in public spaces requires more than concealing your face. It requires fundamentally different movement, which is not practically achievable for most people over time.

Location Data and Sensitive Inference

Raw location data — latitude/longitude with timestamps — appears mundane until you consider what can be inferred from it. Published research has demonstrated that regular location patterns can reveal:

- Home and work addresses, from where a device rests overnight and sits during business hours
- Health conditions, from repeated visits to clinics, pharmacies, or treatment centers
- Religious and political affiliation, from attendance at places of worship, rallies, or meetings
- Personal relationships, from sustained co-location between devices

None of these inferences require access to messages, search history, or any explicitly sensitive data. Location traces alone, correlated with points of interest, yield them.

The data broker layer

Precise location data is commercially available via the data broker industry. Mobile apps that request location permission — weather apps, games, retail apps — frequently sell this data to aggregators who resell it to advertisers, insurance companies, hedge funds, and government agencies. The purchase of this data typically avoids Fourth Amendment warrant requirements in the United States because it is voluntary commercial data, not a government search. See our piece on data broker opt-out for the practical picture.

Writing Style and Authorship Attribution

Stylometric analysis — identifying authors by writing patterns — has a long history in literary scholarship and forensic linguistics. Large language models have made it dramatically more accessible and accurate. The attributes that stylometric classifiers use include: sentence length distribution, punctuation habits, vocabulary breadth, word frequency patterns, spelling tendencies, and syntactic structure.
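A few of those attributes can be computed in a handful of lines. This sketch is a deliberately crude illustration — real attribution systems use hundreds of features and trained classifiers, and the sample texts and `profile_distance` metric here are invented for the example:

```python
import re

def style_features(text):
    """Crude stylometric profile: a few of the signals classifiers use."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "vocab_richness": len(set(words)) / max(len(words), 1),  # type/token ratio
        "comma_rate": text.count(",") / max(len(words), 1),
        "semicolon_rate": text.count(";") / max(len(words), 1),
    }

def profile_distance(a, b):
    """Sum of absolute feature differences; lower means more similar style."""
    return sum(abs(a[k] - b[k]) for k in a)

# Two samples in one (terse) style, one in a very different (ornate) style
a1 = "I walked home. It rained. The streets were empty. I liked it."
a2 = "The train was late. I waited. Nobody spoke. The platform was cold."
b = ("When the committee reconvened after its long deliberation, the chairman, "
    "who had grown visibly weary, proposed a compromise; nobody objected, and "
    "the measure, despite earlier resistance, passed without further debate.")

same = profile_distance(style_features(a1), style_features(a2))
diff = profile_distance(style_features(a1), style_features(b))
print(same, diff)  # same-author distance is far smaller
```

Even these four shallow features cleanly separate the two styles; models with richer feature sets do the same across thousands of candidate authors.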

Research published in academic venues on authorship attribution has demonstrated identification of authors across anonymized writing samples with high accuracy given sufficient training data. The practical implication: if you write publicly under your real name in one context and anonymously in another, the two corpora can often be linked by style alone.

This matters for anyone using pseudonymous accounts, leaking documents, or writing in an anonymous publication. The defense — deliberately altering writing style — is cognitively demanding and difficult to maintain consistently. Machine translation to another language and back introduces noise but also artifacts. There is no robust, easy countermeasure.

Metadata Patterns in Communications

The NSA's metadata collection programs, revealed by Edward Snowden in 2013, established that communications metadata — who talks to whom, when, and for how long — is treated as legally distinct from content and collected at scale. Former NSA Director Michael Hayden stated publicly that "we kill people based on metadata."

Machine learning applied to communications metadata can reveal organizational structure, identify key nodes in social networks, track relationship formation and dissolution, and infer topics of discussion from timing patterns. Subject lines of emails, message lengths, and inter-message gaps are metadata, not content — and they are informative.
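Identifying key nodes requires nothing more than counting. The sketch below — with invented participants and a simple distinct-contact count standing in for real centrality measures — shows how a coordinator surfaces from sender/recipient/timestamp records alone:

```python
from collections import defaultdict

def key_nodes(records, top=3):
    """Rank participants by how many distinct contacts they have.

    `records` is an iterable of (sender, recipient, timestamp) tuples --
    pure metadata, no content.
    """
    contacts = defaultdict(set)
    for sender, recipient, _ts in records:
        contacts[sender].add(recipient)
        contacts[recipient].add(sender)
    return sorted(contacts, key=lambda n: len(contacts[n]), reverse=True)[:top]

# Hypothetical records: "carol" bridges two otherwise separate groups
records = [
    ("alice", "carol", "09:00"), ("bob", "carol", "09:05"),
    ("carol", "dave", "09:10"), ("carol", "erin", "09:15"),
    ("dave", "erin", "09:20"), ("alice", "bob", "09:25"),
]
print(key_nodes(records, top=1))  # carol has the most distinct contacts
```

Graph analysis libraries add betweenness and community detection on top of this, but the input is the same envelope data every provider necessarily sees.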

This is why end-to-end encryption's limits matter: E2E protects message content, not metadata. The metadata — who communicates with whom, when, how frequently — remains visible to service providers and potentially compelled from them.

What This Means for Threat Modeling

The practical consequence of inference capabilities is that traditional anonymization — removing names and obvious identifiers from data — is no longer sufficient protection. Re-identification of "anonymized" datasets by cross-referencing with other data sources has been demonstrated repeatedly in academic literature. A dataset of anonymized Netflix ratings was de-anonymized by correlating with public IMDb ratings. Anonymized mobility data has been used to re-identify individuals using only four location points.
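The re-identification logic is a set intersection. This toy sketch — the dataset and pseudonyms are invented — shows why a few externally known sightings suffice: visit patterns are so distinctive that one pseudonym usually matches all of them.

```python
def reidentify(anonymized, observations):
    """Return pseudonym IDs whose traces contain every observed (place, hour) point."""
    return [pid for pid, visits in anonymized.items()
            if all(obs in visits for obs in observations)]

# Toy "anonymized" dataset: names removed, traces kept
anonymized = {
    "user-381": {("cafe", 9), ("office", 11), ("gym", 18), ("home", 22)},
    "user-552": {("cafe", 9), ("office", 11), ("park", 18), ("home", 23)},
    "user-907": {("station", 8), ("office", 11), ("gym", 18), ("bar", 21)},
}
# Four points known from elsewhere: a tagged photo, a receipt, a sighting
observations = [("cafe", 9), ("office", 11), ("gym", 18), ("home", 22)]
print(reidentify(anonymized, observations))  # uniquely matches one pseudonym
```

Scaling the dataset up doesn't help the defender much: the published mobility results show that four spatio-temporal points uniquely identify the vast majority of people even among millions of traces.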

| Data type | What AI can infer | Practical defense |
| --- | --- | --- |
| Location history | Home, work, health, politics, relationships | Minimize app location permissions; use a phone without a SIM when attending sensitive events |
| Writing samples | Link pseudonymous and real-name writing | No easy one; segment identities rigorously; consider machine-translation noise |
| Communications metadata | Social graph, organizational structure, topic inference | E2E messaging with metadata-minimal design; avoid centralized platforms for sensitive comms |
| Video surveillance | Identity via face, gait, clothing patterns | Face covering (partial); no reliable gait countermeasure exists |
| Browsing patterns | Interests, ideology, health, finances | DNS-over-HTTPS, browser compartmentalization, Tor for sensitive research |

What Encrypted Communication Protects Against

End-to-end encrypted messaging protects message content from surveillance by the service provider and anyone who intercepts traffic in transit. It does not protect the metadata envelope — who communicates with whom and when. It does not protect against inference from separately collected data sources.

This is not an argument against encrypted communication — it's an argument for understanding what each layer of protection actually covers. Modern messaging protocols like MLS provide strong cryptographic guarantees for content and forward secrecy. They operate at the content layer. The metadata layer and the inference layer require separate, complementary approaches.

The honest framing: encryption remains the most effective tool for protecting message content. It does not solve the inference problem. Both are real threats; both deserve attention.

Toward Practical Responses

The threat landscape from AI-enhanced inference doesn't have a clean technical solution the way transport encryption does. But practical hygiene matters:

- Audit which apps hold location permission, and revoke it wherever the app doesn't genuinely need it
- Keep pseudonymous and real-name identities in separate browsers, accounts, and writing contexts
- Prefer end-to-end encrypted messengers designed to retain minimal metadata
- Use Tor or a compartmentalized browser for research you wouldn't want correlated with your identity

None of this provides absolute protection. It raises the cost and complexity of surveillance. For most threat models — data brokers, commercial tracking, targeted advertising, opportunistic surveillance — that's sufficient. For the highest-stakes adversaries, there is no complete defense, and being honest about that matters.
