Unicode exists to represent every writing system humans use, which is an extraordinary engineering achievement and a security nightmare in equal measure. To handle right-to-left scripts, combining accents, and the thousands of visually similar glyphs across alphabets, the standard includes control characters that change rendering, multiple ways to encode the same visible string, and characters that produce no visible output at all. Every one of those features can be weaponized when a security boundary assumes text means exactly what it shows.
Trojan Source: Code That Lies to the Reviewer
In 2021, researchers Nicholas Boucher and Ross Anderson at Cambridge published "Trojan Source" (assigned CVE-2021-42574). The attack uses Unicode bidirectional override characters — control codes like Right-to-Left Override (U+202E) — to make source code display in a different order than it is compiled.
A compiler reads the raw byte sequence. A human reviewer reads the rendered output. Bidi overrides let an attacker arrange those two to disagree. A comment can be made to visually swallow an early return, or an if condition can appear to guard one branch while the logic runs another. The reviewer approves code that does something other than what their eyes told them. The proof-of-concept worked across C, C++, C#, JavaScript, Java, Rust, Go, and Python — it's a property of how editors render Unicode, not of any one language.
Trojan Source defeats human code review specifically — the control we lean on hardest for catching malicious changes. A backdoor introduced this way passes review precisely because the reviewer is competent and reading carefully. They're just reading a lie the renderer told them.
The disclosure prompted fast responses. GitHub, GitLab, and most major editors now warn when a file contains bidirectional control characters. Compilers including rustc added warnings for bidi codes in source. The defense is detection, not prevention: there's no way to forbid these characters globally — they're legitimately needed for Arabic and Hebrew string literals — so tooling flags their presence and lets humans decide.
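As a rough illustration of the detection approach, a scanner only needs to look for the small set of explicit bidi formatting characters (embeddings, overrides, isolates, and their terminators). The sketch below is a minimal Python example, not what GitHub, GitLab, or rustc actually run; the function name `find_bidi_controls` and the sample snippet are invented for illustration.

```python
import unicodedata

# Explicit bidirectional formatting characters: embeddings, overrides,
# isolates, and the characters that terminate them.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}

def find_bidi_controls(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, character name) for each bidi control found."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                findings.append((line_no, col, unicodedata.name(ch)))
    return findings

# Hypothetical input: a comment that hides a Right-to-Left Override.
sample = "x = 1\nif admin: # \u202e check\n"
for line_no, col, name in find_bidi_controls(sample):
    print(f"line {line_no}, column {col}: {name}")
```

A check like this fits the "flag and let a human decide" model: it reports where the characters are without rejecting files that legitimately contain Arabic or Hebrew strings.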
Homoglyphs: Characters That Aren't What They Look Like
The Latin "a" (U+0061) and the Cyrillic "а" (U+0430) are different code points that render identically in almost every font. There are thousands of such pairs across Unicode's scripts. This is the basis of homograph attacks, where a domain like "аpple.com" — with a Cyrillic first letter — is registered to impersonate the real one.
Browsers fought this in the domain space by displaying suspicious internationalized names in their raw Punycode form (the "xn--" prefix), and registries added script-mixing rules, but the technique generalizes well beyond URLs:
- Package names: a malicious dependency with a homoglyph name slips past a developer who copy-pastes from a poisoned tutorial, a variant of supply-chain typosquatting
- Usernames and display names: impersonating an admin or a known contact in a chat or forum
- Filter evasion: a blocklist matching the ASCII string never fires on the homoglyph variant
- Email sender spoofing: a display name that visually matches a trusted brand
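
To make the homoglyph problem concrete, the sketch below shows that two strings can render the same while comparing unequal, and flags labels that mix scripts. It is a heuristic under a loud assumption: Python's standard library does not expose the Unicode Script property, so the script is guessed from the first word of each character's name. The helper names `scripts_used` and `is_mixed_script` are hypothetical.

```python
import unicodedata

def scripts_used(text: str) -> set[str]:
    """Approximate each letter's script from the first word of its Unicode name.

    Assumption: names such as "LATIN SMALL LETTER A" or "CYRILLIC SMALL
    LETTER A" start with the script. This is a heuristic, not the real
    Script property.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def is_mixed_script(label: str) -> bool:
    """True when the letters of a label come from more than one script."""
    return len(scripts_used(label)) > 1

genuine = "apple"
spoofed = "\u0430pple"  # Cyrillic small a (U+0430) followed by Latin "pple"

print(genuine == spoofed)        # False: different code points
print(is_mixed_script(genuine))  # False
print(is_mixed_script(spoofed))  # True
```

A mixed-script check misses names spoofed entirely in one non-Latin script, which is why production systems usually rely on the Unicode confusables data instead of a heuristic like this one.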
Normalization Collisions
Unicode often provides more than one code point sequence for the same visible string. "é" can be a single precomposed code point (U+00E9) or an "e" followed by a combining acute accent (U+0065 U+0301). They look identical, compare unequal as raw code points, and become equal only after normalization. Unicode defines normalization forms (NFC, NFD, NFKC, NFKD) precisely to collapse these into a canonical representation.
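The two spellings of "é" can be compared directly with Python's standard `unicodedata` module. This is a small demonstration rather than a recommended API; `canonically_equal` is a name made up for the example.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(precomposed == decomposed)                   # False: different code points
print(len(precomposed), len(decomposed))           # 1 2
print(canonically_equal(precomposed, decomposed))  # True
```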
The danger is inconsistent normalization across a trust boundary. If a system validates a username before normalizing but stores it after, or if two services normalize differently, an attacker can craft an input that passes the check as one value and is used as another. The compatibility forms (NFKC/NFKD) are especially aggressive: they map the ligature "ﬁ" (U+FB01) to the two letters "fi" and fold many lookalikes together, which is great for search and dangerous for security decisions made at the wrong moment.
The rule that prevents most normalization bugs: normalize input to a single canonical form at the system boundary, once, before any validation, comparison, or storage. Never compare two strings that were normalized at different times or by different code.
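The ordering bug and its fix can be sketched in a few lines. Everything here is illustrative: the reserved-name set, the function names, and the choice of NFKC (picked because it makes the collision easy to see) are assumptions, and the same ordering rule applies to whichever form a system standardizes on.

```python
import unicodedata

RESERVED = {"admin", "root"}  # hypothetical reserved usernames

def register_unsafe(raw: str) -> str:
    """Buggy order: validates the raw input, then stores a normalized value."""
    if raw in RESERVED:
        raise ValueError("reserved name")
    return unicodedata.normalize("NFKC", raw)

def register_safe(raw: str) -> str:
    """Normalize once at the boundary, then validate the value that is stored."""
    name = unicodedata.normalize("NFKC", raw)
    if name in RESERVED:
        raise ValueError("reserved name")
    return name

# FULLWIDTH LATIN SMALL LETTER A (U+FF41) followed by ASCII "dmin".
crafted = "\uff41dmin"

print(register_unsafe(crafted))  # admin: the check saw one value, storage got another

try:
    register_safe(crafted)
except ValueError as exc:
    print("rejected:", exc)      # rejected: reserved name
```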
Zero-Width and Invisible Injection
Several Unicode characters produce no visible glyph: the zero-width space (U+200B), zero-width joiner (U+200D), zero-width non-joiner (U+200C), and others. They have legitimate typographic uses, but because they're invisible, they make excellent covert payloads.
Two distinct abuses matter. First, filter and detection evasion: inserting a zero-width character into the middle of a banned word breaks naive string matching while leaving the word perfectly readable to a human. Profanity filters, keyword blocklists, and signature-based malware detection have all been bypassed this way. Second, fingerprinting and watermarking: a unique pattern of invisible characters embedded in a document or message acts as a tracking beacon. If a leaked document carries a per-recipient zero-width watermark, the leaker can be identified from the copy alone — a concern for journalists and whistleblowers, and a close cousin of document metadata leakage.
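Both abuses are easy to reproduce. The sketch below is a toy, with a made-up blocklist and a deliberately simple watermark scheme (zero-width space for 0, zero-width non-joiner for 1) that is an assumption for illustration, not a description of any real tool.

```python
# Zero-width characters named above, plus WORD JOINER and the BOM /
# ZERO WIDTH NO-BREAK SPACE as two of the "others".
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

BLOCKLIST = {"forbidden"}  # hypothetical banned word

def strip_zero_width(text: str) -> str:
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)

def naive_match(text: str) -> bool:
    return any(word in text for word in BLOCKLIST)

def hardened_match(text: str) -> bool:
    """Match against a copy with zero-width characters removed."""
    return naive_match(strip_zero_width(text))

def embed_watermark(text: str, recipient_id: int, bits: int = 8) -> str:
    """Hide recipient_id after the first character: ZWSP = 0, ZWNJ = 1."""
    pattern = "".join(
        "\u200c" if (recipient_id >> i) & 1 else "\u200b"
        for i in reversed(range(bits))
    )
    return text[:1] + pattern + text[1:]

def extract_watermark(text: str) -> int:
    """Rebuild the hidden integer from the zero-width bit pattern."""
    value = 0
    for ch in text:
        if ch == "\u200b":
            value = value * 2
        elif ch == "\u200c":
            value = value * 2 + 1
    return value

evasive = "for\u200bbidden"
print(naive_match(evasive), hardened_match(evasive))  # False True

marked = embed_watermark("Quarterly report", recipient_id=5)
print(extract_watermark(marked))                      # 5
```

Stripping is applied only to the copy used for matching. Removing joiners from stored text can damage legitimate content, since they carry meaning in some scripts and in emoji sequences.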
More recently, the same invisible characters have become a vector for prompt injection against AI systems. Instructions hidden in zero-width or tag characters are invisible to a human reviewing the input but are tokenized and processed by a language model — a text-rendering gap reappearing in a brand-new context.
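The tag-character variant works because the Unicode Tags block (U+E0000 to U+E007F) mirrors ASCII at a fixed offset and typically renders as nothing. A minimal sketch, with hypothetical helper names, of how such a payload is hidden and how an input filter might surface or remove it before text reaches a model:

```python
TAG_BASE = 0xE0000
TAG_BLOCK = range(0xE0000, 0xE0080)      # the whole Tags block
TAG_PRINTABLE = range(0xE0020, 0xE007F)  # tag versions of printable ASCII

def hide_in_tags(ascii_text: str) -> str:
    """Shift printable ASCII into the Tags block (assumes ASCII-only input)."""
    return "".join(chr(TAG_BASE + ord(c)) for c in ascii_text)

def reveal_tags(text: str) -> str:
    """Recover any ASCII payload carried in tag characters."""
    return "".join(chr(ord(c) - TAG_BASE) for c in text if ord(c) in TAG_PRINTABLE)

def strip_tags(text: str) -> str:
    """Drop every code point in the Tags block."""
    return "".join(c for c in text if ord(c) not in TAG_BLOCK)

visible = "Summarize this page."
combined = visible + hide_in_tags("ignore prior rules")

print(reveal_tags(combined))           # ignore prior rules
print(strip_tags(combined) == visible) # True
```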
Defenses That Actually Work
There's no single switch, because Unicode's dangerous features are the same features that make it useful. The defenses are about deciding, per context, which slice of Unicode you actually need.
| Defense | How and where to apply it |
|---|---|
| Normalize at the boundary | Pick one form (usually NFC) and apply it once on input, before any validation or storage |
| Allowlist code-point ranges | For identifiers, usernames, and domains — restrict to the scripts you actually support rather than blocklisting bad characters |
| Detect and warn on control characters | Flag bidi overrides and zero-width characters in code, filenames, and security-sensitive fields |
| Confusable detection | Use the Unicode confusables data (UTS #39) to catch homoglyph impersonation in names and domains |
| Render the raw bytes when it matters | Security review tools should make invisible and reordering characters visible, not hide them |
The unifying principle is the same one that explains why verifying identity out of band matters: never let the display layer be your source of truth for a security decision. The bytes are the truth. The glyphs are a convenience that an attacker can control.
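Two of the defenses in the table, allowlisting and rendering the raw code points, are small enough to sketch. The username policy below (lowercase ASCII letters, digits, and underscore, 3 to 20 characters) is a hypothetical example, and a real product would choose the scripts it needs to support; `reveal` is an invented helper built on Python's `unicode_escape` codec.

```python
import re

# Hypothetical policy: lowercase ASCII letters, digits, underscore, 3-20 chars.
USERNAME_RE = re.compile(r"[a-z0-9_]{3,20}")

def is_allowed_username(name: str) -> bool:
    """Accept only names made entirely of allowlisted code points."""
    return USERNAME_RE.fullmatch(name) is not None

def reveal(text: str) -> str:
    """Show non-ASCII and control characters as visible escape sequences."""
    return text.encode("unicode_escape").decode("ascii")

print(is_allowed_username("alice_01"))    # True
print(is_allowed_username("\u0430lice"))  # False: Cyrillic first letter
print(is_allowed_username("al\u200bice")) # False: zero-width space
print(reveal("\u0430pple"))               # \u0430pple
print(reveal("a\u200bb"))                 # a\u200bb
```

The allowlist rejects homoglyphs and invisible characters without having to enumerate them, and the escaped rendering gives a reviewer something the display layer cannot disguise.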
Why This Matters for Messaging
Secure messengers are a natural target for these techniques. Display names invite homoglyph impersonation, message bodies can carry invisible watermarks, and any system that matches text against a block or mute list inherits the evasion problem. Building one means treating every piece of inbound text as potentially adversarial at the encoding level: normalizing display names, surfacing invisible characters where a user is making a trust decision, and never letting a rendered string stand in for the verified cryptographic identity underneath it. Encryption protects the message in transit; defending against what the message itself contains happens at a different layer entirely.