The Domain You Saw Wasn't the Domain You Got: Homograph Attacks Explained

In 2017, a security researcher registered the domain аррӏе.com and produced a working SSL certificate for it. Read it carefully — every character is Cyrillic, not Latin. In most browsers at the time, it rendered indistinguishably from apple.com. The class of attack hasn't gone away; it's just gotten quieter.

Domain names are supposed to be the human-readable layer of the internet's addressing system. They're how you check whether you've actually landed on your bank's site instead of an imitation. That trust assumption is more fragile than most users realize, because of a quirk of how internationalized domain names are encoded.

The vulnerability is called an IDN homograph attack — or more colloquially, Punycode phishing. Understanding it requires a brief tour through how non-Latin characters got into DNS in the first place.

How IDN Came to Exist

DNS was originally specified to handle a restricted character set: letters a-z, digits 0-9, and hyphen — the so-called LDH rule. That worked fine for English speakers and badly for everyone else. By the early 2000s, the IETF recognized the need to support domain names in scripts like Chinese, Arabic, Cyrillic, and Devanagari.

The solution, finalized in RFC 3490 (2003) and refined in RFC 5891 (2010), was clever. Rather than change the DNS protocol itself, IDNA (Internationalizing Domain Names in Applications) defines an encoding layer. Domain names containing non-ASCII characters are translated, at the application boundary, into an ASCII-compatible form: each non-ASCII label is encoded with Punycode (RFC 3492) and prefixed with xn--. The actual DNS lookup happens with the Punycode form. The display happens with the Unicode form.

For example, the domain москва.рф (Moscow.rf in Cyrillic) becomes xn--80adxhks.xn--p1ai in Punycode. Both forms refer to the same record in DNS, but only the Punycode form actually travels over the wire.
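The round trip between the two forms can be reproduced with Python's built-in idna codec. This is a minimal sketch: the codec implements the older RFC 3490 rules rather than the 2010 revision, and the helper names to_ascii and to_unicode are illustrative, not part of any standard API.

```python
def to_ascii(domain: str) -> str:
    """Return the ASCII-compatible (Punycode) form of a domain name."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Return the Unicode display form of an ASCII-compatible domain name."""
    return domain.encode("ascii").decode("idna")


wire_form = to_ascii("москва.рф")      # what is sent in the DNS query
display_form = to_unicode(wire_form)   # what the user is shown

print(wire_form)     # xn--80adxhks.xn--p1ai
print(display_form)  # москва.рф
```

Plain ASCII names pass through unchanged, so the same helper can be applied to every domain before it is compared or logged.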

The Attack

The attack works because Unicode contains many characters that are visually indistinguishable from Latin letters but are technically different code points. A few examples:

Cyrillic а (U+0430) vs. Latin a (U+0061)
Cyrillic е (U+0435) vs. Latin e (U+0065)
Cyrillic о (U+043E) vs. Latin o (U+006F)
Greek ο (U+03BF) vs. Latin o (U+006F)

An attacker registers a domain like аpple.com where the first character is Cyrillic. In Punycode, this becomes xn--pple-43d.com, which is clearly a different domain from apple.com. But when displayed in Unicode form, the two look identical.
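The difference is invisible on screen but obvious to software. The sketch below lists the code points in each label and encodes the spoofed one. The helper names describe and punycode_label are illustrative, and the encoding skips the normalization steps a full IDNA implementation would perform.

```python
import unicodedata


def describe(label: str) -> list[str]:
    """List each character's code point and official Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in label]


def punycode_label(label: str) -> str:
    """Encode one label; ASCII labels are returned unchanged."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


genuine = "apple"
spoofed = "\u0430pple"  # first letter is Cyrillic а (U+0430)

print(describe(genuine)[0])     # U+0061 LATIN SMALL LETTER A
print(describe(spoofed)[0])     # U+0430 CYRILLIC SMALL LETTER A
print(punycode_label(spoofed))  # xn--pple-43d
print(genuine == spoofed)       # False
```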

The 2017 demonstration by researcher Xudong Zheng went further: a domain composed entirely of Cyrillic letters that happen to look exactly like the Latin letters in "apple." Modern browsers responded to that demonstration with stricter display policies, but the underlying attack class remains live.

Why this is worse than typo-squatting

Typo-squatted domains like arnazon.com (rn instead of m) at least look slightly off if you read carefully. Homograph domains can be pixel-perfect identical to the original. There is no version of "reading carefully" that defends you.

What Browsers Do About It

Modern browsers implement a series of heuristics to decide when to display a domain in Unicode form versus its raw Punycode form. The general logic is:

Single script per label. If a domain label mixes Cyrillic and Latin characters, browsers typically display the Punycode form (the ugly xn-- string) rather than the visually similar Unicode form.
"Highly Restrictive" script profile. Some browsers apply the "Highly Restrictive" restriction level from Unicode Technical Standard #39 (Unicode Security Mechanisms), which limits which scripts can appear together in one label.
Top-level domain matching. A Cyrillic label made up entirely of Latin-lookalike letters may display as Unicode under a Cyrillic-associated TLD such as .ru, while under .com the same label triggers Punycode display, because .com is treated as a Latin TLD.
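The first and third heuristics can be sketched in a few lines. This is a rough illustration under stated assumptions, not any browser's actual algorithm: the script of a character is approximated from the first word of its Unicode name (real implementations use the Unicode Script property), the lookalike set and the list of Cyrillic-associated TLDs are small hypothetical samples, and the TLD rule is applied only to labels made entirely of Latin-lookalike letters.

```python
import unicodedata

# Hypothetical sample of lowercase Cyrillic letters that resemble Latin ones.
LATIN_LOOKALIKE_CYRILLIC = set(
    "\u0430\u0435\u043e\u0440\u0441\u0443\u0445\u0456\u0458\u0455\u04cf"
)
# Hypothetical sample of TLDs treated as Cyrillic-associated.
CYRILLIC_TLDS = {"ru", "рф"}


def scripts_in_label(label: str) -> set[str]:
    """Approximate the set of scripts used in a label (digits, hyphen ignored)."""
    scripts = set()
    for ch in label:
        if ch.isascii() and not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts


def to_punycode(label: str) -> str:
    return "xn--" + label.encode("punycode").decode("ascii")


def display_label(label: str, tld: str) -> str:
    """Decide whether to show a label as Unicode or as raw Punycode."""
    if label.isascii():
        return label
    scripts = scripts_in_label(label)
    if len(scripts) > 1:  # mixed scripts: never show the Unicode form
        return to_punycode(label)
    all_lookalikes = set(label) <= LATIN_LOOKALIKE_CYRILLIC
    if scripts == {"CYRILLIC"} and all_lookalikes and tld not in CYRILLIC_TLDS:
        return to_punycode(label)
    return label


print(display_label("\u0430pple", "com"))  # xn--pple-43d (mixed script)
print(display_label("москва", "ru"))       # москва
```

A production check would also need case folding, the full confusables data, and handling for scripts that are legitimately combined, such as the kana and Han characters used together in Japanese.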

These policies vary by browser and by version. Chrome documents its rules in its IDN display policy and has tightened them multiple times. Firefox applies its own display algorithm and exposes the network.IDN_show_punycode preference, which forces Punycode display for every IDN when enabled; Safari has its own internal logic. Mobile browsers, especially in-app browsers, are often less restrictive than desktop versions.

The TLD complication

Some TLD registries have their own restrictions on what scripts they accept. The .com registry, for instance, permits IDN registrations under a restrictive policy. Some country-code TLDs accept any script. Some refuse mixed scripts. The level of homograph protection you get depends partially on which TLD you're looking at.

Where Homograph Attacks Actually Land

Browser address bars get most of the attention, but they're not where homograph attacks tend to succeed in practice. The riskier surfaces are:

Email "From" addresses. Mail clients often display the friendly name and domain in different rendering contexts. A homograph in the sender domain can pass casual inspection.
Embedded links in messages. A link's display text is independent of its target URL, so the usual advice is to compare the two. When the display text and the target are both the homograph, that comparison shows a match and the deception survives it.
Code dependencies and package names. npm, PyPI, and other package registries have seen typo-squatting and homograph attacks where a single character difference led developers to install malicious dependencies.
Cryptocurrency exchange addresses. Wallet address bars and exchange UI elements have been observed displaying homograph domains in support links.
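On surfaces like these, software that renders a link can at least flag internationalized hostnames before showing them. The following is a minimal sketch, assuming the link target is available as a URL string; the function name suspicious_labels is hypothetical, and a flagged label is a prompt for closer inspection, not proof of an attack, since many legitimate domains are internationalized.

```python
from urllib.parse import urlsplit


def suspicious_labels(url: str) -> list[str]:
    """Return hostname labels that are non-ASCII or already Punycode-encoded."""
    host = urlsplit(url).hostname or ""
    return [
        label
        for label in host.split(".")
        if not label.isascii() or label.startswith("xn--")
    ]


print(suspicious_labels("https://apple.com/login"))     # []
print(suspicious_labels("https://xn--pple-43d.com/"))   # ['xn--pple-43d']
```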

Defenses That Actually Work

Defense	Effectiveness
Browser IDN display policies	Strong against pure-Cyrillic and mixed-script domains in major browsers
Password managers' domain matching	Excellent — password managers match on exact Punycode form, not visual rendering. If your password manager doesn't auto-fill, that's a warning sign.
Hardware security keys (WebAuthn)	Excellent — origin binding means the key won't authenticate to a different domain regardless of visual similarity
"Just read the URL carefully"	Fails by design — that's exactly the assumption homograph attacks break
Certificate transparency monitoring	Useful for brand protection (your company can watch for homograph registrations of its trademarks) but not for end-user defense
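The password-manager row comes down to comparing encoded hostnames instead of rendered ones. The sketch below shows the idea with hypothetical helpers ascii_host and should_autofill. Real password managers also take the scheme, subdomains, and registrable domain into account, so treat this as an illustration of the principle only.

```python
from urllib.parse import urlsplit


def ascii_host(url: str) -> str:
    """Extract the hostname and convert it to its ASCII (Punycode) form."""
    host = urlsplit(url).hostname or ""
    return host.encode("idna").decode("ascii")


def should_autofill(saved_url: str, visited_url: str) -> bool:
    """Offer credentials only when the encoded hostnames match exactly."""
    return ascii_host(saved_url) == ascii_host(visited_url)


saved = "https://apple.com/"
lookalike = "https://\u0430pple.com/login"  # Cyrillic а in the hostname

print(ascii_host(lookalike))              # xn--pple-43d.com
print(should_autofill(saved, lookalike))  # False
```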

Practical Recommendations

For individual users:

Use a password manager and trust its autofill behavior. If you arrive at a "login page" and your manager doesn't recognize it, treat that as a strong signal that the domain is not what you think it is — regardless of how it looks.
Prefer hardware security keys for high-value accounts. Email, banking, registrar accounts. WebAuthn's origin binding is genuinely phishing-resistant in a way that no display heuristic can match.
Don't follow login links from email. Type the address you actually want into the browser, or use a stored bookmark. This is unfashionable advice, but it remains one of the strongest defenses.

For organizations:

Monitor CT logs for variants of your brand. Tools like Facebook's CT monitoring or commercial brand-protection services watch for new certificates that match suspicious patterns. Combine with proactive registration of obvious homograph variants of your trademarks.
Train staff on this specific attack class. Generic "watch out for phishing" advice is insufficient because it implies that visual inspection works.
Implement DMARC enforcement on outbound mail. See our piece on email authentication for why this matters for sender impersonation.
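For the monitoring recommendation, a watch list of homograph variants can be generated mechanically. This is a sketch under stated assumptions: the CONFUSABLES map is a tiny hypothetical sample (a real list would draw on the Unicode confusables data), and the function name homograph_variants is illustrative.

```python
from itertools import product

# Hypothetical sample: Latin letter -> visually similar Cyrillic letter.
CONFUSABLES = {
    "a": "\u0430",
    "e": "\u0435",
    "o": "\u043e",
    "p": "\u0440",
    "c": "\u0441",
}


def homograph_variants(label: str) -> set[str]:
    """Return the Punycode forms of every lookalike substitution of a label."""
    options = [
        (ch, CONFUSABLES[ch]) if ch in CONFUSABLES else (ch,)
        for ch in label
    ]
    variants = set()
    for combo in product(*options):
        candidate = "".join(combo)
        if candidate != label:  # skip the genuine, unmodified label
            variants.add("xn--" + candidate.encode("punycode").decode("ascii"))
    return variants


watch_list = homograph_variants("apple")
print(len(watch_list))               # 15
print("xn--pple-43d" in watch_list)  # True
```

The resulting strings can be matched against hostnames in new certificates or registrations. The number of variants doubles with every substitutable letter, so long brand names may need a cap on how many positions are substituted at once.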

The general lesson is older than IDN: any time a system separates "how something is stored or transmitted" from "how it is displayed to a human," the gap becomes attackable. Machine-checked identity binding (password-manager domain matching, hardware keys, certificate pinning) is the only reliable defense, because it operates on the stored form, not the display form.

Where Haven Fits

Haven's identity model uses Matrix-style IDs (@haven_username:havenmessenger.com) for chat, which are subject to the same homograph risks as any other text identifier. Our defenses are the standard ones: passkey-based authentication for the account itself, signature-key verification for contact identity, and a deliberate UI choice to display contact handles in a way that surfaces non-ASCII characters explicitly rather than relying on font rendering to mask them.
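One way to surface non-ASCII characters explicitly is to replace each with its code point when rendering a handle. The sketch below is purely illustrative and is not Haven's actual implementation; the function name annotate_handle and the example username are hypothetical.

```python
def annotate_handle(handle: str) -> str:
    """Render a handle with every non-ASCII character shown as its code point."""
    return "".join(
        ch if ch.isascii() else f"[U+{ord(ch):04X}]"
        for ch in handle
    )


print(annotate_handle("@alice:havenmessenger.com"))
# @alice:havenmessenger.com
print(annotate_handle("@\u0430lice:havenmessenger.com"))
# @[U+0430]lice:havenmessenger.com
```

A real interface would need a gentler treatment for users whose names are legitimately written in non-Latin scripts, for example by annotating only handles that mix scripts.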

Related reading: TOFU key verification covers a parallel problem in cryptographic identity binding.