Before Dennis Snell started talking to me about Unicode, I thought displaying text on a screen was the most boring thing. I had no clue. It’s fascinating! And I’m sharing my favorite bits below.
Oh, and if you’re wondering what codepoints, code units, etc. are, you may want to start with the short introduction to Unicode I wrote earlier.
Multi-character codepoints
Some Unicode codepoints are rendered as a single glyph that looks like multiple characters. The ligature ﬀ (U+FB00) is one of them. It looks like two letters, but they’re inseparable. You can’t select just one f without selecting the other.
Unicode defines a few types of these multi-grapheme glyphs:
- Ligatures: they make specific character combinations more visually appealing, for example ﬃ, ﬄ, ﬂ, and ﬁ. Beware! Some text editors might automatically substitute ﬃ for ffi, resulting in actually-separate grapheme clusters rendering as a single glyph.
- Digraphs, such as Ǳ (U+01F1), ǈ (U+01C8), and ǌ (U+01CC). An interesting property of ǈ and the like is that it’s neither lowercase nor uppercase: it’s titlecase.
- Phonetic symbols, such as ʦ (U+02A6) or ʣ (U+02A3).
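To see that these glyphs really are single codepoints, here is a small sketch you can paste into a browser console or Node.js. It uses only the standard `normalize()`, `toLowerCase()`, and `toUpperCase()` string methods; the variable names are mine.

```javascript
// ﬀ looks like two letters but is stored as one codepoint.
const ligature = "\uFB00";
console.log(ligature.length); // 1

// Compatibility normalization (NFKC) splits it into two plain letters.
const expanded = ligature.normalize("NFKC");
console.log(expanded, expanded.length); // ff 2

// ǈ (U+01C8) is titlecase, so it has both a lowercase and an uppercase form.
const titlecase = "\u01C8";
console.log(titlecase.toLowerCase() === "\u01C9"); // true (ǉ)
console.log(titlecase.toUpperCase() === "\u01C7"); // true (Ǉ)
```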
Combining Characters
I’ll use the symbol ◌ (U+25CC) to symbolize any character. You’re about to learn why.
◌̄ (U+0304) is called a combining macron. It’s not a standalone character. Instead, it sticks to the preceding codepoint and modifies how that character is displayed. For example, "a\u0304" is rendered as ā.
Here’s a little picture if you’re a visual thinker:

This macron is a diacritical mark, part of the Combining Diacritical Marks block (U+0300 to U+036F). Another example is a tie (◌͡◌, U+0361), which spans two characters. For instance, "a\u0361a" becomes a͡a.
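One way to confirm that a combining mark is a separate codepoint is to compare the string length with the number of marks in it. The sketch below uses the `\p{M}` (Mark) property escape, which requires the `u` flag; `countMarks` is a hypothetical helper name, not something from a library.

```javascript
// Count the combining marks (Unicode general category M) in a string.
function countMarks(text) {
  return (text.match(/\p{M}/gu) || []).length;
}

const aMacron = "a\u0304"; // a + combining macron
const tied = "a\u0361a";   // a + combining tie + a

console.log(aMacron.length, countMarks(aMacron)); // 2 1
console.log(tied.length, countMarks(tied));       // 3 1
console.log(countMarks("plain"));                 // 0
```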
Comparing strings in JavaScript
Some graphemes can be represented in two ways:
- As a dedicated codepoint, e.g. é (U+00E9)
- As a base character followed by a diacritic codepoint, e.g. é as an e (U+0065) followed by an acute accent (◌́, U+0301)
There’s a profound implication to this. Consider these two ways of encoding the word café:
"caf\u00e9"
"caf\u0065\u0301"
Can you guess what happens when we compare them?
console.log('café' === 'café')
// or, equivalently:
console.log("caf\u00e9" === "caf\u0065\u0301");
The answer is false. JavaScript compares strings code unit by code unit, and the two strings contain different code units.
To get true, we’d need to normalize both strings to the same code units. Unicode defines four normalization forms for different purposes: NFC, NFD, NFKC, and NFKD. For our purposes, NFC normalization via the String.prototype.normalize() method will do the trick:
console.log('café'.normalize("NFC") === 'café'.normalize("NFC"))
// true
// "NFC" is the default so you can also skip the arguments:
console.log('café'.normalize() === 'café'.normalize())
// true
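In practice you may want to wrap the comparison in a helper so call sites can’t forget to normalize. Here is a minimal sketch; `equalsNormalized` is a hypothetical name, and the escapes make the two encodings explicit so nothing depends on how this page stores its characters.

```javascript
// Compare two strings after bringing both to the same normalization form.
function equalsNormalized(a, b, form = "NFC") {
  return a.normalize(form) === b.normalize(form);
}

const composed = "caf\u00e9";         // é as one codepoint
const decomposed = "caf\u0065\u0301"; // e + combining acute accent

console.log(composed === decomposed);                // false
console.log(equalsNormalized(composed, decomposed)); // true

// NFC composes and NFD decomposes, which shows up in the lengths.
console.log(decomposed.normalize("NFC").length); // 4
console.log(composed.normalize("NFD").length);   // 5
```

Normalizing once, when text enters your system, is usually cheaper than normalizing on every comparison.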
Stacking combining characters
Applying the same diacritic multiple times can lead to stacked visuals. For example, "\u0e01\u0e49\u0e49\u0e49" is displayed as ก้้้. How many accents can you stack this way? Yes:

Typing experience
Different text editors might offer varying interactions when dealing with combining characters. Try it yourself below. Put the text cursor at the end of the text field below and start pressing backspace:
In my Chrome, the first <backspace> press deleted the accent and the second <backspace> press deleted the character. A different editor, however, might delete both the accent and the character in a single <backspace> press.
The Unicode spec recommends a specific delete behavior, but each editor ultimately makes its own choices – Dennis once found a non-compliant behavior in VS Code. Who knows what may happen when you navigate the text with arrow keys or select a chunk of it!
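The two backspace behaviors described above can be sketched in a few lines. This is only an illustration of the difference, not how any particular editor implements deletion; both function names are hypothetical, and `Intl.Segmenter` is assumed to be available (modern browsers and recent Node.js versions).

```javascript
// Behavior 1: remove only the last codepoint (the accent goes first).
function deleteLastCodePoint(text) {
  return [...text].slice(0, -1).join("");
}

// Behavior 2: remove the whole last grapheme cluster in one press.
function deleteLastGrapheme(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const clusters = [...segmenter.segment(text)].map((part) => part.segment);
  return clusters.slice(0, -1).join("");
}

const text = "ba\u0304"; // "bā" written with a combining macron
console.log(deleteLastCodePoint(text)); // "ba"
console.log(deleteLastGrapheme(text));  // "b"
```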
Homoglyphs
Homoglyphs are characters that look identical but belong to different writing systems. For instance:
// Latin
[...'space'].map(s=>s.codePointAt(0))
// [115, 112, 97, 99, 101]
// Cyrillic
[...'ѕрасе'].map(s=>s.codePointAt(0))
// [1109, 1088, 1072, 1089, 1077]
These similarities are often exploited in phishing attacks and to bypass content filters. For more examples, refer to confusables.txt.
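A cheap first line of defense is to flag words that mix scripts. The sketch below uses `\p{Script=…}` property escapes; the watchlist of three scripts and the function names are assumptions for illustration, and a real filter would more likely build on the confusables data.

```javascript
// Scripts we check for. A hypothetical, deliberately short watchlist.
const WATCHED_SCRIPTS = {
  Latin: /\p{Script=Latin}/u,
  Cyrillic: /\p{Script=Cyrillic}/u,
  Greek: /\p{Script=Greek}/u,
};

// List which of the watched scripts appear in the text.
function scriptsIn(text) {
  return Object.keys(WATCHED_SCRIPTS).filter((name) =>
    WATCHED_SCRIPTS[name].test(text)
  );
}

// A word that mixes scripts is suspicious.
function looksMixed(text) {
  return scriptsIn(text).length > 1;
}

console.log(scriptsIn("space"));                          // [ 'Latin' ]
console.log(scriptsIn("\u0455\u0440\u0430\u0441\u0435")); // [ 'Cyrillic' ]
console.log(looksMixed("s\u0440ace"));                    // true (Cyrillic р inside a Latin word)
```

Note the limitation: the all-Cyrillic lookalike is not mixed, so this check alone won’t catch it.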
Zero-Width Characters
Certain codepoints don’t produce visible symbols but influence text rendering.
Zero-width joiner, or ZWJ (U+200D) combines two codepoints into a single visible glyph. For example, combining emojis:
// Two separate emojis:
console.log("🐻" + "❄️")
// '🐻❄️'
// With a zero-width joiner, they become a polar bear:
"🐻\u200D❄️"
// '🐻‍❄️'
Stacking multiple ZWJs can create complex emojis, like a family:
// Way 1: People emojis combined with ZWJs.
// 5 codepoints in total!
"👨\u200D👩\u200D👦"
// '👨‍👩‍👦'
These combinations, however, can be quite byte-heavy:
new TextEncoder().encode('👨\u200D👩\u200D👦').length
// 18
In contrast, the dedicated family emoji codepoint U+1F46A can be encoded in only 4 bytes:
new TextEncoder().encode('👪').length
// 4
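To put the numbers side by side, here is a small sketch that measures a string in four ways. `measure` is a hypothetical helper; it assumes `TextEncoder` and `Intl.Segmenter` are available, which holds in modern browsers and recent Node.js versions.

```javascript
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Report the size of a string at each level of the Unicode stack.
function measure(text) {
  return {
    utf16CodeUnits: text.length,
    codePoints: [...text].length,
    utf8Bytes: new TextEncoder().encode(text).length,
    graphemes: [...graphemeSegmenter.segment(text)].length,
  };
}

// man + ZWJ + woman + ZWJ + boy
const zwjFamily = "\u{1F468}\u200D\u{1F469}\u200D\u{1F466}";
// the dedicated family codepoint
const singleFamily = "\u{1F46A}";

console.log(measure(zwjFamily));
// { utf16CodeUnits: 8, codePoints: 5, utf8Bytes: 18, graphemes: 1 }
console.log(measure(singleFamily));
// { utf16CodeUnits: 2, codePoints: 1, utf8Bytes: 4, graphemes: 1 }
```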
Flag emojis, however, don’t have dedicated codepoints and are expressed with two adjacent regional indicator codepoints [1]:
'🇩'+'🇪'
// 🇩🇪
Note there’s no zero-width joiner here. The 🇩 (U+1F1E9) and 🇪 (U+1F1EA) codepoints fuse together by default. If you want to display the two letters 🇩 and 🇪 side by side rather than a flag, you need to use…
Zero-width non-joiner, or ZWNJ (U+200C) is the opposite of zero-width joiner. It prevents two codepoints from combining when they would combine by default:
'🇩\u200c🇪'
// 🇩‌🇪 (two separate letters instead of a flag)
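Because regional indicators run from 🇦 (U+1F1E6) to 🇿 (U+1F1FF) in alphabetical order, a two-letter country code maps to a flag with simple arithmetic. Here is a rough sketch with hypothetical helper names; whether a given pair actually renders as a flag depends on the font and platform.

```javascript
const REGIONAL_INDICATOR_A = 0x1f1e6; // 🇦

// Turn a two-letter code such as "DE" into a pair of regional indicators.
function countryCodeToFlag(code) {
  if (!/^[A-Za-z]{2}$/.test(code)) {
    throw new RangeError("Expected a two-letter country code");
  }
  const indicators = [...code.toUpperCase()].map(
    (letter) => REGIONAL_INDICATOR_A + (letter.charCodeAt(0) - 65)
  );
  return String.fromCodePoint(...indicators);
}

// Same indicators, kept apart with a zero-width non-joiner.
function countryCodeAsLetters(code) {
  return [...countryCodeToFlag(code)].join("\u200C");
}

console.log(countryCodeToFlag("de") === "\u{1F1E9}\u{1F1EA}"); // true
console.log([...countryCodeAsLetters("DE")].length);           // 3
```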
Zero-width space (U+200B) indicates a place where a single, long word can be safely broken into multiple lines. It’s useful for browsing Java codebases on small mobile screens 😉
Variation selectors (U+FE00 to U+FE0F) select which graphical variation of the preceding character is displayed. For symbols that have both text and emoji presentation, selector 15 (U+FE0E) selects the text presentation and selector 16 (U+FE0F) selects the emoji presentation:
// Variation selector 15 – text presentation
'\u25B6\uFE0E'
// '▶'
// Variation selector 16 – emoji presentation
'\u25B6\uFE0F'
// '▶️'
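If you need to force one presentation programmatically, a tiny helper can drop any existing selector and append the one you want. This is a sketch meant for a single symbol, with hypothetical names; it does not check whether the symbol actually has both presentations.

```javascript
const TEXT_SELECTOR = "\uFE0E";  // variation selector 15
const EMOJI_SELECTOR = "\uFE0F"; // variation selector 16

// Request "text" or "emoji" presentation for a single symbol.
function withPresentation(symbol, style) {
  const bare = symbol.replace(/[\uFE0E\uFE0F]/g, "");
  return bare + (style === "emoji" ? EMOJI_SELECTOR : TEXT_SELECTOR);
}

console.log(withPresentation("\u25B6", "emoji") === "\u25B6\uFE0F");      // true
console.log(withPresentation("\u25B6\uFE0F", "text") === "\u25B6\uFE0E"); // true
```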
Bidirectional characters are what make it possible to mix left-to-right (like English) and right-to-left (like Hebrew or Arabic) text in the same sentence. Most of the time, it just works. But sometimes, things get weird.
Imagine your website has a little search feature. It tells the user how many times their query appeared on the page. Works great—until someone searches in Hebrew:
const query = `קידוד טקסט`;
console.log(`We found ${query} 13 times on the page`);
// We found קידוד טקסט 13 times on the page
Looks fine at first glance… but wait—why is the number 13 sitting in the wrong place? It’s been sucked into the Hebrew direction and flipped to the left!
To fix this, we can set a boundary between Latin and Hebrew scripts using two special codepoints: Right-to-Left Isolate (U+2067) and Pop Directional Isolate (U+2069).
const query = `קידוד טקסט`;
console.log(`We found \u2067${query}\u2069 13 times on the page`);
// We found קידוד טקסט 13 times on the page
Much better—everything stays where it should.
But these invisible direction characters can be dangerous, too. There’s an exploit called Trojan Source that hides malicious behavior in plain sight using them:
`if(role === "user\u202E \u2066// Check if admin\u2069 \u2066") {`
// if(role === "user // Check if admin ") {
To a human reading the code, it looks like a harmless comment. But to the compiler, it’s something completely different.
If you’re curious (or paranoid), check out the full Unicode Bidirectional Algorithm.
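For the paranoid, a linter-style check for Trojan Source can be as simple as scanning source text for explicit directional formatting characters. The sketch below covers the embeddings and overrides (U+202A to U+202E) and the isolates (U+2066 to U+2069); `findBidiControls` is a hypothetical name, and real tools may flag a wider set.

```javascript
// Explicit directional embeddings, overrides, and isolates.
const BIDI_CONTROLS = /[\u202A-\u202E\u2066-\u2069]/g;

// Report every bidi control character and where it occurs.
function findBidiControls(source) {
  return [...source.matchAll(BIDI_CONTROLS)].map((match) => ({
    index: match.index,
    codePoint:
      "U+" + match[0].codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
  }));
}

const suspicious =
  'if(role === "user\u202E \u2066// Check if admin\u2069 \u2066") {';

console.log(findBidiControls(suspicious).map((hit) => hit.codePoint));
// [ 'U+202E', 'U+2066', 'U+2069', 'U+2066' ]
console.log(findBidiControls('if(role === "user") {').length); // 0
```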
Private Use Areas and Noncharacters
Unicode isn’t just about official letters and symbols—it also leaves some space for… well, anything you want.
Private-use characters are codepoints that Unicode intentionally leaves blank – your app can use them to define your own symbols.
Noncharacters (e.g. U+FFFFE) are codepoints that are permanently reserved and will never represent any symbol or letter. They’re useful for, say, data processing to signal things like end-of-stream, but they’re never meant to show up in your text.
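Both categories are defined by fixed ranges, so checking for them is plain arithmetic. Here is a minimal sketch with hypothetical function names: the noncharacters are U+FDD0 to U+FDEF plus the last two codepoints of each of the 17 planes, and the private-use codepoints are U+E000 to U+F8FF plus planes 15 and 16 minus their noncharacters.

```javascript
// True for the 66 permanently reserved noncharacters.
function isNoncharacter(codePoint) {
  if (codePoint < 0 || codePoint > 0x10ffff) return false;
  const inFdd0Range = codePoint >= 0xfdd0 && codePoint <= 0xfdef;
  const lastTwoOfPlane = (codePoint & 0xfffe) === 0xfffe; // ...FFFE or ...FFFF
  return inFdd0Range || lastTwoOfPlane;
}

// True for codepoints in the three Private Use Areas.
function isPrivateUse(codePoint) {
  return (
    (codePoint >= 0xe000 && codePoint <= 0xf8ff) ||
    (codePoint >= 0xf0000 && codePoint <= 0xffffd) ||
    (codePoint >= 0x100000 && codePoint <= 0x10fffd)
  );
}

console.log(isNoncharacter(0xffffe)); // true
console.log(isNoncharacter(0xfffd));  // false (that's the replacement character)
console.log(isPrivateUse(0xe000));    // true
console.log(isPrivateUse(0x41));      // false
```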
And that’s it! Isn’t Unicode wonderful?
Big thanks to Dennis Snell for reviewing this post and sharing even more Unicode wisdom.
[1] Some flags can be represented with tags or ZWJ, e.g. "🏴\u200D☠️" yields a pirate flag 🏴‍☠️