Discovering Unicode: Codepoints, Combining Characters & More

Unicode Facts You Should Know

April 22, 2025

Before Dennis Snell started talking to me about Unicode, I thought displaying text on a screen was the most boring thing. I had no clue. It’s fascinating! And I’m sharing my favorite bits below.

If you’re wondering what codepoints, code units, etc. are, you may want to start with the short introduction to Unicode I wrote earlier.

Multi-character codepoints

Some Unicode codepoints are rendered as a single glyph that looks like multiple characters. The ligature ﬀ (U+FB00) is one of them. It looks like two letters, but they’re inseparable. You can’t select just one f without selecting the other.

Unicode defines a few types of these multi-letter codepoints:

Ligatures: they make specific character combinations more visually appealing, for example: ﬃ, ﬄ, ﬂ, and ﬁ. Beware! Some text editors might automatically render ffi as ﬃ, resulting in actually-separate grapheme clusters rendering as a single glyph.
Digraphs, such as Ǳ (U+01F1), ǈ (U+01C8), and ǌ (U+01CC). An interesting property of ǈ and the like is that it’s neither lowercase nor uppercase: it’s titlecase.
Phonetic symbols, such as ʦ (U+02A6) or ʣ (U+02A3).
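Here’s a small sketch to check these claims in plain JavaScript. It only uses built-in string methods, and the variable names are mine. It shows that the ﬀ ligature is a single codepoint, that compatibility normalization (NFKC) splits it back into two letters, and that ǈ has separate uppercase and lowercase counterparts.

```javascript
// The ligature ﬀ (U+FB00) is one codepoint, even though it looks like two letters.
const ligature = "\uFB00";
console.log([...ligature].length); // 1

// NFKC (compatibility) normalization replaces it with two plain "f" letters.
const expanded = ligature.normalize("NFKC");
console.log(expanded === "ff"); // true

// ǈ (U+01C8) sits between an uppercase form (U+01C7) and a lowercase form (U+01C9).
const titlecase = "\u01C8";
console.log(titlecase.toUpperCase() === "\u01C7"); // true
console.log(titlecase.toLowerCase() === "\u01C9"); // true
```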

Combining Characters

I’ll use the symbol ◌ (U+25CC) to symbolize any character. You’re about to learn why.

◌̄ (U+0304) is called a combining macron. It’s not a standalone character. Instead, it sticks to the preceding codepoint and modifies how that character is displayed, for example "a\u0304" is rendered as ā.

Here’s a little picture if you’re a visual thinker:

Illustration showing the addition of a combining macron diacritic to the letter 'a', resulting in the character 'ā'.

This macron is a diacritical mark, a part of the Unicode range U+0300 to U+036F. Another example is a tie (◌͡◌, U+0361), which combines two characters. For instance, "a\u0361a" becomes a͡a.
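To make the difference between codepoints and what the reader sees more concrete, here’s a minimal sketch. It assumes a runtime that ships Intl.Segmenter (recent browsers and Node.js versions do); the variable names are only for illustration.

```javascript
const withMacron = "a\u0304"; // "a" followed by the combining macron

// Two UTF-16 code units and two codepoints...
console.log(withMacron.length); // 2
console.log([...withMacron].length); // 2

// ...but a single user-perceived character (grapheme cluster).
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemes = [...segmenter.segment(withMacron)].map((part) => part.segment);
console.log(graphemes.length); // 1

// NFC normalization folds the pair into the dedicated codepoint ā (U+0101).
console.log(withMacron.normalize("NFC") === "\u0101"); // true
```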

Comparing strings in JavaScript

Some graphemes can be represented in two ways:

As a dedicated codepoint, e.g. é (U+00E9)
As a base character followed by a diacritic codepoint, e.g. é as an e (U+0065) followed by an acute accent (◌́, U+0301)

There’s a profound implication to this. Consider these two ways of encoding the word café:

"caf\u00e9"
"caf\u0065\u0301"

Can you guess what happens when we compare them?

console.log('café' === 'café')
// or, equivalently:
console.log("caf\u00e9" === "caf\u0065\u0301");

The answer is false. JavaScript compares strings code unit by code unit, and the two strings contain different code units.

To get true, we’d need to normalize both strings to the same code units. Unicode defines four normalization forms for different purposes: NFC, NFD, NFKC, and NFKD. For our purposes, NFC normalization via the String.prototype.normalize() method will do the trick:

console.log('café'.normalize("NFC") === 'café'.normalize("NFC"))
// true

// "NFC" is the default, so you can also skip the argument:
console.log('café'.normalize() === 'café'.normalize())
// true
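If you compare user-provided strings in many places, it may be handy to wrap the normalization in a helper. The sketch below is one way to do it; equalsNormalized is a name I made up, not a built-in. It also shows that NFC composes while NFD decomposes.

```javascript
const composed = "caf\u00e9"; // é as a dedicated codepoint
const decomposed = "caf\u0065\u0301"; // e followed by a combining acute accent

// Hypothetical helper: compare two strings after normalizing both to the same form.
function equalsNormalized(a, b, form = "NFC") {
  return a.normalize(form) === b.normalize(form);
}

console.log(composed === decomposed); // false
console.log(equalsNormalized(composed, decomposed)); // true

// NFC yields the shorter, composed form; NFD yields the longer, decomposed one.
console.log(decomposed.normalize("NFC").length); // 4
console.log(composed.normalize("NFD").length); // 5
```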

Stacking combining characters

Applying the same diacritic multiple times can lead to stacked visuals. For example, "\u0e01\u0e49\u0e49\u0e49" is displayed as ก้้้. How many accents can you stack this way? Yes:

Screenshot of a StackOverflow question discussing Unicode combining characters and how to filter them, including user engagement metrics and tags related to the topic.
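If you ever need to tame such stacks, one possible approach is to cap the number of consecutive combining marks. This is only a sketch: limitCombiningMarks is a hypothetical helper, it relies on the \p{M} (Mark) Unicode property escape in regular expressions, and it assumes max is a non-negative integer.

```javascript
// Hypothetical helper: keep at most `max` combining marks in a row.
function limitCombiningMarks(text, max = 2) {
  // Builds a pattern like (\p{M}{2})\p{M}+ and keeps only the captured marks.
  const run = new RegExp(`(\\p{M}{${max}})\\p{M}+`, "gu");
  return text.replace(run, "$1");
}

const stacked = "\u0e01\u0e49\u0e49\u0e49"; // the Thai example with three marks
console.log(limitCombiningMarks(stacked, 1) === "\u0e01\u0e49"); // true

const heavilyStacked = "e" + "\u0301".repeat(50);
console.log(limitCombiningMarks(heavilyStacked).length); // 3
```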

Typing experience

Different text editors might offer varying interactions when dealing with combining characters. Try it yourself: put the text cursor at the end of a text field that contains a combining character and start pressing backspace.

In my Chrome, the first <backspace> press deleted the accent and the second <backspace> press deleted the character. A different editor, however, might delete both the accent and the character in a single <backspace> press.

The Unicode spec recommends a specific delete behavior, but each editor ultimately makes its own choices. Dennis once found a non-compliant behavior in VS Code. Who knows what may happen when you navigate the text with arrow keys or select a chunk of it!

Homoglyphs

Homoglyphs are characters that look identical, or nearly so, but are different codepoints, often from different writing systems. For instance:

// Latin
[...'space'].map(s=>s.codePointAt(0))
// [115, 112, 97, 99, 101]

// Cyrillic
[...'ѕрасе'].map(s=>s.codePointAt(0))
// [1109, 1088, 1072, 1089, 1077]

These similarities are often exploited in phishing attacks and to bypass content filters. For more examples, refer to confusables.txt.
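A rough way to flag suspicious strings is to check whether they mix scripts. The sketch below uses Script property escapes in regular expressions; scriptsUsed and the short script list are my own assumptions, and real-world detection would lean on the confusables data instead of this simple heuristic.

```javascript
// A deliberately short list of scripts to check, for illustration only.
const SCRIPTS = {
  Latin: /\p{Script=Latin}/u,
  Cyrillic: /\p{Script=Cyrillic}/u,
  Greek: /\p{Script=Greek}/u,
};

// Hypothetical helper: which of the scripts above appear in the text?
function scriptsUsed(text) {
  return Object.keys(SCRIPTS).filter((name) => SCRIPTS[name].test(text));
}

console.log(scriptsUsed("space")); // [ 'Latin' ]
console.log(scriptsUsed("\u0455\u0440\u0430\u0441\u0435")); // [ 'Cyrillic' ]

// A Latin word with one Cyrillic "а" (U+0430) smuggled in:
console.log(scriptsUsed("sp\u0430ce")); // [ 'Latin', 'Cyrillic' ]
```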

Zero-Width Characters

Certain codepoints don’t produce visible symbols but influence text rendering.

Zero-width joiner, or ZWJ (U+200D) combines two codepoints into a single visible glyph. For example, combining emojis:

// Two separate emojis:
console.log("🐻" + "❄️")
// '🐻❄️'

// With a zero-width joiner, they become a polar bear:
"🐻\u200D❄️"
// '🐻‍❄️'

Stacking multiple ZWJs can create complex emojis, like a family:

// Way 1: People emojis combined with ZWJs.
// 5 codepoints in total!
"👨\u200D👩\u200D👦"
// '👨‍👩‍👦'

These combinations, however, can be quite byte-heavy:

new TextEncoder().encode('👨‍👩‍👦').length
// 18

In contrast, the dedicated family emoji codepoint U+1F46A can be encoded in only 4 bytes:

new TextEncoder().encode('👪').length
// 4
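To see where those bytes come from, here’s a short sketch that measures the ZWJ family in code units, codepoints, and UTF-8 bytes. The string is written with escapes so that the invisible joiners are easy to spot; the variable names are mine.

```javascript
// man, ZWJ, woman, ZWJ, boy
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F466}";

console.log(family.length); // 8 UTF-16 code units (2 per emoji, 1 per ZWJ)
console.log([...family].length); // 5 codepoints
console.log(new TextEncoder().encode(family).length); // 18 UTF-8 bytes (4 per emoji, 3 per ZWJ)

// Splitting on the joiner recovers the individual emojis.
const members = family.split("\u200D");
console.log(members.length); // 3
```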

Flag emojis, however, don’t have dedicated codepoints and are expressed with two adjacent regional indicator codepoints: ¹

'🇩'+'🇪'
// 🇩🇪

Note there’s no zero-width joiner here. 🇩‌ (U+1F1E9) and 🇪 (U+1F1EA) codepoints fuse together by default. If you want to display country code 🇩‌🇪 and not a flag, you need to use…

Zero-width non-joiner, or ZWNJ (U+200C) is the opposite of zero-width joiner. It prevents two codepoints from combining when they would combine by default:

'🇩\u200c🇪'
// 🇩‌🇪
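The regional indicators map one-to-one onto the letters A to Z, so a flag can be built from a two-letter country code. Here’s a minimal sketch, assuming the input consists of two ASCII letters; flagFromCountryCode is a hypothetical helper, not a built-in.

```javascript
const REGIONAL_INDICATOR_A = 0x1f1e6; // regional indicator symbol letter A

// Hypothetical helper: turn a two-letter ASCII country code into a flag emoji.
function flagFromCountryCode(code) {
  const letters = [...code.toUpperCase()];
  return String.fromCodePoint(
    // "A" has char code 65, so this is the letter's offset from "A".
    ...letters.map((letter) => REGIONAL_INDICATOR_A + letter.charCodeAt(0) - 65)
  );
}

const flag = flagFromCountryCode("de");
console.log(flag === "\u{1F1E9}\u{1F1EA}"); // true

// Joining the indicators with a ZWNJ keeps them from fusing into a flag.
const spelledOut = [...flag].join("\u200C");
console.log([...spelledOut].length); // 3 codepoints
```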

Zero-width space (U+200B) indicates a place where a single, long word can be safely broken into multiple lines. It’s useful for browsing Java codebases on small mobile screens 😉
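As a playful sketch of that idea, the snippet below inserts a zero-width space at every lowercase-to-uppercase boundary of a camel-case identifier. The function name and the break rule are my own assumptions; you may prefer different break points.

```javascript
// Hypothetical helper: allow line breaks between the words of a camelCase name.
function addBreakOpportunities(identifier) {
  return identifier.replace(/([a-z])([A-Z])/g, "$1\u200B$2");
}

const breakable = addBreakOpportunities("AbstractSingletonProxyFactoryBean");

// It looks the same on screen, but it now contains four invisible break points.
console.log(breakable.split("\u200B"));
// [ 'Abstract', 'Singleton', 'Proxy', 'Factory', 'Bean' ]
console.log(breakable.length); // 37
```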

Variation selectors (U+FE00 to U+FE0F) select which graphical variation of the preceding character is displayed. For symbols that have both text and emoji presentation, selector 15 (U+FE0E) selects the text presentation and selector 16 (U+FE0F) selects the emoji presentation:

// Variation selector 15 – text presentation
'\u25B6\uFE0E'
// '▶'

// Variation selector 16 – emoji presentation
'\u25B6\uFE0F'
// '▶️'
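If you want to switch a symbol between the two presentations, something along these lines might work. It’s a sketch that assumes the input is a single symbol optionally followed by a selector; withPresentation is a made-up name.

```javascript
const VS15 = "\uFE0E"; // text presentation
const VS16 = "\uFE0F"; // emoji presentation

// Hypothetical helper: drop any existing presentation selector, then append the requested one.
function withPresentation(symbol, selector) {
  return symbol.replace(/[\uFE0E\uFE0F]/g, "") + selector;
}

const emojiPlay = "\u25B6\uFE0F";
const textPlay = withPresentation(emojiPlay, VS15);
console.log(textPlay === "\u25B6\uFE0E"); // true

// Both variants share the same base codepoint, U+25B6.
console.log(textPlay.codePointAt(0) === emojiPlay.codePointAt(0)); // true
```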

Bidirectional characters are what make it possible to mix left-to-right (like English) and right-to-left (like Hebrew or Arabic) text in the same sentence. Most of the time, it just works. But sometimes, things get weird.

Imagine your website has a little search feature. It tells the user how many times their query appeared on the page. Works great—until someone searches in Hebrew:

const query = `קידוד טקסט`;
console.log(`We found ${query} 13 times on the page`);
// We found קידוד טקסט 13 times on the page

Looks fine at first glance… but wait—why is the number 13 sitting in the wrong place? It’s been sucked into the Hebrew direction and flipped to the left!

To fix this, we can set a boundary between Latin and Hebrew scripts using two special codepoints: Right-to-Left Isolate (U+2067) and Pop Directional Isolate (U+2069).

const query = `קידוד טקסט`;
console.log(`We found \u2067${query}\u2069 13 times on the page`);
// We found ‏קידוד טקסט‎ 13 times on the page

Much better—everything stays where it should.
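When you don’t know the direction of the user’s text in advance, one option is First Strong Isolate (U+2068), which lets the text pick its own direction from its first strongly directional character. A minimal sketch, where isolate and userQuery are names I made up:

```javascript
const FSI = "\u2068"; // First Strong Isolate
const PDI = "\u2069"; // Pop Directional Isolate

// Hypothetical helper: wrap text so that its direction doesn't leak into its surroundings.
function isolate(text) {
  return `${FSI}${text}${PDI}`;
}

const userQuery = "קידוד טקסט";
const message = `We found ${isolate(userQuery)} 13 times on the page`;
console.log(message);

// The isolate characters are invisible, but they are part of the string.
console.log(message.length === `We found ${userQuery} 13 times on the page`.length + 2); // true
```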

But these invisible direction characters can be dangerous, too. There’s an exploit called Trojan Source that hides malicious behavior in plain sight using them:

`if(role === "user\u202E \u2066// Check if admin\u2069 \u2066") {`
// if(role === "user‮ ⁦// Check if admin⁩ ⁦") {

To a human reading the code, it looks like a harmless comment. But to the compiler, it’s something completely different.
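One possible defense is to scan source files for bidirectional control characters before they reach review. The sketch below covers the embedding, override, and isolate controls (U+202A to U+202E and U+2066 to U+2069); findBidiControls is a hypothetical helper, and a real linter would likely check more than this.

```javascript
// Embeddings and overrides (U+202A..U+202E) plus isolates (U+2066..U+2069).
const BIDI_CONTROLS = /[\u202A-\u202E\u2066-\u2069]/g;

// Hypothetical helper: list every bidi control character with its position.
function findBidiControls(source) {
  return [...source.matchAll(BIDI_CONTROLS)].map((match) => ({
    index: match.index,
    codepoint: "U+" + match[0].codePointAt(0).toString(16).toUpperCase(),
  }));
}

const suspicious = 'if(role === "user\u202E \u2066// Check if admin\u2069 \u2066") {';
const hits = findBidiControls(suspicious);
console.log(hits.map((hit) => hit.codepoint));
// [ 'U+202E', 'U+2066', 'U+2069', 'U+2066' ]
```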

If you’re curious (or paranoid), check out the full Unicode Bidirectional Algorithm.

Private Use Areas and Noncharacters

Unicode isn’t just about official letters and symbols—it also leaves some space for… well, anything you want.

Private-use characters are codepoints that Unicode intentionally leaves blank – your app can use them to define your own symbols.

Noncharacters (e.g. U+FFFFE) are codepoints that are permanently reserved and will never represent any symbol or letter. They’re useful for, say, data processing to signal things like end-of-stream, but they’re never meant to show up in your text.
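Both groups live in fixed ranges, so checking for them is a matter of simple arithmetic. Here’s a sketch with two helpers whose names are my own: the private-use check covers the three private use areas, and the noncharacter check covers U+FDD0 to U+FDEF plus the last two codepoints of every plane.

```javascript
// Hypothetical helper: is this codepoint in one of the three private use areas?
function isPrivateUse(codepoint) {
  return (
    (codepoint >= 0xe000 && codepoint <= 0xf8ff) ||
    (codepoint >= 0xf0000 && codepoint <= 0xffffd) ||
    (codepoint >= 0x100000 && codepoint <= 0x10fffd)
  );
}

// Hypothetical helper: is this codepoint one of the 66 noncharacters?
function isNoncharacter(codepoint) {
  if (codepoint < 0 || codepoint > 0x10ffff) return false;
  // Codepoints ending in FFFE or FFFF are the last two of their plane.
  return (codepoint >= 0xfdd0 && codepoint <= 0xfdef) || (codepoint & 0xfffe) === 0xfffe;
}

console.log(isPrivateUse(0xe000)); // true
console.log(isNoncharacter(0xffffe)); // true
console.log(isNoncharacter(0xfffd)); // false (that's the replacement character)
```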

And that’s it! Isn’t Unicode wonderful?

Big thanks to Dennis Snell for reviewing this post and sharing even more Unicode wisdom.

¹ Some flags can be represented with tags or ZWJ, e.g. "🏴\u200D☠️" yields a pirate flag 🏴‍☠️
