Short introduction to Unicode and UTF-8

Posted by Adam Zieliński

I’m fascinated with Unicode. This blog post is one I wish I had 20 years ago when I was starting to learn about text encoding.

Around 2005, I struggled to use Polish words on my website. I would type koło in Windows Notepad, but in Firefox I would see ko³o. Why? Notepad used the windows-1250 standard to save the file, but the browser used ISO 8859-1 to interpret it. The only way to get it right was to create and open index.html using the same standard. And even then, I could not use the Chinese word 成功 – it wasn't part of ISO 8859-1.

Fast forward to 2025: Unicode is the de facto standard for text handling in digital systems. With Unicode, you can use 37 different writing systems on a single website and they all just work. Unicode can express practically every character you can think of, right-to-left languages, math symbols, and so much more. But how exactly does it work?

Codepoints

Unicode maps every character to a unique number between 0 and 1,114,111 [1]. These numbers are called codepoints. Here are a few codepoints – I'll use hexadecimal notation throughout this article:

Character | Codepoint (decimal) | Codepoint (hex)
A         | 65                  | 41
B         | 66                  | 42
C         | 67                  | 43
ó         | 243                 | F3
⨜         | 10780               | 2A1C
成        | 25104               | 6210

For brevity, U+6210 is often used to refer to Unicode codepoint 6210 (hex).

Codepoints go beyond printable characters. The codepoint U+0301 adds an accent to the preceding character (e vs é), the codepoint U+200F changes the directionality of the text to right-to-left, and so the list goes on.

Here’s how you can play with codepoints in JavaScript:

// Check codepoints of characters in a string:
"A".codePointAt(0)
// 65

"⨜".codePointAt(0)
// 10780

// Create strings from codepoints:
String.fromCodePoint(25104)
// '成'

// Same thing, but using hexadecimal numbers:
String.fromCodePoint(0x6210)
// '成'

Graphemes

A grapheme is the smallest unit in a writing system – punctuation marks, Latin letters, ligatures, Chinese characters, and emojis are all graphemes.

A grapheme cluster is a sequence of Unicode codepoints that expresses a single, user-perceived character. For example, g (U+0067) combined with the combining diaeresis ̈ (U+0308) gives the grapheme cluster g̈.
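Here's a quick sketch of counting the pieces of g̈ at each level in JavaScript (the Intl.Segmenter API is available in modern browsers and Node.js):

```javascript
// "g" + combining diaeresis (U+0308): two codepoints, one user-perceived character
const cluster = "g\u0308";

const codeUnits = cluster.length;       // 2 UTF-16 code units
const codepoints = [...cluster].length; // 2 codepoints

// Intl.Segmenter splits a string into grapheme clusters:
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemes = [...segmenter.segment(cluster)].length; // 1 grapheme cluster
```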

Bytes

A computer stores text data as bytes. A single byte is a number between 0 and 255. The word BAY, represented with Unicode codepoints U+42 U+41 U+59, could be stored in a computer's memory using three bytes with those exact (hex) values: 42, 41, and 59.

Many regional text standards, such as ISO 8859-1, only define 256 codepoints – one for each possible byte value. Every character is expressed as a single byte. Unicode, however, defines more than a million codepoints. A computer could not possibly store every codepoint in a single byte, so most codepoints are represented using multiple bytes. This is why a single SMS can fit many more plain Latin letters than emojis.
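You can see the size difference directly in JavaScript – a minimal sketch using the TextEncoder API:

```javascript
// UTF-8 byte counts: one byte for an ASCII letter, four for an emoji
const encoder = new TextEncoder();

const letterBytes = encoder.encode("A").length;  // 1 byte
const emojiBytes = encoder.encode("🥵").length;  // 4 bytes
```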

Text encoding

The translation between codepoints and byte sequences is called text encoding.

The most popular Unicode encoding is UTF-8. It's not the only option, though: there's also UTF-16, UTF-32, GB18030, and others. The names may be similar, but these encodings are very different from each other. We won't explore those differences here.

Here’s an oversimplified metaphor: imagine encodings as columns in our character <=> codepoint table.

Character | Codepoint (hex) | UTF-8 bytes  | UTF-16-LE bytes
A         | 41              | 41           | 41 00
B         | 42              | 42           | 42 00
C         | 43              | 43           | 43 00
ó         | F3              | C3 B3        | F3 00
⨜         | 2A1C            | E2 A8 9C     | 1C 2A
成        | 6210            | E6 88 90     | 10 62
🥵        | 1F975           | F0 9F A5 B5  | 3E D8 75 DD

In practice, there’s more to it. Some characters can be represented in multiple ways. Also, UTF-8 has rules to recover when processing corrupted data. Again, we won’t go into all these deep details in this post.
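For instance, é can be written either as the single codepoint U+00E9 or as e followed by the combining accent U+0301. A sketch of how that looks in JavaScript (Unicode normalization is a topic for another post):

```javascript
// Two different codepoint sequences, one user-perceived character:
const precomposed = "\u00e9"; // é as a single codepoint
const combined = "e\u0301";   // e + combining acute accent

const sameString = precomposed === combined; // false: different codepoints
const sameAfterNFC =
    precomposed.normalize("NFC") === combined.normalize("NFC"); // true
```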

Here’s how you can play with text encoding in JavaScript:

// TextEncoder transforms a string to a Uint8Array
// of UTF-8 bytes
const bytes = Array.from(new TextEncoder().encode("ABCó"));
// [65, 66, 67, 195, 179]

// Convert decimal to hex:
bytes.map(n => n.toString(16))
// ['41', '42', '43', 'c3', 'b3']

Code units

A code unit is the smallest chunk an encoding uses to spell out a Unicode codepoint. In UTF-8, that chunk is 8 bits; in UTF-16, it's 16 bits.

Tangentially, JavaScript encodes every string as UTF-16, which sets up a trap:

'成'.length === 1
'🥵'.length === 2

What’s going on?

The string.length property reports how many 16-bit code units are stored in the string.

Why is that? The 🥵 codepoint, 1F975, does not fit in 16 bits, so UTF-16 encodes it as a pair of 16-bit code units – a surrogate pair – for a total of 32 bits. Tricky!
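Here's how the surrogate pair shows up in JavaScript – charCodeAt exposes the raw 16-bit code units, while codePointAt reassembles the full codepoint:

```javascript
const emoji = "🥵";

const high = emoji.charCodeAt(0).toString(16);       // 'd83e' – high surrogate
const low = emoji.charCodeAt(1).toString(16);        // 'dd75' – low surrogate
const codepoint = emoji.codePointAt(0).toString(16); // '1f975' – the codepoint
```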

If you are a JS developer trying to count printable characters, see the custom getGraphemeCount function on MDN. To count codepoints, use the string iterator [..."🥵"].length === 1.

Text decoding

Imagine some text stored in computer memory. You peek and find the following sequence of bytes:

62 6f 72 c3 b3 77 6b 61

What does this text say?

The answer is: it depends. What’s the encoding? We can’t interpret the numbers without knowing the rules. It would be like interpreting words from a language you know nothing about.

Decode these bytes as UTF-8, and you’ll get the Polish word borówka. But decode them as UTF-16, and you’ll get 潢썲瞳慫. And in EBCDIC, that byte sequence becomes Â?ÊC·Ï,/.

But suppose we decided these bytes were a 64-bit integer – then we'd read them as the decimal number 7093014122986957665. Interpreting bytes as numeric types is really the same problem: a sequence of 64 bits means different things when we view it as a long long, an unsigned long long, or a double. I wish more programming languages offered encoding-aware string types such as string<utf-8>, string<utf-16>, etc., to encourage validation and prevent accidental concatenation.
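A sketch of that numeric reinterpretation in JavaScript, using DataView over the same eight bytes (read big-endian, which is what produces the decimal number above):

```javascript
const bytes = new Uint8Array([
    0x62, 0x6f, 0x72, 0xc3,
    0xb3, 0x77, 0x6b, 0x61
]);
const view = new DataView(bytes.buffer);

const asUint64 = view.getBigUint64(0); // 7093014122986957665n (big-endian)
const asDouble = view.getFloat64(0);   // the very same bits, read as a double
```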

Here’s how you can decode that byte sequence as UTF-8 in JavaScript:

const bytes = new Uint8Array([
    0x62, 0x6f, 0x72, 0xc3,
    0xb3, 0x77, 0x6b, 0x61
]);
new TextDecoder().decode(bytes)
// borówka
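TextDecoder also accepts an encoding label, so you can reproduce the UTF-16 reading from above:

```javascript
const bytes = new Uint8Array([
    0x62, 0x6f, 0x72, 0xc3,
    0xb3, 0x77, 0x6b, 0x61
]);
const asUtf16 = new TextDecoder("utf-16le").decode(bytes);
// '潢썲瞳慫'
```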

Double text encoding = trouble!

Even on UTF-8 websites, I still occasionally see ko³o instead of koło. One way this could happen is double encoding.

The word koło can be expressed as the Unicode codepoints 6B 6F 142 6F. Encoding that as UTF-8 yields the bytes 6B 6F C5 82 6F, where the letter ł is represented by C5 82.

Now, suppose there’s a bug in the system and those bytes get interpreted as codepoints. We’re in trouble! The codepoint C5 stands for Å, and the codepoint 82 is an invisible control character. Encoding those as UTF-8 yields 6B 6F C3 85 C2 82 6F – decoded back, the word shows up as mangled koÅ o, with ł replaced by Å and an invisible control character. Yikes!
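A sketch of that bug in JavaScript – decoding is skipped and the raw UTF-8 bytes are treated as codepoints before being encoded again:

```javascript
const encoder = new TextEncoder();

// Correct UTF-8 bytes for "koło": 6b 6f c5 82 6f
const original = encoder.encode("koło");

// The bug: each byte is treated as a codepoint
// (0xC5 becomes Å, 0x82 becomes an invisible control character)...
const misread = String.fromCodePoint(...original);

// ...and the result is encoded to UTF-8 a second time:
const doubleEncoded = encoder.encode(misread);
const hex = Array.from(doubleEncoded, b => b.toString(16));
// ['6b', '6f', 'c3', '85', 'c2', '82', '6f']
```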

Closing thoughts

I hope this quick primer was fun. I plan to write a few more posts about text encoding and its gotchas as I find the topic increasingly mind-blowing. One topic I hope to explore soon is Unicode normalization—why two visually identical strings can actually differ in bytes. See you in the next one!
