The joy of Unicode

Author

Clayton Cafiero

Published

2025-06-24

Here we explain a little bit about Unicode and why we may encounter UnicodeDecodeError or UnicodeEncodeError exceptions.

There are many different ways to encode letters and symbols. For example, let’s go back to Chapter 2, where we saw how the letter 'A' is encoded (see: ?@fig-int_as_bitstring). There we saw that the letter 'A' is encoded in binary as 01000001, which is numerically equivalent to decimal 65. We can confirm this in the shell:

>>> ord('A')
65

Here’s what wasn’t said in Chapter 2.

Python uses Unicode (UTF-8) encoding by default

Python, by default since 2008, uses what’s called UTF-8 encoding. That’s short for Unicode Transformation Format—8-bit, which is a mouthful. UTF-8 can encode all 1,112,064 valid Unicode code points. With Unicode we can write, display, and typeset symbols from different writing systems: alphabets, pictograms, mathematical symbols, musical notes, and even emoji. Unicode supports English, Chinese, Hindi, Arabic, Russian, and hundreds of other languages and writing systems. It even supports ancient writing systems such as cuneiform and Egyptian hieroglyphs. (For more, see: https://home.unicode.org/)
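To get a feel for this breadth, we can look up a few code points with the built-in chr() function, which is the inverse of ord():

```python
# chr() maps a Unicode code point to its character; ord() goes the other way.
print(chr(65))       # Latin capital 'A'
print(chr(0x0416))   # Cyrillic capital 'Ж'
print(chr(0x4E2D))   # CJK character '中'
print(chr(0x1F600))  # grinning face emoji
print(ord('A'))      # back to 65
```

All of these work out of the box, because every Python string is a sequence of Unicode characters.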

There’s more than one encoding system

While much of the world runs on UTF-8 these days, it’s not the only encoding system out there. Other systems in use include ISO-8859-1, Windows-1252, UTF-16, UTF-32, GB2312, Shift_JIS, GBK, EUC-KR, and Big5. Another term for encoding system is code page (think of a sheet of paper with all the characters represented in a given system along with their encoding in that system). Yikes!

So when I said the letter 'A' is encoded as the binary representation of decimal 65, that’s true, but more specifically 'A' is encoded as 65 in the UTF-8 encoding system. Now, many encoding systems use the same values for the letters of the English alphabet. The reason for this is historical—designers of these encoding systems wanted to maintain backward compatibility with the ASCII (American Standard Code for Information Interchange) standard, first published in 1963. But in the case of many other symbols, encoding systems may disagree about how a given symbol is encoded. Indeed, a symbol that can be represented just fine in one system may have no encoding at all in another.

Here’s an example. Consider the string "Henderson’s Café". This contains a “curly” apostrophe, '’', and an accented symbol, 'é'. In UTF-8 these are encoded as 8217 and 233 (decimal), respectively. We refer to these values as code points. So how could we encode these in ASCII? We can’t! There is no symbol '’' in the ASCII code page. There is no symbol 'é' in the ASCII code page. In fact, ASCII has only 128 code points in total, whereas UTF-8 has 1,112,064. So the vast majority of symbols in UTF-8 can’t be encoded in ASCII at all.
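We can confirm those two code points in the shell with ord():

```python
# ord() reports the Unicode code point of a single character
print(ord('’'))   # the curly apostrophe: 8217
print(ord('é'))   # the accented e: 233
```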

If we try to encode a symbol using a code page that does not support that symbol we get an encoding error.

>>> s = "Henderson’s Café"
>>> s.encode('ascii')
Traceback (most recent call last):
  File "<python-input-4>", line 1, in <module>
    s.encode('ascii')
    ~~~~~~~~^^^^^^^^^
UnicodeEncodeError: 'ascii' codec can't encode character 
     '\u2019' in position 9: ordinal not in range(128)

Here it’s choking on the curly quote (notice it says position 9 and Python starts counting at zero).
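We can verify that position 9 really is the curly apostrophe by indexing into the string:

```python
s = "Henderson’s Café"
print(s[9])          # the offending character: '’'
print(s.index('’'))  # 9, matching the error message
```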

“Hey, wait a minute,” you might think. “You said that the curly apostrophe is at code point 8217 in UTF-8, but the error message says \u2019. What’s up with that?” Let’s try printing that:

>>> print('\u2019')
’

So this prints a curly apostrophe. But what is \u2019? Within a string, \u introduces a Unicode escape sequence: the four digits that follow give the Unicode code point in hexadecimal notation. Hexadecimal is a base-16 number format, and it is quite common in computing. So 2019 (hexadecimal) is equivalent to 8217 (decimal). Here’s confirmation:

>>> int("2019", 16)   # interpret as base-16
8217

(By default, the int constructor interprets strings as base-10. Here we provide an additional argument, specifying the radix (base) to be used.)
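Going the other direction, chr() turns a code point back into its character, and Python’s hexadecimal literals (prefix 0x) let us write the code point in hex directly:

```python
print(chr(8217))        # '’', from the decimal code point
print(chr(0x2019))      # same character, code point written in hex
print(0x2019 == 8217)   # True: two notations for one number
```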

Now let’s take a look at a relatively modern encoding system, Windows-1252, which, like UTF-8, dates from the 1990s. While it’s not as prevalent as UTF-8 (for example, under 1.5% of all web pages worldwide are encoded in this system), it is still in use.

Here we’ll encode the string "Henderson’s Café" using two different encodings: UTF-8 and Windows-1252.

>>> s = "Henderson’s Café"
>>> s.encode("utf-8")
b'Henderson\xe2\x80\x99s Caf\xc3\xa9'
>>> s.encode("cp1252")
b'Henderson\x92s Caf\xe9'

Notice the results—the encodings—are different. But, oh gosh, what now? What’s up with the b and the \x?

Let’s start with the b. A string literal prefixed with b is interpreted as bytes. (This is a new type we’ve not seen before: the bytes type.) What’s a byte? Typically eight bits (binary digits). So if we have eight bits, we can represent numbers from 0 up to 255, because 2^8 = 256 and we start at zero.
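One consequence worth knowing: indexing a bytes object gives you the integer value of that byte, not a one-character string.

```python
data = "Café".encode("utf-8")
print(data)        # b'Caf\xc3\xa9'
print(data[0])     # 67, the byte value for 'C'
print(list(data))  # [67, 97, 102, 195, 169] -- 'é' takes two bytes
```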

\x indicates that the following two digits are to be interpreted as hexadecimal. However, when encoding a symbol like '’', which has a Unicode code point of 2019 (hex) or equivalently 8217 (decimal), we need more than one byte (because a single byte only goes up to 255). So we need a way of indicating how many bytes a symbol occupies and which bytes continue a multi-byte sequence. (I know, I know, you’re thinking “All this for a curly apostrophe?” Alas, yes.) For example, let’s unpack \xe2\x80\x99 within that bytestring. Each of these is a hexadecimal number: e2 80 99. Now let’s look at these as binary 8-bit bytes:

>>> f"{ord(b'\xe2'):08b}"  # format specifier for 8-bit binary
'11100010'
>>> f"{ord(b'\x80'):08b}"
'10000000'
>>> f"{ord(b'\x99'):08b}"
'10011001'

But not all of these bits are character bits. When representing multi-byte symbols there are prefixes that indicate how to interpret them.

Prefix     Meaning
0xxxxxxx   One-byte (7-bit) character (ASCII)
110xxxxx   Start of a 2-byte sequence
1110xxxx   Start of a 3-byte sequence
11110xxx   Start of a 4-byte sequence
10xxxxxx   Continuation of a multi-byte sequence
(I know, I know, you’re thinking “Seriously? All this for a curly apostrophe?” Alas, yes.)
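We can apply this table mechanically. Here’s a small sketch (the describe function is mine, not part of Python’s library) that labels each byte of a UTF-8 encoding by its leading bits:

```python
def describe(byte):
    """Classify a UTF-8 byte by its leading-bit prefix."""
    if byte < 0b10000000:
        return "1-byte (ASCII)"
    if byte < 0b11000000:
        return "continuation"
    if byte < 0b11100000:
        return "start of 2-byte sequence"
    if byte < 0b11110000:
        return "start of 3-byte sequence"
    return "start of 4-byte sequence"

for b in "’".encode("utf-8"):
    print(f"{b:08b}  {describe(b)}")
```

Running this shows one start byte (a 3-byte sequence) followed by two continuation bytes, exactly as the table predicts.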

Let’s go back to those three bitstrings. \xe2\x80\x99 is equivalent to 11100010 10000000 10011001. Notice that the first byte begins with 1110; that’s the prefix indicating the start of a 3-byte sequence. The remaining character bits (sometimes called “payload bits”) in that byte are 0010. The second byte, 10000000, starts with 10, indicating the continuation of a multi-byte sequence, and the remaining character bits are 000000. The third byte, 10011001, starts with 10—another continuation, with remaining character bits 011001. Now we take all those character bits and concatenate them to get 0010000000011001. What is that in decimal?

>>> int('0010000000011001', 2)   # interpret as base-2
8217

Voilà! 8217 as expected. Now what is this in hexadecimal?

>>> hex(8217)   # hex() gives the hexadecimal representation
'0x2019'

Boom! There’s the “2019” in '\u2019'. That’s the Unicode code point for '’'.
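The same concatenation can be done with Python’s bitwise operators: mask off the prefix bits of each byte, shift the payload bits into place, and combine them.

```python
b1, b2, b3 = 0xE2, 0x80, 0x99  # the three UTF-8 bytes for '’'

# Keep 4 payload bits from the start byte, 6 from each continuation byte
code_point = ((b1 & 0b00001111) << 12) \
           | ((b2 & 0b00111111) << 6) \
           | (b3 & 0b00111111)

print(code_point)       # 8217
print(hex(code_point))  # 0x2019
print(chr(code_point))  # '’'
```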

Now let’s look at the same symbol in the Windows-1252 encoding: just \x92, which is decimal 146.

>>> int("92", 16)
146

So in the case of UTF-8, the curly apostrophe, '’', is encoded with three bytes, whereas Windows-1252 encodes it with a single byte, with a very different value (at this point, you should see where this is heading).
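The difference in length is easy to check directly:

```python
s = "’"
print(len(s.encode("utf-8")))    # 3 bytes in UTF-8
print(len(s.encode("cp1252")))   # 1 byte in Windows-1252
```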

>>> s = "Henderson’s Café"
>>> utf_encoded = s.encode("utf-8")
>>> win1252_encoded = s.encode("cp1252")

Now let’s try to decode the latter without specifying an encoding.

>>> win1252_encoded.decode()
Traceback (most recent call last):
  File "<python-input-26>", line 1, in <module>
    win1252_encoded.decode()
    ~~~~~~~~~~~~~~~~~~~~~~^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 
    0x92 in position 9: invalid start byte

Why? Because the default encoding for decode() is UTF-8, and the bytestring that we got from encoding in Windows-1252 isn’t valid UTF-8. Now, you may ask, “Surely code point 146 is a valid code point in UTF-8. Why does this fail?” Great question. Let’s look at hexadecimal 92 (decimal 146) in binary: 10010010. Now compare that with the prefixes in the table above. This starts with 10, which indicates continuation of a multi-byte sequence in UTF-8, but there was no start byte for any multi-byte sequence! That’s why this is invalid, and that’s why we get a UnicodeDecodeError!

Seriously? Does it really work this way? Yup.
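Decoding works fine once we tell Python which code page the bytes came from:

```python
s = "Henderson’s Café"
win1252_encoded = s.encode("cp1252")

# Decoding with the matching codec round-trips the string exactly
print(win1252_encoded.decode("cp1252"))
```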

When do we encounter this kind of error? A common case is trying to read a file that was saved with Windows-1252 encoding without specifying that encoding. Here’s a demonstration:

with open("test.txt", 'w', encoding='cp1252') as fh:
    fh.write("Henderson’s Café")
    
with open("test.txt") as fh:
    s = fh.read()

Assuming the default file encoding is UTF-8 (when no encoding is specified, open() uses the locale’s preferred encoding, which on most modern systems is UTF-8), this fails with

Traceback (most recent call last):
  File "<string>", line 5, in <module>
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 
     0x92 in position 9: invalid start byte

That’s one example.

How do we fix this? We specify the encoding used when opening the file for reading.

with open("test.txt", 'w', encoding='cp1252') as fh:
    fh.write("Henderson’s Café")
    
with open("test.txt", encoding='cp1252') as fh:
    s = fh.read()
    
print(s)

This prints “Henderson’s Café” with curly apostrophe and accented character as expected.
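One last note: if we truly can’t avoid a lossy encoding, encode() and decode() accept an errors parameter that trades fidelity for robustness:

```python
s = "Henderson’s Café"

# 'replace' substitutes '?' for unencodable characters; 'ignore' drops them
print(s.encode("ascii", errors="replace"))  # b'Henderson?s Caf?'
print(s.encode("ascii", errors="ignore"))   # b'Hendersons Caf'
```

The default, errors='strict', is what raises the UnicodeEncodeError we saw earlier, and it is usually what you want: silent data loss is worse than a loud exception.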