A simple explanation of character encoding in Python

Special characters are a pain for programmers, and a source of bugs. Whenever your code contains an accented letter, voilà, funny-looking symbols appear; or, worse, your program crashes with an obscure error. Why does it have to be that hard!?

It doesn't. Character encoding is not that difficult once you understand the basic principles. Let's take a look.

Characters are interpretations

For a computer, there is no text, only bytes. Text is an interpretation, a way to visualize bytes in human-readable form. And if you interpret bytes wrong, you get funky-looking characters, or even an error. A character encoding is one specific way of interpreting bytes: It's a look-up table that says, for example, that a byte with the value 97 stands for 'a'.
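To make the look-up idea concrete, here is a minimal sketch. It assumes Python 3, where a bytes object plays the role that str plays in Python 2 (see Footnote 1); the variable names are made up for this example.

```python
# Python 3 sketch: bytes are just numbers until you pick an encoding.
raw = bytes([97])                # one byte with the value 97
as_numbers = list(raw)           # the computer's view: a list of numbers
as_text = raw.decode('ascii')    # the ascii look-up table turns 97 into 'a'

print(as_numbers, as_text)
```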

Now let's consider these three bytes:

195-167-97

In Python 2 (see Footnote 1), this is a str object: a series of bytes without any information about how they should be interpreted. Don't let the name confuse you: A str is not a string of characters; rather, it's a series of bytes that you can interpret as a string of characters. But the proper interpretation requires the proper encoding, and the problem is: A str doesn't know its own encoding!

Now let's turn to actual Python:

# Mystery string!
byte_string = chr(195) + chr(167) + chr(97)

We have three bytes (195-167-97), but no idea what they stand for. How can we decipher these bytes into actual text? Well, we need an encoding! Let's start by trying the most basic encoding: ascii. We can enforce this encoding using the decode() function. Technically, decode() turns a str object into a unicode object. I will explain unicode objects later; for now, just think of decode() as a way to interpret bytes using a specific encoding.

byte_string = chr(195) + chr(167) + chr(97)
unicode_string = byte_string.decode('ascii') # Gives UnicodeDecodeError

Error! Error! So what went wrong here?

Well, ascii has no characters that correspond to the bytes 195 and 167; ascii only has 128 characters, that is, those between 0 and 127. These 128 characters essentially correspond to the Latin alphabet plus some punctuation characters—all you need if you're an Anglo-Saxon with a blatant disregard for the rest of the world!

You now know where encoding errors (of the crashing kind) generally come from: They result from trying to interpret bytes with an encoding (usually ascii) that doesn't define all the bytes that you want to interpret.
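If you want to see exactly which byte trips up ascii, you can inspect the exception itself. This is a small sketch that assumes Python 3 (so bytes instead of Python 2's str); start, object and encoding are standard attributes of UnicodeDecodeError, the other names are just for illustration.

```python
# Python 3 sketch: find out which byte ascii refuses to interpret.
byte_string = bytes([195, 167, 97])

error = None
try:
    byte_string.decode('ascii')
except UnicodeDecodeError as exc:
    error = exc  # keep a reference; 'exc' itself disappears after the block

# Position of the first offending byte, and its numeric value
bad_position = error.start
bad_value = error.object[bad_position]

print(error.encoding, bad_position, bad_value)
```

The offending value is above 127, which is exactly the range that ascii doesn't define.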

Now let's try another encoding: latin-1. According to latin-1, 195 stands for 'Ã', 167 stands for '§', and 97 stands for 'a'.

byte_string = chr(195) + chr(167) + chr(97)
unicode_string = byte_string.decode('latin-1')
print(unicode_string) # prints: Ã§a

No error! That's progress, but things still don't make much sense: 'Ã§a' is funky, and not meaningful text. This suggests that, even though our program didn't crash, we still haven't found the correct encoding.

You now know where funky symbols generally come from: They result from interpreting bytes with an encoding that seems to fit (i.e. it doesn't crash) but really doesn't.
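The reason latin-1 "seems to fit" so often is that it assigns a character to every possible byte value, so decoding with it never crashes, whatever the bytes really meant. A quick check, sketched in Python 3 with made-up variable names:

```python
# Python 3 sketch: latin-1 accepts every byte from 0 to 255.
all_bytes = bytes(range(256))
text = all_bytes.decode('latin-1')   # never raises UnicodeDecodeError

# Each byte simply becomes the character with the same number.
same_numbers = [ord(ch) for ch in text] == list(all_bytes)

print(len(text), same_numbers)
```

So a successful latin-1 decode tells you nothing about whether latin-1 was the right choice.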

Now let's try yet another encoding: utf-8. According to utf-8, 195 and 167 together make up a 'ç', and 97 makes up 'a'. So utf-8 and latin-1 (and even ascii) agree on the 'a', which is a standard character without accents; but they disagree on the 'Ã§' / 'ç', which are 'special' characters. This is typical, and explains why you only get encoding issues with special characters: That's where the different encodings disagree.

So what does utf-8 give us?

byte_string = chr(195) + chr(167) + chr(97)
unicode_string = byte_string.decode('utf-8')
print(unicode_string) # prints: ça

Now things start to make sense: 'ça' is the French word for 'that'. So utf-8 seems to be the correct encoding. We couldn't possibly have known this in advance; we only found out because we tried a few different encodings, and decided that utf-8 makes most sense. And we could still be wrong; after all, two characters is not a lot of data to base a conclusion on.
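The three attempts can also be put side by side. The sketch below assumes Python 3 (bytes instead of Python 2's str); try_decode is a hypothetical helper written for this example, not something built into Python.

```python
# Python 3 sketch: the same three bytes under three encodings.
byte_string = bytes([195, 167, 97])

def try_decode(data, encoding):
    """Return the decoded text, or None if the encoding rejects the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

results = {
    encoding: try_decode(byte_string, encoding)
    for encoding in ('ascii', 'latin-1', 'utf-8')
}

# Print only the number of characters each interpretation yields.
for encoding, text in results.items():
    print(encoding, None if text is None else len(text))
```

ascii gives up, latin-1 sees three characters, and utf-8 sees two, because it reads 195 and 167 as a single character.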

There is no foolproof way to find out how a series of bytes is encoded. Someone either needs to tell you ("I just sent you a text file, and it's utf-8 encoded."), or you need to figure it out based on various clues: which encodings are common, which encodings give sensible results, etc.

So how can it be that text editors often correctly display text files with accents and other special characters? What do text editors know that we don't? The answer is: nothing, text editors just make educated guesses. Some text editors do what we just did: They try a few different encodings, and see which one works best, for example by excluding encodings that result in unusual characters. Other text editors simply have a default encoding, and if this doesn't fit: ¡Qué pena! Yet other text editors ask politely which encoding you want to use. But, at the end of the day, text editors are in the same position as we are: They cannot possibly know the encoding of a text file for sure, because that information is simply not part of the bytes that make up the text file.
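The "try a few encodings" strategy is easy to sketch. The version below assumes Python 3; guess_decode and its default list of candidates are hypothetical and only illustrate the idea. Real editors use more refined heuristics, and any guess of this kind can be wrong.

```python
# Python 3 sketch of an educated guess: take the first encoding that fits.
def guess_decode(data, candidates=('ascii', 'utf-8', 'latin-1')):
    """Return (text, encoding) for the first candidate that doesn't crash."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this encoding doesn't fit, try the next one
    raise ValueError('none of the candidate encodings fit')

text_1, encoding_1 = guess_decode(bytes([97, 98, 99]))    # plain ascii
text_2, encoding_2 = guess_decode(bytes([195, 167, 97]))  # our mystery bytes
text_3, encoding_3 = guess_decode(bytes([233]))           # not valid utf-8

print(encoding_1, encoding_2, encoding_3)
```

The order of the candidates matters: latin-1 goes last because it never fails, so anything listed after it would never be tried.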

Meet unicode

By now you might be wondering: Why doesn't text contain some information about which encoding is used? Well ... that's exactly where Python's unicode object comes in: unicode allows you to use text without worrying about encoding. Of course, unicode objects are still bytes deep down, but these bytes are unambiguously linked to characters; unlike str objects, unicode objects know their own encoding. (See Footnote 2 for a discussion of the various uses of the term 'unicode'.)

So unicode objects are nice, and you should use them. But how? Well, if you include text strings directly in your code, you can simply turn them into unicode by prefixing a u. You should always do that, unless there is a clear reason not to.

unicode_string = u'نا أحب يونيكود'

In many cases, text enters your program as a str object, for example when you read text from a file. In that case, you should turn the incoming str into unicode as soon as possible, using decode(). But—and here's the catch—you need to specify the encoding. There's never any way of getting around that.

So let's see how you read a text file, and, assuming that the text file is utf-8 encoded, safely turn it into a unicode object:

with open(u'some.txt') as fd:
    byte_string = fd.read() # Dangerous ...
unicode_string = byte_string.decode(u'utf-8') # Safe!

The inverse also happens: Text may leave your program as a str object, for example when you write it to a file. In that case, you should turn the unicode into a str at the very last moment, using encode(). Again, you need to specify how you want your text to be encoded when it leaves your program.

So let's see how you can write to a utf-8-encoded text file:

byte_string = unicode_string.encode(u'utf-8')
with open(u'some.txt', u'w') as fd:
    fd.write(byte_string)

So the basic tricks are:

  1. Know the encoding of str objects that come into your program.
  2. Decode str to unicode as soon as possible (with decode()).
  3. Internally, use only unicode objects in your program.
  4. Know the encoding of str objects that leave your program.
  5. Encode unicode to str at the last moment (with encode()).
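The five steps above can be sketched as one round trip. This version assumes Python 3, where bytes takes the place of Python 2's str and str takes the place of unicode (see Footnote 1); write_text and read_text are hypothetical helpers, and the file is opened in binary mode so that encoding and decoding stay explicit.

```python
# Python 3 sketch: encode at the last moment, decode as soon as possible.
import os
import tempfile

def write_text(path, text, encoding):
    """Encode text to bytes right before it leaves the program."""
    byte_string = text.encode(encoding)
    with open(path, 'wb') as fd:
        fd.write(byte_string)

def read_text(path, encoding):
    """Decode bytes to text right after they enter the program."""
    with open(path, 'rb') as fd:
        byte_string = fd.read()
    return byte_string.decode(encoding)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'some.txt')
    write_text(path, '\u00e7a', 'utf-8')   # '\u00e7' is the character 'ç'
    with open(path, 'rb') as fd:
        raw = fd.read()                    # the bytes that are really on disk
    round_trip = read_text(path, 'utf-8')

print(list(raw))
```

The bytes on disk are the familiar 195-167-97. In Python 3 you can also let open() do the work by passing it an encoding argument, but you still have to know which encoding to pass.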

Understanding common errors

Let's consider three ways of concatenating two chunks of text:

unicode_string = u'tést' + u'tést' # Doesn't crash
byte_string = 'tést' + 'tést' # Doesn't crash
unicode_string = u'tést' + 'tést' # Results in UnicodeDecodeError

So: You can concatenate two unicode objects; and you can concatenate two str objects; but you cannot concatenate a unicode and a str object! (If there are special characters involved.) Why not?

The reason is simple: To concatenate a str and a unicode, the str needs to be converted to unicode. And, because no encoding is specified for this conversion, Python falls back to ascii, which doesn't work with special characters!

So this:

unicode_string = u'tést' + 'tést'

... is equivalent to this:

unicode_string = u'tést' + 'tést'.decode('ascii')

The solution, of course, is to explicitly (and correctly) indicate the encoding of the str object:

unicode_string = u'tést' + 'tést'.decode('utf-8')

Python automatically converts types in many situations. This is often useful, for example when you're adding an int to a float number. But if you have a mix of unicode and str objects, these implicit conversions often result in UnicodeDecodeErrors and UnicodeEncodeErrors.

This is why you should be consistent and use unicode everywhere: Don't change a single str into a unicode and hope that all will be well. This may just make things worse, because you're mixing str and unicode!
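You can make the hidden conversion visible by spelling it out. The sketch below assumes Python 3, which refuses to mix text and bytes at all; py2_style_concat is a hypothetical function that imitates the implicit ascii step described above for Python 2.

```python
# Python 3 sketch: imitate Python 2's implicit ascii decoding.
def py2_style_concat(unicode_part, byte_part):
    """Decode the bytes as ascii (as Python 2 does silently), then concatenate."""
    return unicode_part + byte_part.decode('ascii')

# Without special characters, the implicit conversion goes unnoticed.
plain = py2_style_concat('test', b'test')

# With special characters, the ascii step blows up.
accented_bytes = 't\u00e9st'.encode('utf-8')   # '\u00e9' is the character 'é'
try:
    py2_style_concat('t\u00e9st', accented_bytes)
    crashed = False
except UnicodeDecodeError:
    crashed = True

# Python 3 itself doesn't guess: mixing text and bytes is a TypeError.
try:
    't\u00e9st' + accented_bytes
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# The fix is the same in both versions: decode explicitly and correctly.
fixed = 't\u00e9st' + accented_bytes.decode('utf-8')

print(plain, crashed, mixing_allowed, len(fixed))
```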

Another common mistake is the following:

byte_string = 'tést'
byte_string = byte_string.encode('utf-8') # Results in UnicodeDecodeError

People often write code like this when they have a gut feeling that a str should be utf-8 encoded, but don't really grasp the underlying mechanics. What happens here is the following: First, byte_string is implicitly decoded to unicode, using an ascii encoding; next, the unicode is encoded back to a str, using utf-8. In other words, the code is equivalent to:

byte_string = 'tést'
byte_string = byte_string.decode('ascii').encode('utf-8')

And what fails is the decode('ascii') part, because the 'é' cannot be decoded using ascii. I personally find it misleading that str objects even have an encode() function, and that unicode objects have a decode() function. But just remember the following rule:

  • use decode() only to go from str to unicode; and
  • use encode() only to go from unicode to str.
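For what it's worth, Python 3 enforces this rule for you: the misleading methods are simply gone. A quick check, assuming Python 3, where bytes corresponds to Python 2's str and str corresponds to unicode (see Footnote 1):

```python
# Python 3 sketch: only the sensible direction exists for each type.
bytes_can_decode = hasattr(bytes, 'decode')   # bytes -> text: allowed
text_can_encode = hasattr(str, 'encode')      # text -> bytes: allowed
bytes_can_encode = hasattr(bytes, 'encode')   # bytes -> bytes: gone
text_can_decode = hasattr(str, 'decode')      # text -> text: gone

print(bytes_can_decode, text_can_encode, bytes_can_encode, text_can_decode)
```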

  1. In this post, I focus on Python 2, but in Python 3 the situation is similar. The main difference is that Python-3 str objects are mostly equivalent to Python-2 unicode objects, and Python-3 bytes objects are mostly equivalent to Python-2 str objects. The Python-3 names make more sense. But the fact that str is used for fundamentally different types of objects in Python 2 and 3 is insane. 

  2. There is a lot of confusion about what Unicode means, and how it relates to UTF-8. But it's quite simple. In Python, unicode is an object type that is used to unambiguously represent text—but this is a Python-specific use of the term. More generally, Unicode is a standardized look-up table that links values to characters; for example, Unicode says that the value 233 stands for 'é'. UTF-8 is an implementation of Unicode. There are many technically different ways in which the Unicode look-up table can be implemented, and UTF-8 is one of those. So you could say that Unicode is the conceptual part of the encoding, and UTF-8 (and other implementations) is the technical part. Therefore, it doesn't make sense to say that a text file is Unicode-encoded--that's too vague. You need to know exactly how it's encoded, that is, which technical implementation of Unicode is used (usually UTF-8).
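     The difference between the Unicode look-up table and its implementations is easy to see in code. A minimal sketch, assuming Python 3; the variable names are made up for this example.

```python
# Python 3 sketch: one Unicode value, several technical implementations.
char = '\u00e9'                        # the character 'é'
code_point = ord(char)                 # its value in the Unicode table

utf8_bytes = char.encode('utf-8')      # two bytes
utf16_bytes = char.encode('utf-16-le') # two different bytes
utf32_bytes = char.encode('utf-32-le') # four bytes

print(code_point, list(utf8_bytes), list(utf16_bytes), list(utf32_bytes))
```

     The Unicode value is the same every time; only the bytes that end up in the file differ, which is why "it's Unicode" doesn't tell you how to decode a file.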