Special characters are a pain for programmers, and a source of bugs. Whenever your code contains an accented letter, voil├Ā, funny-looking symbols appear; or, worse, your program crashes with an obscure error. Why does it have to be that hard!?
It doesn't. Character encoding is not that difficult once you understand the basic principles. Let's take a look.
Characters are interpretations
For a computer, there is no text, only bytes. Text is an interpretation, a way to visualize bytes in human-readable form. And if you interpret bytes wrong, you get funky-looking characters, or even an error. A character encoding is one specific way of interpreting bytes: It's a look-up table that says, for example, that a byte with the value 97 stands for 'a'.
Now let's consider these three bytes:
195 167 97
In Python 2 (see Footnote 1), this is a str object: a series of bytes without any information about how they should be interpreted. Don't let the name confuse you: A str is not a string of characters; rather, it's a series of bytes that you can interpret as a string of characters. But the proper interpretation requires the proper encoding, and the problem is: A str doesn't know its own encoding!
Now let's turn to actual Python:
# Mystery string!
byte_string = chr(195) + chr(167) + chr(97)
We have three bytes (195-167-97), but no idea what they stand for. How can we decipher these bytes into actual text? Well, we need an encoding! Let's start by trying the most basic encoding: ascii. We can enforce this encoding using the decode() function. Technically, decode() turns a str object into a unicode object. I will explain unicode objects later; for now, just think of decode() as a way to interpret bytes using a specific encoding.
byte_string = chr(195) + chr(167) + chr(97)
unicode_string = byte_string.decode('ascii')  # Gives UnicodeDecodeError
Error! Error! So what went wrong here?
Well, ascii has no characters that correspond to the bytes 195 and 167; ascii only has 128 characters, that is, those between 0 and 127. These 128 characters essentially correspond to the Latin alphabet plus some punctuation characters—all you need if you're an Anglo-Saxon with a blatant disregard for the rest of the world!
You now know where encoding errors (of the crashing kind) generally come from: They result from trying to interpret bytes with an encoding (usually ascii) that doesn't define all the bytes that you want to interpret.
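Here is a minimal sketch of that failure. I'm using Python 3 for it, where (as Footnote 1 explains) a bytes object plays the role of the Python-2 str; the variable names are just for illustration.

```python
# Python 3 sketch: bytes here plays the role of the Python-2 str
byte_string = bytes([195, 167, 97])

# ascii only defines the values 0 through 127
undefined_in_ascii = [value for value in byte_string if value > 127]

try:
    byte_string.decode('ascii')
    outcome = 'decoded'
except UnicodeDecodeError as error:
    # error.start is the index of the first byte that ascii cannot interpret
    outcome = 'failed at byte index {}'.format(error.start)

print(undefined_in_ascii)  # [195, 167]
print(outcome)             # failed at byte index 0
```

The error points at the very first byte, because 195 is already outside the range that ascii defines.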
Now let's try another encoding: latin-1. According to latin-1, 195 stands for 'Ã', 167 stands for '§', and 97 stands for 'a'.
byte_string = chr(195) + chr(167) + chr(97)
unicode_string = byte_string.decode('latin-1')
print(unicode_string)  # prints: Ã§a
No error! That's progress, but things still don't make much sense: 'Ã§a' is funky, and not meaningful text. This suggests that, even though our program didn't crash, we still haven't found the correct encoding.
You now know where funky symbols generally come from: They result from interpreting bytes with an encoding that seems to fit (i.e. it doesn't crash) but really doesn't.
Now let's try yet another encoding: utf-8. According to utf-8, 195 and 167 together make up a 'ç', and 97 makes up 'a'. So utf-8 and latin-1 (and even ascii) agree on the 'a', which is a standard character without accents; but they disagree on the 'ç' / 'Ã§', which are 'special' characters. This is typical, and explains why you only get encoding issues with special characters: That's where the different encodings disagree.
So what does utf-8 give us?
byte_string = chr(195) + chr(167) + chr(97)
unicode_string = byte_string.decode('utf-8')
print(unicode_string)  # prints: ça
Now things start to make sense: 'ça' is the French word for 'that'. So utf-8 seems to be the correct encoding. We couldn't possibly have known this in advance; we only found out because we tried a few different encodings, and decided that utf-8 makes most sense. And we could still be wrong; after all, two characters is not a lot of data to base a conclusion on.
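To see the three interpretations side by side, here is a small sketch in Python 3 (where bytes stands in for the Python-2 str, see Footnote 1). The results dictionary is only there for illustration.

```python
# Python 3 sketch: the same three bytes, interpreted three ways
byte_string = bytes([195, 167, 97])

results = {}
for encoding in ('ascii', 'latin-1', 'utf-8'):
    try:
        results[encoding] = byte_string.decode(encoding)
    except UnicodeDecodeError:
        results[encoding] = None  # this encoding cannot interpret the bytes

for encoding, text in results.items():
    print(encoding, repr(text))
# ascii None
# latin-1 'Ã§a'
# utf-8 'ça'
```

One crash, one funky result, and one result that looks like real text: the bytes never changed, only the interpretation did.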
There is no foolproof way to find out how a series of bytes is encoded. Someone either needs to tell you ("I just sent you a text file, and it's utf-8 encoded."), or you need to figure it out based on various clues: which encodings are common, which encodings give sensible results, etc.
So how can it be that text editors often correctly display text files with accents and other special characters? What do text editors know that we don't? The answer is: nothing, text editors just make educated guesses. Some text editors do what we just did: They try a few different encodings, and see which one works best, for example by excluding encodings that result in unusual characters. Other text editors simply have a default encoding, and if this doesn't fit: Â¡QuÃ© pena! Yet other text editors ask politely which encoding you want to use. But, at the end of the day, text editors are in the same position as we are: They cannot possibly know the encoding of a text file for sure, because that information is simply not part of the bytes that make up the text file.
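The "try a few encodings" strategy could be sketched roughly like this, in Python 3. The function name guess_encoding and the list of candidates are my own assumptions for illustration; this is not how any particular text editor works.

```python
def guess_encoding(byte_string, candidates=('ascii', 'utf-8', 'latin-1')):
    """Return the first candidate that decodes without error, or None.

    This is only an educated guess: decoding without an error does not
    prove that the encoding is the right one.
    """
    for encoding in candidates:
        try:
            byte_string.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


print(guess_encoding(bytes([195, 167, 97])))  # utf-8
print(guess_encoding(b'abc'))                 # ascii
print(guess_encoding(bytes([233])))           # latin-1
```

The order of the candidates matters. latin-1 assigns a character to every possible byte value, so it never fails and has to come last, as a fallback. That is also exactly why it can silently give you funky symbols.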
By now you might be wondering: Why doesn't text contain some information about which encoding is used? Well ... that's exactly where Python's unicode object comes in: unicode allows you to use text without worrying about encoding. Of course, unicode objects are still bytes deep down, but these bytes are unambiguously linked to characters; unlike str objects, unicode objects know their own encoding. (See Footnote 2 for a discussion of the various uses of the term 'unicode'.)
unicode objects are nice, and you should use them. But how? Well, if you include text strings directly in your code, you can simply turn them into unicode by prefixing a u. You should always do that, unless there is a clear reason not to.
unicode_string = u'نا أحب يونيكود'
In many cases, text enters your program as a str object, for example when you read text from a file. In that case, you should turn the incoming str into unicode as soon as possible, using decode(). But, and here's the catch, you need to specify the encoding. There's never any way of getting around that.
So let's see how you read a text file, and, assuming that the text file is utf-8 encoded, safely turn it into a unicode object:
with open(u'some.txt') as fd:
    byte_string = fd.read()  # Dangerous ...
unicode_string = byte_string.decode(u'utf-8')  # Safe!
The inverse also happens: Text may leave your program as a str object, for example when you write it to a file. In that case, you should turn the unicode into a str at the very last moment, using encode(). Again, you need to specify how you want your text to be encoded when it leaves your program.
So let's see how you can write to a utf-8-encoded text file:
byte_string = unicode_string.encode(u'utf-8')
with open(u'some.txt', u'w') as fd:
    fd.write(byte_string)
So the basic tricks are:
- Know the encoding of str objects that come into your program.
- Convert str to unicode as soon as possible (with decode()).
- Internally, use only unicode objects in your program.
- Know the encoding of str objects that leave your program.
- Convert unicode to str at the last moment (with encode()).
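Put together, these tricks might look something like the sketch below. It's written for Python 3, so bytes takes the place of the Python-2 str and str takes the place of unicode (see Footnote 1). The helper names read_text and write_text, and the temporary file, are assumptions for the sake of the example.

```python
import os
import tempfile


def read_text(path, encoding):
    """Read bytes from disk and decode them right away."""
    with open(path, 'rb') as fd:
        return fd.read().decode(encoding)


def write_text(path, text, encoding):
    """Encode text at the last moment, just before it is written."""
    with open(path, 'wb') as fd:
        fd.write(text.encode(encoding))


# Hypothetical file, created in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'some.txt')
write_text(path, 'ça', 'utf-8')

text = read_text(path, 'utf-8')  # bytes become text at the border
shouted = text.upper()           # inside the program: text only
print(shouted)                   # ÇA
```

Opening the file in binary mode keeps the bytes visible, so the encoding has to be named explicitly at both borders of the program.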
Understanding common errors
Let's consider three ways of concatenating two chunks of text:
unicode_string = u'tést' + u'tést'  # Doesn't crash
byte_string = 'tést' + 'tést'  # Doesn't crash
unicode_string = u'tést' + 'tést'  # Results in UnicodeDecodeError
So: You can concatenate two unicode objects; and you can concatenate two str objects; but you cannot concatenate a unicode and a str object! (If there are special characters involved.) Why not?
The reason is simple: To concatenate a str and a unicode object, the str needs to be converted to unicode. And, because no encoding is specified for this conversion, Python falls back to ascii, which doesn't work with special characters! In other words, this:
unicode_string = u'tést' + 'tést'
... is equivalent to this:
unicode_string = u'tést' + 'tést'.decode('ascii')
The solution, of course, is to explicitly (and correctly) indicate the encoding of the str:
unicode_string = u'tést' + 'tést'.decode('utf-8')
Python automatically converts types in many situations. This is often useful, for example when you're adding an int to a float number. But if you have a mix of unicode and str objects, these implicit conversions often result in errors such as the UnicodeDecodeError above.
This is why you should be consistent and use unicode everywhere: Don't change a single str into a unicode and hope that all will be well. This may just make things worse, because you're mixing str and unicode objects.
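For comparison, here is a sketch of the same mix in Python 3 (see Footnote 1), which does not attempt an implicit conversion at all. The variable names are illustrative.

```python
# Python 3 sketch: str is text, bytes is the equivalent of the Python-2 str
text = 'tést'
byte_string = 'tést'.encode('utf-8')  # stand-in for bytes that entered the program

try:
    mixed = text + byte_string
except TypeError:
    # Python 3 does not guess an encoding; it refuses to mix the two types
    mixed = None

# The fix is the same as before: decode explicitly, with the right encoding
fixed = text + byte_string.decode('utf-8')
print(fixed)  # tésttést
```

The error shows up regardless of whether special characters are involved, so the mistake is harder to overlook.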
Another common mistake is the following:
byte_string = 'tést'
byte_string = byte_string.encode('utf-8')  # Results in UnicodeDecodeError
People often write code like this when they have a gut feeling that a str should be utf-8 encoded, but don't really grasp the underlying mechanics. What happens here is the following: First, byte_string is implicitly decoded to unicode, using an ascii encoding; next, the unicode is encoded back to a str, using utf-8. In other words, the code is equivalent to:
byte_string = 'tést'
byte_string = byte_string.decode('ascii').encode('utf-8')
And what fails is the decode('ascii') part, because the 'é' cannot be decoded using ascii. I personally find it misleading that str objects even have an encode() function, and that unicode objects have a decode() function. But just remember the following rule:
- Use decode() only to go from str to unicode.
- Use encode() only to go from unicode to str.
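In Python 3 (see Footnote 1) this rule is built into the types themselves, as the following sketch shows. The variable names are only for illustration.

```python
# Python 3 sketch: each type only has the method that makes sense for it
text = 'tést'
byte_string = text.encode('utf-8')        # text -> bytes
round_trip = byte_string.decode('utf-8')  # bytes -> text

# The misleading counterparts do not exist in Python 3
bytes_can_encode = hasattr(byte_string, 'encode')
text_can_decode = hasattr(text, 'decode')
print(bytes_can_encode, text_can_decode)  # False False
```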
Footnote 1: In this post, I focus on Python 2, but in Python 3 the situation is similar. The main difference is that Python-3 str objects are mostly equivalent to Python-2 unicode objects, and Python-3 bytes objects are mostly equivalent to Python-2 str objects. The Python-3 names make more sense. But the fact that str is used for fundamentally different types of objects in Python 2 and 3 is insane.
Footnote 2: There is a lot of confusion about what Unicode means, and how it relates to UTF-8. But it's quite simple. In Python, unicode is an object type that is used to unambiguously represent text, but this is a Python-specific use of the term. More generally, Unicode is a standardized look-up table that links values to characters; for example, Unicode says that the value 233 stands for 'é'. UTF-8 is an implementation of Unicode. There are many technically different ways in which the Unicode look-up table can be implemented, and UTF-8 is one of those. So you could say that Unicode is the conceptual part of the encoding, and UTF-8 (and other implementations) is the technical part. Therefore, it doesn't make sense to say that a text file is Unicode-encoded; that's too vague. You need to know exactly how it's encoded, that is, which technical implementation of Unicode is used (usually UTF-8).
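The difference between the look-up table and its implementations can be sketched in a few lines of Python 3. The choice of the three implementations shown here is mine, purely for illustration.

```python
# Python 3 sketch: one Unicode value, several byte-level implementations
char = 'é'
code_point = ord(char)  # the value in the Unicode look-up table

implementations = ('utf-8', 'utf-16-le', 'utf-32-le')
encoded = {name: list(char.encode(name)) for name in implementations}

print(code_point)  # 233
for name in implementations:
    print(name, encoded[name])
# utf-8 [195, 169]
# utf-16-le [233, 0]
# utf-32-le [233, 0, 0, 0]
```

The value 233 is the same in all three cases; only the bytes that end up in the file differ.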