Prefer video? Check out this YouTube tutorial.
Special characters are a pain for programmers, and a source of bugs. Whenever your code contains an accented letter, voilà, funny-looking symbols appear; or, worse, your program crashes with an obscure error. Why does it have to be that hard!?
It doesn't. Character encoding is not that difficult once you understand the basic principles. Let's take a look.
Characters are interpretations
For a computer, there is no text, only bytes. Text is an interpretation, a way to visualize bytes in human-readable form. And if you interpret bytes wrong, you get funky-looking characters, or even an error. A character encoding is one specific way of interpreting bytes: It's a look-up table that says, for example, that a byte with the value 97 stands for 'a'.
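The look-up table idea is easy to verify in the interpreter. A minimal sketch in Python 3 syntax, where ord() and chr() map between a character and its numeric value:

```python
# A character encoding maps byte values to characters.
# In ASCII (and most encodings), the value 97 stands for 'a'.
print(ord('a'))   # the numeric value behind 'a': 97
print(chr(97))    # the character that 97 stands for: 'a'

# Interpreting a raw byte as text is exactly this table look-up:
print(bytes([97]).decode('ascii'))  # 'a'
```

The decode() call makes the interpretation step explicit: the byte 97 only *becomes* the letter 'a' once we pick a table to read it with.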
Now let's consider these three bytes:
195-167-97
In Python 2, this is a str object: a series of bytes without any information about how they should be interpreted. Don't let the name confuse you: a str is not a string of characters; rather, it's a series of bytes that you can interpret as a string of characters. But the proper interpretation requires the proper encoding, and the problem is: a str doesn't know its own encoding!
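Because the bytes carry no encoding information, the same bytes can be read in several ways, and only one of them is right. A quick illustration in Python 3 syntax, where the split is explicit (bytes holds raw bytes, str holds decoded text); the byte values and the two encodings here are chosen just for demonstration:

```python
# Two bytes that happen to be the UTF-8 encoding of 'é'.
raw = bytes([195, 169])

# Same bytes, two interpretations:
print(raw.decode('utf-8'))    # read as UTF-8:   é
print(raw.decode('latin-1'))  # read as Latin-1: Ã©
```

Both calls succeed; the bytes themselves cannot tell you which result was intended. That is exactly the mojibake you see when a program guesses the wrong encoding.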
Now let's turn to actual Python:
# Mystery string! In Python 2, chr() returns a one-byte str,
# so byte_string holds the raw bytes 195, 167, 97.
byte_string = chr(195) + chr(167) + chr(97)
We have three bytes (195-167-97), but no idea what they stand for. How can we decipher these bytes into actual text? Well, we need …