Since the fassembler stuff has slowed down a bit, I’ve started examining the errors that are happening on our live site, in the interest of finding and removing the actual problem spots that users are experiencing. One of the most common error types I’m seeing is UnicodeDecodeError, usually generated by listen, although sometimes it’s caused by the wiki diff code.
Unicode errors are likely to bite all of us at some point, so I thought I’d take a few minutes to give some pointers. This will be very brief; there are plenty of i18n and unicode tutorials on teh interwebs, and I’m not going to try to duplicate those efforts. Instead, I’m going to provide some basic principles that I’ve found useful in my own understanding of and development with unicode. Also keep in mind that this is ridiculously over-simplified and imprecise.
Unicode is the master template
I’ve found it useful to think of unicode as a “master” template for all textual data. What does this mean? It means that any piece of string data, in any language, involving any possible character you can imagine, can be represented as unicode. Furthermore, this unicode representation will be unique. No other textual data will be represented by the same unicode representation.
This makes unicode nice to work with, in an abstract sense. You can intermingle text from a thousand different languages, written in hundreds of different alphabets, and as long as you always use unicode to represent the text everything will be well behaved. Strings that are equal will appear to be equal, strings that are different will appear to be different. No muss, no fuss. (Note: This is not precisely true; as ianb points out in his comment below, there can be variations in the way that unicode represents accented characters.)
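To make that parenthetical caveat concrete, here is a small sketch of the accented-character wrinkle. It is my own illustration, not something from the errors on our site: it assumes the variation in question is the precomposed vs. combining-mark forms of a character, and it uses the standard library's `unicodedata.normalize` to reconcile them. The variable names are made up, and the snippet is written with `u''` literals so it should run on either Python 2 or Python 3.

```python
import unicodedata

# Two unicode spellings of the same visible word "Grüßen" (illustrative values):
# one uses the single precomposed code point U+00FC for "ü" ...
composed = u'Gr\xfc\xdfen'
# ... the other uses a plain "u" followed by U+0308, a combining diaeresis.
decomposed = u'Gru\u0308\xdfen'

# They render the same but are different sequences of code points.
print(composed == decomposed)

# Normalizing both to the same form (NFC here) makes them comparable.
normalized = unicodedata.normalize('NFC', decomposed)
print(normalized == composed)
```

If you compare text that came from different sources, normalizing both sides to one form before comparing is a reasonable precaution.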
Unicode is only useful in the abstract
So if unicode is so great, we should just always use unicode all the time and be done with it, right? If only it were so easy. For a number of reasons, none of which I’ll go into here, unicode is actually only useful when you’re manipulating textual data in the abstract. Once you actually want to display this textual information to a user, you must encode it into a specific, renderable character set.
Another way to think of this is that, while developers may love unicode, users can’t use it at all. Any time you get textual information from a user, it will be in some encoding. Any time you display textual information to a user, you have to encode it.
Decode -> Unicode; Encode -> some specific encoding
It took me a while to really remember this. I find it helps to remember that unicode is the master code. If you have a piece of text that you got from a user, via email, a web form, or some other means, you have to decode it from the encoding that was used into unicode. When you finish what you’re doing and you want to display the output to the user, you need to encode it back into an appropriate encoding. Decode == represent the text using the master code. Encode == represent the text using some specific encoding that the user will understand.
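As a minimal sketch of the two directions, assuming the incoming bytes are known to be UTF-8 (the variable names here are just for illustration, and the `b''`/`u''` literals are used so the snippet runs on both Python 2.6+ and Python 3):

```python
# Bytes as they might arrive from a web form or an email, encoded as UTF-8.
raw = b'Mit freundlichen Gr\xc3\xbc\xc3\x9fen'

# Decode: specific encoding -> unicode (the "master code").
text = raw.decode('utf-8')

# In unicode we count characters; in the encoded form we count bytes.
print(len(text))
print(len(raw))

# Encode: unicode -> a specific encoding the recipient can handle.
out = text.encode('utf-8')
print(out == raw)
```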
You have to use the right encoding
Encodings are not interchangeable. The same piece of text will always (not really, but let’s ignore that detail for now) be represented by the same piece of unicode. The same piece of text will not always be represented in a similar manner across encodings. Consider the following:
>>> orig = 'Mit freundlichen Grüßen'
>>> orig
'Mit freundlichen Gr\xc3\xbc\xc3\x9fen'
>>> orig.__class__
<type 'str'>
>>> orig.decode('utf-8')
u'Mit freundlichen Gr\xfc\xdfen'
>>> uni = orig.decode('utf-8')
>>> uni
u'Mit freundlichen Gr\xfc\xdfen'
>>> uni.__class__
<type 'unicode'>
>>> iso = uni.encode('iso-8859-1')
>>> iso
'Mit freundlichen Gr\xfc\xdfen'
>>> iso.__class__
<type 'str'>
>>> iso == orig
False
>>> uni == iso.decode('iso-8859-1')
True
>>> uni == iso.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "encodings/utf_8.py", line 16, in decode
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 19-22: unexpected end of data
>>> orig == uni.encode('utf-8')
True
My terminal uses the ‘utf-8’ character set, so when I did the original decoding I had to use ‘utf-8’. This gave me pure, clean unicode. From the unicode, I could encode the text back as either utf-8 or as iso-8859-1 (aka ‘latin-1’), but these are NOT equivalent. If I had used iso-8859-1 to do my decoding in the first place, I would have gotten the wrong unicode. You always need to decode into unicode using the right encoding, and you always need to encode back using an encoding that the user can handle.
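The "wrong unicode" case is worth seeing, because it doesn't raise an error at all. Here is a small sketch of what would have happened if I had decoded UTF-8 bytes as iso-8859-1 (variable names are illustrative; the literals are written so this runs on Python 2.6+ or Python 3):

```python
# UTF-8 encoded bytes for "Grüßen".
utf8_bytes = b'Gr\xc3\xbc\xc3\x9fen'

# Decoding with the encoding that was actually used gives the right text.
right = utf8_bytes.decode('utf-8')

# iso-8859-1 maps every byte to some character, so the wrong decode
# succeeds silently and produces garbage instead of raising.
wrong = utf8_bytes.decode('iso-8859-1')

print(right == u'Gr\xfc\xdfen')
print(wrong == right)
# Each two-byte UTF-8 sequence became two separate characters.
print(len(right))
print(len(wrong))
```

The silent failure is the nasty one: the garbage only shows up later, when a user sees something like "GrÃ¼Ã..." on the page.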
The Golden Rule: Decode early, encode late
With all this in mind, then, we’ve come to what I consider the golden rule of handling text data. Like all golden rules, there are good times to break this one, but if you use it as a starting point I think you’ll find you have fewer headaches. The golden rule is this: You should try to decode all text data into pure unicode at the earliest possible point (ideally right when you get it from the user), and you should wait until the last moment to encode your data back into a specific encoding (ideally right before sending the data out to the user).
When you abide by the golden rule, most of your code will be manipulating pure, clean unicode data, and you won’t have to worry much about encodings at all. You can mix and match, compare, and merge text as needed to your heart’s content; since it’s all unicode, everything will play nicely. Ideally, all encoding and decoding will happen at the boundaries of your programs.
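Here is one rough way the golden rule might look in code. Everything in it is hypothetical: `read_input`, `make_greeting`, and `write_output` are names I made up for the sketch, not functions from listen or the wiki code, and it assumes you know each input's encoding at the point where it arrives. It runs on Python 2.6+ or Python 3.

```python
def read_input(raw_bytes, encoding):
    """Boundary in: decode as soon as the bytes arrive (hypothetical helper)."""
    return raw_bytes.decode(encoding)


def make_greeting(closing, name):
    """Interior code: works purely on unicode and never mentions an encoding."""
    return closing + u', ' + name


def write_output(text, encoding):
    """Boundary out: encode at the last moment (hypothetical helper)."""
    return text.encode(encoding)


# Two inputs that arrive in different encodings.
closing = read_input(b'Mit freundlichen Gr\xc3\xbc\xc3\x9fen', 'utf-8')
name = read_input(b'J\xfcrgen', 'iso-8859-1')

# Once both are unicode they can be combined without any trouble.
message = make_greeting(closing, name)

# Encode once, right before the data leaves the program.
payload = write_output(message, 'utf-8')
print(repr(payload))
```

Only the first and last functions know that encodings exist, and keeping it that way is the point of the rule.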
As I said, this is hardly precise. I see these points as the start of a reasonable unicode discussion, not the end. But keeping these ideas in mind has helped this poor American come to better terms with unicode and character sets; hopefully it will do the same for you.