Since the fassembler stuff has slowed down a bit, I’ve started examining the errors that are happening on our live site, in the interest of finding and removing the actual problem spots that users are experiencing. One of the most common error types I’m seeing is UnicodeDecodeError, usually generated by listen, although sometimes it’s caused by the wiki diff code.

Unicode errors are likely to bite all of us at some point, so I thought I’d take a few minutes to give some pointers. This will be very brief; there are plenty of i18n and unicode tutorials on teh interwebs, and I’m not going to try to duplicate those efforts. Instead, I’m going to provide some basic principles that I’ve found useful in my own understanding of and development with unicode. Also keep in mind that this is ridiculously over-simplified and imprecise.

Unicode is the master template

I’ve found it useful to think of unicode as a “master” template for all textual data. What does this mean? It means that any piece of string data, in any language, involving any possible character you can imagine, can be represented as unicode. Furthermore, this unicode representation will be unique. No other textual data will be represented by the same unicode representation.

This makes unicode nice to work with, in an abstract sense. You can intermingle text from a thousand different languages, written in hundreds of different alphabets, and as long as you always use unicode to represent the text everything will be well behaved. Strings that are equal will appear to be equal, and strings that are different will appear to be different. No muss, no fuss. (Note: This is not precisely true; as ianb points out in his comment below, there can be variations in the way that unicode represents accented characters.)
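
To make that caveat about accented characters concrete, here is a minimal sketch. It assumes Python 3 (where `str` is always unicode) and uses the standard `unicodedata` module; the variable names are purely illustrative.

```python
import unicodedata

# Two ways to write "e with acute accent":
composed = "\u00e9"      # one precomposed code point
decomposed = "e\u0301"   # plain "e" followed by a combining acute accent

# They render the same, but they are different code point sequences.
looks_same_but_differs = composed != decomposed

# Normalizing both to one form (NFC here) makes them compare equal.
nfc_composed = unicodedata.normalize("NFC", composed)
nfc_decomposed = unicodedata.normalize("NFC", decomposed)
equal_after_normalizing = nfc_composed == nfc_decomposed
```

If you compare text that may come from different sources, normalizing both sides first is one way to get the "equal strings look equal" behavior back.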

Unicode is only useful in the abstract

So if unicode is so great, we should just always use unicode all the time and be done with it, right? If only it were so easy. For a number of reasons, none of which I’ll go into here, unicode is actually only useful when you’re manipulating textual data in the abstract. Once you actually want to display this textual information to a user, you must encode it into a specific, renderable character set.

Another way to think of this is that, while developers may love unicode, users can’t use it at all. Any time you get textual information from a user, it will be in some encoding. Any time you display textual information to a user, you have to encode it.

Decode -> Unicode; Encode -> some specific encoding

It took me a while to really remember this. I find it helps to remember that unicode is the master code. If you have a piece of text that you got from a user, via email, a web form, or some other means, you have to decode it from the encoding that was used into unicode. When you finish what you’re doing and you want to display the output to the user, you need to encode it back into an appropriate encoding. Decode == represent the text using the master code. Encode == represent the text using some specific encoding that the user will understand.
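
As a rough illustration of the direction of each operation, here is a small sketch. It assumes Python 3, where `bytes` holds encoded text and `str` holds unicode, and it assumes the incoming data is UTF-8; the variable names are made up for the example.

```python
# Text arriving from the outside world is bytes in some encoding
# (UTF-8 is assumed here).
incoming = b"Gr\xc3\xbc\xc3\x9fe"

# Decode: bytes in a specific encoding -> unicode text.
text = incoming.decode("utf-8")

# In unicode we count characters; in the encoded form we count bytes.
char_count = len(text)
byte_count = len(incoming)

# Encode: unicode text -> bytes in a specific encoding, ready to send out.
outgoing = text.encode("utf-8")
```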

You have to use the right encoding

Encodings are not interchangeable. The same piece of text will always (not really, but let’s ignore that detail for now) be represented by the same piece of unicode. The same piece of text will not always be represented in a similar manner across encodings. Consider the following:

>>> orig = 'Mit freundlichen Grüßen'
>>> orig
'Mit freundlichen Gr\xc3\xbc\xc3\x9fen'
>>> orig.__class__
<type 'str'>
>>> orig.decode('utf-8')
u'Mit freundlichen Gr\xfc\xdfen'
>>> uni = orig.decode('utf-8')
>>> uni
u'Mit freundlichen Gr\xfc\xdfen'
>>> uni.__class__
<type 'unicode'>
>>> iso = uni.encode('iso-8859-1')
>>> iso
'Mit freundlichen Gr\xfc\xdfen'
>>> iso.__class__
<type 'str'>
>>> iso == orig
False
>>> uni == iso.decode('iso-8859-1')
True
>>> uni == iso.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "encodings/utf_8.py", line 16, in decode
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 19-22: unexpected end of data
>>> orig == uni.encode('utf-8')
True

My terminal uses the ‘utf-8’ character set, so when I did the original decoding I had to use ‘utf-8’. This gave me pure, clean unicode. From the unicode, I could encode the text back as either utf-8 or as iso-8859-1 (aka ‘latin-1’), but these are NOT equivalent. If I had used iso-8859-1 to do my decoding in the first place, I would have gotten the wrong unicode. You always need to decode into unicode using the right encoding, and you always need to encode back using an encoding that the user can handle.
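
Here is a sketch of what "the wrong unicode" looks like in practice. It assumes Python 3 rather than the Python 2 used in this post, so `bytes` stands in for the old `str` and `str` for the old `unicode`; the variable names are illustrative only.

```python
# The bytes a UTF-8 terminal would hand us for the German sign-off.
raw = "Mit freundlichen Gr\u00fc\u00dfen".encode("utf-8")

# Decoding with the right codec recovers the original text.
right = raw.decode("utf-8")

# Decoding with the wrong codec raises no error here, but the result is
# wrong: each two-byte UTF-8 sequence turns into two unrelated characters.
wrong = raw.decode("iso-8859-1")

# Going the other way fails loudly: these latin-1 bytes are not valid UTF-8.
latin1_bytes = right.encode("iso-8859-1")
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

The silent case is the more dangerous of the two, since nothing tells you the text is garbled until a user sees it.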

The Golden Rule: Decode early, encode late

With all this in mind, then, we’ve come to what I consider the golden rule of handling text data. Like all golden rules, there are good times to break this one, but if you use it as a starting point I think you’ll find you have fewer headaches. The golden rule is this: You should try to decode all text data into pure unicode at the earliest possible point (ideally right when you get it from the user), and you should wait until the last moment to encode your data back into a specific encoding (ideally right before sending the data out to the user).

When you abide by the golden rule, most of your code will be manipulating pure, clean unicode data, and you won’t have to worry much about encodings at all. You can mix and match, compare, and merge text as needed to your heart’s content; since it’s all unicode, everything will play nicely. Ideally, all encoding and decoding will happen at the boundaries of your programs.
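
One way this might look in code is sketched below, assuming Python 3. The `handle_request` function and its `charset` parameter are hypothetical names invented for this example, not part of any of our code.

```python
def handle_request(raw_body: bytes, charset: str = "utf-8") -> bytes:
    """Hypothetical handler: decode on the way in, encode on the way out."""
    # Boundary in: decode as early as possible.
    text = raw_body.decode(charset)

    # Interior: everything here works on pure unicode, no encodings in sight.
    greeting = "Hello, " + text.strip() + "!"

    # Boundary out: encode at the last possible moment.
    return greeting.encode(charset)


# The same interior logic works whatever encoding the user happens to send.
utf8_reply = handle_request("J\u00fcrgen".encode("utf-8"))
latin1_reply = handle_request(b" J\xfcrgen ", "iso-8859-1")
```

Only the first and last lines of the function know about the encoding; everything in between could be moved or reused without caring where the text came from.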

As I said, this is hardly precise. I see these points as the start of a reasonable unicode discussion, not the end. But keeping these ideas in mind has helped this poor American come to better terms with unicode and character sets; hopefully it will do the same for you.

Filed January 15th, 2008 under Training, Design
  1. Know the audience

    • How much do you know about, and value, the ‘semantic web’ and good markup?
  2. Basics (if needed)

    • http://alistapart.com/articles/grokwebstandards

      This is a good overview of the ‘theory’ behind web standards: why they are useful and why you should care.

      “When we first begin designing for the web, we’ll use HTML and CSS crudely, as a means to an end—a method of arranging pretty boxes in space—without grasping the true nature of the box itself or what it contains. Altering that strictly visual mentality is the highest hurdle to overcome when [first diving] into semantics and web standards.”

    • http://www.456bereastreet.com/archive/200711/posh_plain_old_semantic_html/

      POSH, in case you haven’t heard of it already, is short for “Plain Old Semantic HTML”, and is obviously much quicker and easier to say than “valid, semantic, accessible, well-structured HTML”.

    • http://www.nypl.org/styleguide/

      This is a good primer on best practices regarding XHTML and CSS. It’s a bit dated but still useful.

      “This Style Guide for the Branch Libraries of the New York Public Library explains the markup and design requirements for all Branch Libraries web projects, along with various standards and best practices.”

    • http://www.alistapart.com/stories/betterliving/

      “An unauthorized companion to the Online Style Guide of the Branch Libraries of The New York Public Library”

    • A note on DIVs & SPANs: Inline vs Block-level elements

      Block-level elements treat multiple elements on the page as a single block, putting them all in one rectangular area (think <p>, <div>, <table>, and <blockquote>). Inline elements (and this is the part that’s easy to remember) aren’t treated as a block; instead, they fall within lines (i.e., “in line”) and can go next to other inline elements (think <span>, <a>, and <img>). One thing to remember is that block elements cannot go inside inline elements.

  3. Conventions

    • Separation of content, style, and behavior
      • http://www.alistapart.com/articles/behavioralseparation
      • A good follow-up, with examples, to Nick’s comments regarding unobtrusive JavaScript… The same rules apply to CSS.
      • “Breaking up is hard to do. But in web design, separation can be a good thing. Content, style, and behavior all deserve their own space.”
    • Comment end tags of block-level elements, within the block
    • Use units as appropriate: em, px, etc
  4. Common elements

    • Lessons from NUI
      • The importance of consistency
      • Renew, reuse, recycle
      • Productivity and predictability
    • Overall page structure
      • oc-topnav, oc-content-main, oc-content-sidebar, oc-footer, etc
    • Generic elements
      • oc-headingBlock
      • oc-plainList
      • oc-dataTable
      • oc-form
      • oc-getstarted
  5. Any questions?

    • BONUS: Ask me why Transcluder scares me.

Filed November 1st, 2007 under Training, TOPP, Design

During the last couple of months we made a headlong push to get nui out the door, and we succeeded. A week later we redeployed, fixing a number of important bugs. It has been a rich time in that we have had to develop new ways of working together and communicating, and had to develop new processes for a thorough QA review and deployment. I think we all made some mistakes along the way and have learned from them. As they say, experience is the best teacher.

Over the last few days, rmarianski and I, with input from everyone else, have been developing a schedule and plan to bring the new hires up to speed when they arrive on Sept 12th. This has been a challenge and has also been a lot of fun. As it stands, the new hires will begin working on projects which integrate other TOPP projects, such as an openplans/geoserver bridge which would, say, allow people to search for projects by geographic location. I think these projects will be exciting and will immediately integrate the new hires into the organization. Having them here, I think, will significantly alter TOPP, our culture, and the way we work together, and I think we will all have to continue to adapt to accommodate this.

On Monday, most of us will be pushing off for Burning Man. I really have no idea what this will be like. I’m quite looking forward to being with our people in a completely different environment and social culture, in a place where we’ll have to rely upon each other in ways we haven’t had to before. Personally, I’m also looking forward to just getting out of the city for a week and having a change of pace. Ok, see you there!

Filed August 24th, 2007 under Training, Kicking Ass, OpenPlans