Since the fassembler stuff has slowed down a bit, I’ve started examining the errors that are happening on our live site, in the interest of finding and removing the actual problem spots that users are experiencing. One of the most common error types I’m seeing is UnicodeDecodeError, usually generated by listen, although sometimes it’s caused by the wiki diff code.

Unicode errors are likely to bite all of us at some point, so I thought I’d take a few minutes to give some pointers. This will be very brief; there are plenty of i18n and unicode tutorials on teh interwebs, and I’m not going to try to duplicate those efforts. Instead, I’m going to provide some basic principles that I’ve found useful in my own understanding of and development with unicode. Also keep in mind that this is ridiculously over-simplified and imprecise.

Unicode is the master template

I’ve found it useful to think of unicode as a “master” template for all textual data. What does this mean? It means that any piece of string data, in any language, involving any possible character you can imagine, can be represented as unicode. Furthermore, this unicode representation will be unique. No other textual data will be represented by the same unicode representation.

This makes unicode nice to work with, in an abstract sense. You can intermingle text from a thousand different languages, written in hundreds of different alphabets, and as long as you always use unicode to represent the text, everything will be well behaved. Strings that are equal will appear to be equal; strings that are different will appear to be different. No muss, no fuss. (Note: This is not precisely true; as ianb points out in his comment below, there can be variations in the way that unicode represents accented characters.)
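To make the caveat about accented characters concrete, here is a minimal sketch using the standard library's `unicodedata` module. It is written for Python 3, where `str` is unicode text; the variable names are just for illustration.

```python
import unicodedata

# Two ways to write "ü": one precomposed code point,
# or a plain "u" followed by a combining diaeresis.
composed = "\u00fc"
decomposed = "u\u0308"

# They look the same on screen but are different unicode strings.
same_before = (composed == decomposed)

# Normalizing both to one form (NFC here) makes the comparison behave.
same_after = (unicodedata.normalize("NFC", composed)
              == unicodedata.normalize("NFC", decomposed))
```

So "equal strings compare equal" holds reliably only once everything has been normalized to the same form.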

Unicode is only useful in the abstract

So if unicode is so great, we should just always use unicode all the time and be done with it, right? If only it were so easy. For a number of reasons, none of which I’ll go into here, unicode is actually only useful when you’re manipulating textual data in the abstract. Once you actually want to display this textual information to a user, you must encode it into a specific, renderable character set.

Another way to think of this is that, while developers may love unicode, users can’t use it at all. Any time you get textual information from a user, it will be in some encoding. Any time you display textual information to a user, you have to encode it.

Decode -> Unicode; Encode -> some specific encoding

It took me a while to really remember this. I find it helps to remember that unicode is the master code. If you have a piece of text that you got from a user, via email, a web form, or some other means, you have to decode it from the encoding that was used into unicode. When you finish what you’re doing and you want to display the output to the user, you need to encode it back into an appropriate encoding. Decode == represent the text using the master code. Encode == represent the text using some specific encoding that the user will understand.
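A tiny sketch of the two directions, assuming Python 3 (where the encoded form is `bytes` and the unicode form is `str`; in Python 2 these are `str` and `unicode`). The sample bytes are made up for illustration.

```python
# Bytes as they might arrive from a web form or an email, encoded as UTF-8.
raw = b"Gr\xc3\xbc\xc3\x9fe"

# Decode: some specific encoding -> unicode (the master code).
text = raw.decode("utf-8")

# In the abstract, the text is five characters long,
# even though the UTF-8 form took seven bytes.
char_count = len(text)
byte_count = len(raw)

# Encode: unicode -> some specific encoding the user can handle.
out = text.encode("utf-8")
```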

You have to use the right encoding

Encodings are not interchangeable. The same piece of text will always (not really, but let’s ignore that detail for now) be represented by the same piece of unicode. The same piece of text will not always be represented in a similar manner across encodings. Consider the following:

>>> orig = 'Mit freundlichen Grüßen'
>>> orig
'Mit freundlichen Gr\xc3\xbc\xc3\x9fen'
>>> orig.__class__
<type 'str'>
>>> orig.decode('utf-8')
u'Mit freundlichen Gr\xfc\xdfen'
>>> uni = orig.decode('utf-8')
>>> uni
u'Mit freundlichen Gr\xfc\xdfen'
>>> uni.__class__
<type 'unicode'>
>>> iso = uni.encode('iso-8859-1')
>>> iso
'Mit freundlichen Gr\xfc\xdfen'
>>> iso.__class__
<type 'str'>
>>> iso == orig
False
>>> uni == iso.decode('iso-8859-1')
True
>>> uni == iso.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "encodings/utf_8.py", line 16, in decode
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 19-22: unexpected end of data
>>> orig == uni.encode('utf-8')
True

My terminal uses the ‘utf-8’ character set, so when I did the original decoding I had to use ‘utf-8’. This gave me pure, clean unicode. From the unicode, I could encode the text back as either utf-8 or as iso-8859-1 (aka ‘latin-1’), but these are NOT equivalent. If I had used iso-8859-1 to do my decoding in the first place, I would have gotten the wrong unicode. You always need to decode into unicode using the right encoding, and you always need to encode back using an encoding that the user can handle.
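The "wrong unicode" case is worth seeing, because it does not raise an error. A minimal sketch, assuming Python 3 (`bytes` for encoded data, `str` for unicode):

```python
# The same greeting as UTF-8 bytes, as a utf-8 terminal would supply it.
raw = "Mit freundlichen Gr\u00fc\u00dfen".encode("utf-8")

# Decoding with the right encoding gives the text back.
right = raw.decode("utf-8")

# Decoding with iso-8859-1 also "works", but every byte becomes its own
# character, so each two-byte UTF-8 sequence turns into two wrong characters.
wrong = raw.decode("iso-8859-1")

silently_wrong = (wrong != right)
extra_chars = len(wrong) - len(right)
```

Nothing blows up here, which is exactly why this kind of mistake tends to surface much later as garbage on a page or a UnicodeDecodeError somewhere far from the cause.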

The Golden Rule: Decode early, encode late

With all this in mind, then, we’ve come to what I consider the golden rule of handling text data. Like all golden rules, there are good times to break this one, but if you use it as a starting point I think you’ll find you have fewer headaches. The golden rule is this: You should try to decode all text data into pure unicode at the earliest possible point (ideally right when you get it from the user), and you should wait until the last moment to encode your data back into a specific encoding (ideally right before sending the data out to the user).

When you abide by the golden rule, most of your code will be manipulating pure, clean unicode data, and you won’t have to worry much about encodings at all. You can mix and match, compare, and merge text as needed to your heart’s content; since it’s all unicode, everything will play nicely. Ideally, all encoding and decoding will happen at the boundaries of your programs.
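Roughly, a program that follows the rule has this shape. This is only a sketch, assuming Python 3; `handle_request` and `greet` are hypothetical names, not anything from our codebase.

```python
def greet(name):
    # Inner code only ever sees unicode text, never encoded bytes.
    return "Hello, " + name.strip() + "!"


def handle_request(raw_body, charset):
    # Decode early: at the input boundary, using the sender's encoding.
    text = raw_body.decode(charset)
    reply = greet(text)
    # Encode late: at the output boundary, into what the client expects.
    return reply.encode("utf-8")


# Input arrives as iso-8859-1 bytes; output leaves as UTF-8 bytes.
response = handle_request(" Gr\u00fc\u00dfe ".encode("iso-8859-1"), "iso-8859-1")
```

The only two places that know about encodings are the first and last lines of `handle_request`.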

As I said, this is hardly precise. I see these points as the start of a reasonable unicode discussion, not the end. But keeping these ideas in mind has helped this poor American come to better terms with unicode and character sets; hopefully it will do the same for you.

Filed January 15th, 2008 under Training, Design

Hooray for stuff that takes way longer than it ought to! In my case, it’s probably not so much a problem with architecture as a lack of familiarity with the tools I’m using, but I did manage to spend most of the last three days trying to figure out why a component I added to Geoserver broke a lot of it (fortunately, nothing mapping-related was broken, but configuration and the demo page were).

I kind of feel like this blog is a bad place to go into too much depth about the details of Java servlet code, but I’ll try and explain what I’m talking about.  The feature I was trying to add to Geoserver was a request logger that tracks some details about each incoming HTTP request, including the body of PUT/POST requests.  The Java servlet API exposes the bodies of such requests as a stream which can only be read once, so by logging the body I was preventing Geoserver from being able to actually do anything with it.

The solution to this, at least in the Java world, is to create a wrapper object that for the most part just delegates to the original request object, but for the purposes of reading the body creates streams based on an in-memory buffer of the request body. Java provides an HttpServletRequestWrapper class for just such occasions, so all I had to do was override the methods that provided access to the body; a fairly straightforward task. All was well until Tim found that my filter broke the Geoserver demo request page.
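The buffering-wrapper idea is not specific to Java. Here is a rough sketch of the same pattern in Python, purely to show the shape of it; this is not servlet code, and `BufferedBodyRequest`, `FakeRequest`, and `get_stream` are made-up names.

```python
import io


class FakeRequest:
    """Stand-in for a request whose body stream can only be read once."""

    def __init__(self, body):
        self.stream = io.BytesIO(body)
        self.method = "POST"


class BufferedBodyRequest:
    """Delegates to the wrapped request, but serves the body from a buffer."""

    def __init__(self, request):
        self._request = request
        self._body = None  # filled in the first time the body is needed

    def get_stream(self):
        # Read the one-shot stream once, then hand out fresh streams
        # over the in-memory copy on every call.
        if self._body is None:
            self._body = self._request.stream.read()
        return io.BytesIO(self._body)

    def __getattr__(self, name):
        # Anything not defined here is delegated to the original request.
        return getattr(self._request, name)


wrapped = BufferedBodyRequest(FakeRequest(b"<wfs:GetFeature/>"))
logged = wrapped.get_stream().read()      # the logger reads the body...
processed = wrapped.get_stream().read()   # ...and the real handler still can
```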

 Initially, I assumed I had missed some corner case in my code allowing access to the body (this is not quite as crazy as it seems since the servlet API allows access in both of Java’s versions of a stream, InputStream and Reader, and the request is supposed to blow up if both of them are requested on the same request).  So I wrote a few tests to try and work out what was wrong with them, and found nothing.  I moved on to the stack trace and found a largely unhelpful error message indicating a line squarely in code I didn’t have the source to, and spent a while trying to work out exactly what was going on (as well as discovering that none of my code was even being run!!).  Finally, I idly switched my logging code from using an InputStream to using a Reader to get the request body, and all of a sudden the stack trace was pointing at code in the request! 

As it turns out, there are more than two ways to access the body of an HttpServletRequest in Java. The third is to call one of four or so functions that provide access to form values, and the API clearly states that it’s fine for implementers to fail if the body has already been read when the form values are requested. It ended up only taking a few dozen lines (around half of them converting the form values to different data structures) to fix the problem.

 

So, what was the cause of all of this trouble? I guess it’s all my fault for not fully reading the API specification for the class I was implementing. Then again, it’s rather strange that the servlet API implementation I was using didn’t complain about trying to read the stream twice when I was using an InputStream to read it. Maybe it’s because an HTML feature (form parsing) is part of the HttpServletRequest API. I think the biggest issue, though, was that I gave up too easily trying to get the source code in Eclipse for debugging. The interface for attaching source code to a referenced .jar file is buried in the project settings and doesn’t make it particularly obvious that you can change the setting. Or possibly, there’s something to the suggestions I’ve been hearing that Java might not be so great a language.

At any rate, my next servlet request implementation will be a lot less problematic. 

Filed January 7th, 2008 under Blogging, Design

Once again, I’d been thinking of making a blog post about what I’m working on (short version: I’m taking a break from REST configuration (I’ll leave the puns as an exercise for the reader) to work on what basically amounts to a server-side optimization for the Vespucci project) but I find myself more interested in this discussion on the OpenCore dev list. I thought about making this post an email to that list, but the comments I’m planning to make are basically tangential to the thread there.

What I take away from that thread is that we the OpenCore developers are planning on making people on openplans.org more like projects, with their own featurelets (mailing lists and task trackers and such) to go along with the wiki that’s currently provided. This seems kind of weird to me, as my understanding is that openplans.org is about projects and making it easy for groups with similar goals to collaborate and share skillsets and experience. It’s not clear to me how letting a person track personal tasks or run a personal blog is helping to accomplish that. Instead, I would think a generalization of my earlier thoughts on blogging would be more appropriate: have tabs on the user account page for each of the available featurelets, but show them as a filter on the entire site rather than unique content. So, the task tracker tab would show only tasks assigned to the user, the mailing lists tab would show threads the user participated in, etc. That would let you see things from a people-oriented point of view without creating this kind of island of content where the user has their own personal stuff on a site that’s supposedly intended to help them share with others.

Of course, I have no idea whether this really makes sense. I often find that ways of doing things that make sense to me are kind of lost on non-developers (like when I bitch to Windows users about how hard it is to change file extensions on their platform and they stare blankly and wonder what possible reason you could have to want to do that). But, I’m not a developer on OpenCore so hopefully my perspective on things isn’t too tainted by elbow grease.

Another thought that occurred to me while thinking about this is that if people are projects, and people can be members of projects, then OpenCore obviously supports having projects as members of projects. This excited me more than a little (being a computer geek, I of course love hierarchies!). I think if that sort of nesting of projects is allowed then you can structure things in a way that makes a lot of sense. For example, you can have an Organization be an entity recognized by openplans.org in the same way that People and Projects currently are. An Organization could probably be nearly identical to a Person, but with some string changes (an organization has a logo, not a photo, and services rather than skills, etc.). Of course you can’t log in as an organization either. Organizations and Projects could have as members People, Projects, or other Organizations (think of local chapters or subdivisions for Organization->Organization, and planning committees for Project->Organization), and People wouldn’t be allowed to have any members, of course. The neat thing that I see here is that the membership wouldn’t have to be exclusive; in the same way that People can be members of multiple Projects, Projects could be sponsored by multiple Organizations. Organizations could have multiple parent Organizations as well (like Geoserver could be both a subdivision of TOPP and a member of OpenGIS).
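To make the membership idea a bit more concrete, here is a toy model in Python. It is only my sketch of the relationships described above, not how OpenCore stores anything; every class and attribute name is hypothetical.

```python
class Entity:
    """Anything openplans.org could recognize: a person, project, or organization."""

    can_have_members = True

    def __init__(self, name):
        self.name = name
        self.members = []    # entities that belong to this one
        self.member_of = []  # entities this one belongs to (non-exclusive)

    def add_member(self, other):
        if not self.can_have_members:
            raise ValueError("%s cannot have members" % type(self).__name__)
        self.members.append(other)
        other.member_of.append(self)


class Person(Entity):
    can_have_members = False  # People are always leaves


class Project(Entity):
    pass


class Organization(Entity):
    pass


# The Geoserver example: one organization with two parents.
topp = Organization("TOPP")
opengis = Organization("OpenGIS")
geoserver = Organization("Geoserver")
topp.add_member(geoserver)
opengis.add_member(geoserver)
```

Because membership is just a pair of lists, nothing stops an entity from having several parents, which is the non-exclusive part.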

Anyway, like I said I’m not really aware enough of OpenPlans to know whether any of what I said is really applicable. To follow in the recent trend of posts on what success is for openplans, I’d say it’s probably not measured by how well the software models the relationship between people, projects, and organizations :) But I do think that modeling them well makes it easier to present them in a navigable way (that is, making it easier to find projects that are related in the types of ways our software knows about). If our goal for openplans.org is to bring projects together, then it’s probably a step in the right direction.

Filed December 5th, 2007 under OpenCore, User Experience, Design
  1. Know the audience

    • How much do you know about, and value, the ‘semantic web’ and good markup?
  2. Basics (if needed)

    • http://alistapart.com/articles/grokwebstandards

      This is a good overview of the ‘theory’ behind web standards: why they are useful and why you should care.

      “When we first begin designing for the web, we’ll use HTML and CSS crudely, as a means to an end—a method of arranging pretty boxes in space—without grasping the true nature of the box itself or what it contains. Altering that strictly visual mentality is the highest hurdle to overcome when [first diving] into semantics and web standards.”

    • http://www.456bereastreet.com/archive/200711/posh_plain_old_semantic_html/

      POSH, in case you haven’t heard of it already, is short for “Plain Old Semantic HTML”, and is obviously much quicker and easier to say than “valid, semantic, accessible, well-structured HTML”.

    • http://www.nypl.org/styleguide/

      This is a good primer on best practices regarding XHTML and CSS; it’s a bit dated but still useful.

      “This Style Guide for the Branch Libraries of the New York Public Library explains the markup and design requirements for all Branch Libraries web projects, along with various standards and best practices.”

    • http://www.alistapart.com/stories/betterliving/

      “An unauthorized companion to the Online Style Guide of the Branch Libraries of The New York Public Library”

    • A note on DIVs & SPANs: Inline vs Block-level elements

      Block-level elements treat multiple elements on the page as a single block, putting them all in one rectangular area (think <p>, <div>, <table>, and <blockquote>). Inline elements, and this is the part that’s easy to remember, aren’t treated as a block; instead, they fall within lines (i.e., “in line”) and can go next to other inline elements (think <span>, <a>, and <img>). One thing to remember is that block elements cannot go inside inline elements.

  3. Conventions

    • Separation of content, style, and behavior
      • http://www.alistapart.com/articles/behavioralseparation
      • A good follow-up, with examples, to Nick’s comments regarding unobtrusive Javascript… The same rules apply to CSS.
      • “Breaking up is hard to do. But in web design, separation can be a good thing. Content, style, and behavior all deserve their own space.”
    • Comment end tags of block-level elements, within the block
    • Use units as appropriate: em, px, etc
  4. Common elements

    • Lessons from NUI
      • The importance of consistency
      • Renew, reuse, recycle
      • Productivity and predictability
    • Overall page structure
      • oc-topnav, oc-content-main, oc-content-sidebar, oc-footer, etc
    • Generic elements
      • oc-headingBlock
      • oc-plainList
      • oc-dataTable
      • oc-form
      • oc-getstarted
  5. Any questions?

    • BONUS: Ask me why Transcluder scares me.


Filed November 1st, 2007 under Training, TOPP, Design

As we move out of the Dark Age of Kupu and into the Enlightened Age of Xinha we are faced with an interesting set of issues and opportunities: not only do we have to determine the degree of user control we allow but also—and to some degree as a consequence of that—we have to question certain fundamental assumptions about the (in)appropriateness of the WYSIWYG model. For other information regarding our WYSIWYG decisions see: What WYSYWIG is Up To and the WYSYIWYG Wishlist.

User Control

In recent conversations with Jackie I argued for limiting (though I prefer to say managing) user options to increase user happiness. My justification for this is best described by the adjacent line graphs.

Our editor should make it easy for our users to create documents for the web, but it should not provide so many options as to be stressful or confusing. As Kathy Sierra says, “the amount of pain and effort should match the user’s perceived payoff”. Our users want control and ownership of their pages, and that is something with which we provide them, but at what point do we provide so many options that the user no longer feels in control of the page?

We need to tailor the editor to fulfill the needs of our users: basic word processing, including basic inline type styling (bold, italic, underline, strikethrough, color); list creation (ordered and unordered, definition lists(?), and nesting); table creation and management (table styling, row and column manipulation); image inclusion; hyperlinking and page creation (wicked, internal, and external should be consolidated). But, we don’t need to provide much more… in fact we probably shouldn’t provide things like: font families, absolute font sizes, perhaps even text alignment (left, right, center)…

The WYSIWYG Model

…because the WYSIWYG model, as we have been thinking about it, may not be the ideal way of fulfilling these (perceived) needs. As Luke and I discussed it: the WYSIWYG model is predicated on emulating the experience of a typical word processor, specifically in regards to designing for presentation, and thus fails to engage with the medium that is the web. As stated above, our editor should make it easy for our users to create documents for the web. The concept we have been working with emphasizes presentation (how things look) when it should equally emphasize structure (how things work) and semantics (what things mean).

To provide our users with a better experience we need to better understand web standards (like A List Apart) and incorporate all three modes of thinking/editing into our editor in a way that is easy for users to understand. WYMEditor, an example Luke pointed out, works on a WYSIWYM model and emphasizes the meaning of content over its presentation.

The WYSIWYM Model: Ideas About Implementation

It’s possible and desirable to implement core elements of the WYSIWYM model in our editor in several ways:

+ Clean Markup

Example: If something like text-alignment is crucial (which it’s not) then non-exclusive block-level CSS classes are the place where it should be defined, not the “align” or even “style” attributes.

Producing acceptable (X)HTML is perhaps the most technically difficult yet also the most important thing for our editor to do. There is little point in focusing on the presentational niceties if the markup won’t render correctly anyway. The markup produced should strive to be structurally and semantically proper, conform to standards, and not be contaminated by visual information like font styles and weights, borders, colors, etc.

We can accomplish this by limiting certain tags (like the evil “font” tag), running operations on the output markup to ensure a base level of acceptability, and outsourcing visual information to pre-defined CSS classes that are applied at the block level wherever possible. Ideally the editor could even be aware of the block-level context and provide only those styles which are appropriate to that block type (for example table styles for tables, list styles for lists, etc). By constraining users’ choices intelligently we make it easier for the user to get things done. Furthermore, if we don’t provide clean markup then any future support for theming or alternate styles will result in extreme pain for both us and our users.
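As a rough idea of what “running operations on the output markup” could look like, here is a small sketch using Python’s standard `html.parser`. It has nothing to do with Xinha’s own API; the banned tag and attribute lists, `MarkupCleaner`, and `clean` are assumptions for illustration, and a real cleanup pass would need to handle far more (comments are simply dropped here, for instance).

```python
from html import escape
from html.parser import HTMLParser

BANNED_TAGS = {"font"}             # tags dropped, their contents kept
BANNED_ATTRS = {"align", "style"}  # presentational attributes dropped


class MarkupCleaner(HTMLParser):
    def __init__(self):
        # Keep entities like &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _attrs(self, attrs):
        return "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
            if name not in BANNED_ATTRS
        )

    def handle_starttag(self, tag, attrs):
        if tag not in BANNED_TAGS:
            self.out.append("<%s%s>" % (tag, self._attrs(attrs)))

    def handle_startendtag(self, tag, attrs):
        if tag not in BANNED_TAGS:
            self.out.append("<%s%s />" % (tag, self._attrs(attrs)))

    def handle_endtag(self, tag):
        if tag not in BANNED_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def clean(markup):
    parser = MarkupCleaner()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


cleaned = clean('<p align="center" class="note"><font color="red">Hi</font> there</p>')
```

The `class` attribute survives, which is the point: visual information moves into pre-defined CSS classes while `font`, `align`, and `style` are stripped.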

+ Semantic/Structural User-end Views

The editor should make the structure of the page transparent to our users, either by default or through a toggle (like the ¶ button in Microsoft Word). As you can see from their demo, WYMeditor does a good job of this (they clearly leave out presentational elements as they are committed to a “pure” WYSIWYM model). This level of information benefits our users because they learn to think of the web as distinct from print media, and it enables them to build pages that will function across themes and Deliverations without breaking.

We can accomplish a similar effect using CSS specific to the editor context that adds borders, backgrounds, margin, padding, and (vernacular) labels to the block-level elements as we determine to be appropriate.

Wiki Pages vs Collaborative Documents

As an aside, we should have the editor be somewhat context dependent. Quoting/paraphrasing an earlier comment Cholmes made on the WYSYIWYG Wishlist: when people are in the collaborative document editing space it’d be nice if they could have access to the full range of fonts, sizes, etc. because that’s the area we want people treating it like a word processor, not the wiki pages. We’d have to discuss how we would make collaborative documents migratable to wiki docs from this perspective as well as a variety of other issues specific to the differences between these two document types.

Filed February 14th, 2007 under User Experience, Design, OpenPlans

Frisbee Design Draft 2

Filed January 3rd, 2007 under TOPP, Design