
What? Unicode?

Ted Leung: There Ain't No Such Thing As Plain Text

I got my introduction to character encodings and Unicode the hard way, when I was working on XML. Joel Spolsky has written a good introduction.

I'm glad Joel Spolsky is blogging about this because now tens of thousands of developers will realize their XML generation code sucks.

If you're not up on this issue, you need to be. If you want more detail, Unicode: A Primer is a good book.

Second the book. This is also worth printing off: A tutorial on character code issues, as is Uche's Proper XML Output in Python. Tim Bray also wrote a fine series on encodings not so long ago.

My take on all this urging is that while it's good to insist people know Unicode, please keep in mind it's somewhat difficult to ThinkUnicode at the start. I mean difficult in the same way event-driven programs or parallelization can be, except with Unicode you have to learn to stop believing your eyes (literally).

[I'd probably give a present to the person who wrote an IntelliJ plugin, something like a hex viewer, to display XML files as Unicode code points.]
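To show what "stop believing your eyes" means in practice, here is a rough sketch of the code point view such a viewer might offer. It is not an IntelliJ plugin; the class name CodePointDump and its dump method are made up for illustration, and it assumes a modern JDK (Java 8 or later) for String.codePoints(). The two strings render identically on screen but contain different code points.

```java
class CodePointDump {
    // Render each Unicode code point of the input as U+XXXX, separated by spaces.
    // (Hypothetical helper, named here for illustration only.)
    static String dump(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // "cafe" with e-acute as a single precomposed code point
        String precomposed = "caf\u00E9";
        // "cafe" with a plain e followed by COMBINING ACUTE ACCENT
        String decomposed = "cafe\u0301";

        System.out.println(dump(precomposed)); // U+0063 U+0061 U+0066 U+00E9
        System.out.println(dump(decomposed));  // U+0063 U+0061 U+0066 U+0065 U+0301
        System.out.println(precomposed.equals(decomposed)); // false
    }
}
```

Both strings display as the same word, yet equals() says they differ, which is exactly the kind of thing you only notice once you look at code points rather than glyphs.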

And while we're on the subject: in Java, a char represents a UTF-16 code unit, not whatever we grew up thinking it was (a character, probably). That makes String a UTF-16 code unit API - an undocumented one, naturally :)
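A small sketch of the point about char, assuming a modern JDK: the code point methods used here (codePointCount, codePointAt, Character.isHighSurrogate) arrived in Java 5, after this entry was written. Strictly speaking a char holds one 16-bit UTF-16 code unit, so a character outside the Basic Multilingual Plane takes two of them. The class name CharIsCodeUnit and its helpers are invented for the example.

```java
class CharIsCodeUnit {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF, stored as a surrogate pair
        String clef = "\uD834\uDD1E";

        System.out.println(codeUnits(clef));   // 2
        System.out.println(codePoints(clef));  // 1
        // The first char is only half of the character.
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true
        System.out.println(Integer.toHexString(clef.codePointAt(0)));  // 1d11e
    }
}
```

So length(), charAt() and friends count and index code units, and anything that treats one char as one character is quietly wrong for text like this.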

[bubba sparxxx: bubba talk]

October 16, 2003 01:12 AM


Guillaume Laforge
(October 17, 2003 07:13 AM #)

Regarding IntelliJ, it is already able to open files with the right encoding. I know this because I wrote some classes for charset discovery that got integrated into our beloved IDE, thanks to Maxim Shafirov. In the previous version (3.0), you could see (UTF-8) in the title bar beside the name of the currently opened file. Unfortunately, this feature has disappeared in the recent EAPs. You can have a look at this entry on my weblog about the code used inside IntelliJ at http://glaforge.free.fr/weblog/index.php?itemid=39

On a side note, you can vote on the following SCRs regarding encodings:
http://www.intellij.net/tracker/idea/viewSCR?publicId=13261 and

And by the way, what kind of present would you offer me if I were to write a plugin indicating the current editor's encoding? ;-)

Guillaume Laforge

Guillaume Laforge
(October 20, 2003 08:38 PM #)

I've finally developed a plugin that shows the encoding of the current editor in IntelliJ's status bar.

You can find it here :

