My bad opinions

2011/09/26

Will the real Unicode wrangler please stand up?

Clever section title #1

There is a saying among Erlang regulars that string handling isn't that bad in Erlang. It's a question of choosing the right data structure: lists of characters, binaries, IO lists. Each of them has advantages in given circumstances. Lists are okay for general cases, strings you want to manipulate and iterate through. Binaries are fine for compact information that you want to carry around, as they're more like immutable arrays of characters. IO lists are lists of bytes (0..255) and/or binaries and/or other IO lists that let you do easy appending operations, allowing you to somewhat lazily build complex strings.

After that, it's kind of a breeze. All you have to do is stick to your structure, and convert between them with functions such as iolist_to_binary/1, list_to_binary/1 and binary_to_list/1.
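To make the three structures concrete, here's a minimal sketch. The module and function names (iolist_demo, greeting/1, as_list/1) are made up for illustration; only iolist_to_binary/1 and binary_to_list/1 are real built-ins. It sticks to plain ASCII on purpose.

```erlang
-module(iolist_demo).
-export([greeting/1, as_list/1]).

%% Hypothetical helper: build a greeting as an IO list.
%% The nested list mixes binaries, a character and a string; nothing is
%% concatenated until iolist_to_binary/1 flattens it in one pass.
greeting(Name) when is_binary(Name) ->
    IoList = [<<"Hello, ">>, Name, [$!, "\n"]],
    iolist_to_binary(IoList).

%% Hypothetical helper: the same greeting as a list of bytes.
%% This is only safe here because every character is ASCII.
as_list(Name) ->
    binary_to_list(greeting(Name)).
```

Calling iolist_demo:greeting(<<"Joe">>) gives back the binary <<"Hello, Joe!\n">>, and as_list/1 gives the same text as a plain list.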

Except that if you know a bit about Unicode, you might have squirmed a bit in your chair (or when standing because you use one of these funny standing desks) when I mentioned 'arrays of characters' and 'lists of characters'.

A background check in Unicode

If you don't know Unicode, here's the basic rundown. Characters as bytes is no longer the truth. FORGET ABOUT IT RIGHT NOW.

Unicode code points line up with bytes for the old ASCII stuff and for Latin-1 (oh, to be Western-centric when designing stuff): the first 256 code points are exactly the Latin-1 characters, so in a list of integers each one still fits in a byte. That's why most of the time we can just blindly keep going our merry way and pretend Erlang's got great support for Unicode! Most of the time, we all add a few lambda signs here and there to our strings when discussing programming languages (because boy do we love functional stuff) and call it a day.

The truth of Unicode is that you have things like 'graphemes', 'code points', 'grapheme clusters', 'glyphs' and so on. I won't get into all the details because, frankly, I haven't read the whole stack of Unicode documents and couldn't give an absolutely accurate description. In any case, the point is this: while some characters can be represented as a single byte, true Unicode can be way harsher than this.

A basic character such as 'é' can be represented as [233] (U+00E9). That's the usual stuff we're used to. However, there is an alternative representation, [101,769] (U+0065 and U+0301). The latter will still render as 'é' if printed right, but is instead composed of the Latin letter 'e' followed by a special combining acute accent.

There can be more complex representations, made of 3 or more of these combining code points. Better than that, some of them are context-sensitive. I've had bugs where the sequence of code points [Unicode1, Unicode2] would show up as two standard characters when displayed on its own. However, once appended to any other string (let's say [$a, Unicode1, Unicode2]), the result is still two characters (or glyphs, as in printed-out letters), just not the same two: Unicode1 now combines with the 'a' in front of it. What was two things standing alone has turned into one letter composed of two code points, followed by one character that stands on its own. Still two characters.

Because of this, you can quickly see that your good old length and comparison functions are broken! Indeed, [233] and [101,769] are represented differently, and if you use the default operators and functions of Erlang (or any non-unicode-ready language), they will have different lengths and be considered different things. However, in human languages, they are both 'é' and should be equal on all points.
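Here's a small sketch of that breakage, using nothing but length/1 and ==. The module name e_acute_demo and its function names are hypothetical; the code points are the ones given above.

```erlang
-module(e_acute_demo).
-export([precomposed/0, decomposed/0, compare/0]).

%% U+00E9, LATIN SMALL LETTER E WITH ACUTE: one code point.
precomposed() -> [233].

%% U+0065 followed by U+0301 (combining acute accent): two code points.
decomposed() -> [101, 769].

%% Returns {LengthOfPrecomposed, LengthOfDecomposed, ComparedEqual}.
%% Both lists stand for the same letter, yet the built-ins only see integers.
compare() ->
    A = precomposed(),
    B = decomposed(),
    {length(A), length(B), A == B}.
```

e_acute_demo:compare() returns {1,2,false}: two different lengths and a failed comparison for what a human reads as the same 'é'.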

The Unicode standard specifies algorithms to handle these cases, such as normalization and grapheme cluster segmentation. Once someone has implemented them in a library, working with such strings gets far easier.

And the shit hits the fan

That's where things go bad. Erlang has no default library to handle these. You have to rely on community-provided libraries such as ux to do it for you.

That's the shocker: if you say your application is compliant with UTF-8, UTF-16 or UTF-32 (encodings of Unicode) and you manipulate strings without any of these libraries, then you've been lying all along.

But wait, there's more!

Let's say I'm taking a unicode string such as 'Привет!', which I've been told means 'hi!' in Russian. If I represent it as a binary, I will get <<208,159,209,128,208,184,208,178,208,181,209,130,33>>. If you try to output that in an Erlang shell:

1> io:format("~s~n",[<<208,159,209,128,208,184,208,178,208,181,209,130,33>>]).
ÐÑÐ¸Ð²ÐµÑ!
ok
2> io:format("~ts~n",[<<208,159,209,128,208,184,208,178,208,181,209,130,33>>]).
Привет!
ok

See this? Unicode strings are only printed right (although the glyphs may look different depending on your font) when using the ~ts control sequence instead of just ~s. The t is a special modifier telling io:format that the argument is Unicode text (UTF-8 in the case of a binary), not just a run of bytes to print one by one.
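The same difference can be observed without a terminal getting in the way, by formatting to a list with io_lib:format/2 and looking at the integers. This is only a sketch: the module name format_demo and its function names are hypothetical, while io_lib:format/2 and lists:flatten/1 are standard.

```erlang
-module(format_demo).
-export([hi/0, with_s/0, with_ts/0]).

%% The UTF-8 bytes of the Russian greeting used in this post.
hi() ->
    <<208,159,209,128,208,184,208,178,208,181,209,130,33>>.

%% ~s treats the binary as raw bytes: one list element per byte.
with_s() ->
    lists:flatten(io_lib:format("~s", [hi()])).

%% ~ts decodes the binary as UTF-8: one list element per code point.
with_ts() ->
    lists:flatten(io_lib:format("~ts", [hi()])).
```

with_s() gives back 13 integers, all below 256, which is what gets printed as garbage. with_ts() gives back [1055,1088,1080,1074,1077,1090,33], the 7 code points of the actual text.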

Do you want better?

3> io:format("~ts~n",[binary_to_list(<<208,159,209,128,208,184,208,178,208,181,209,130,33>>)]).
ÐÑÐ¸Ð²ÐµÑ!
ok

Ah, now it breaks again! This is because functions like list_to_binary/1, iolist_to_binary/1 and binary_to_list/1 only operate on bytes, not on characters built out of many bytes. The representation gets broken when converting. Binaries hold bytes, so they cannot store code points above 255 the way lists of integers can: even trying list_to_binary([101,769]) will crash.

Yes, if you use any of these functions on sequences that may comprise unicode strings, you're possibly breaking them.
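A quick way to see the damage is to count things. The sketch below uses a hypothetical module convert_demo; byte_size/1, binary_to_list/1 and unicode:characters_to_list/1 are the real functions.

```erlang
-module(convert_demo).
-export([hi/0, counts/0]).

%% The UTF-8 bytes of the Russian greeting used in this post.
hi() ->
    <<208,159,209,128,208,184,208,178,208,181,209,130,33>>.

%% Returns {Bytes, ByteListLength, CodePoints}.
%% binary_to_list/1 keeps one element per byte, while
%% unicode:characters_to_list/1 decodes UTF-8 into code points.
counts() ->
    Bin = hi(),
    {byte_size(Bin),
     length(binary_to_list(Bin)),
     length(unicode:characters_to_list(Bin))}.
```

convert_demo:counts() returns {13,13,7}. The 7 happens to match the number of visible characters only because this string has no combining marks; counting code points is still not counting characters.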

Thankfully, Erlang has partial support for Unicode. For such conversions, take a look at the unicode module. It contains functions such as unicode:characters_to_list/1 and unicode:characters_to_binary/1,2,3, which will help you do your conversions right. You will still have to download some library to do basic operations such as calculating the length of a string, but at least you won't be messing up all of your data.

As an example, unicode:characters_to_binary([16#65,16#301]) will yield <<101,204,129>>, an actually correct representation. The number 769 doesn't fit in a single byte, so UTF-8 encodes it as the two bytes 204 and 129. Using list_to_binary/1 will crash, but the unicode module won't. Conversely, if you get <<101,204,129>> and convert it to a string using binary_to_list/1, you'll get the list [101,204,129], which is an entirely different Unicode string. Using the unicode module, you'll get the right thing back, because that module is clever enough to decode the bytes into the right sequence of code points.
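Put together as a sketch, the round trip looks like this. The module roundtrip_demo and its function names are hypothetical; the conversion functions are the ones named above.

```erlang
-module(roundtrip_demo).
-export([encode/0, decode_right/0, decode_wrong/0, byte_convert/0]).

%% 'e' plus combining acute accent, encoded as UTF-8.
encode() ->
    unicode:characters_to_binary([16#65, 16#301]).

%% Decoding with the unicode module gives the original code points back.
decode_right() ->
    unicode:characters_to_list(encode()).

%% Decoding byte by byte gives three unrelated code points instead.
decode_wrong() ->
    binary_to_list(encode()).

%% list_to_binary/1 refuses any element above 255 and raises badarg;
%% the error is caught here only so the result can be inspected.
byte_convert() ->
    try list_to_binary([16#65, 16#301])
    catch error:badarg -> badarg
    end.
```

encode() returns <<101,204,129>>, decode_right() returns [101,769], decode_wrong() returns [101,204,129], and byte_convert() returns the atom badarg.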

So is string handling a pain in Erlang?

For ASCII and Latin-1? No. It's just a question of picking the right data structure. For Unicode? Yes. Extra care has to be taken. Use ~ts to output strings, make sure to convert data types with the right functions, and then download external string-handling libraries to get things working correctly. That last point is the one that's actually painful.

It is painful, not because of the data structures used, but because the string libraries around the language lack Unicode support. The problem is pervasive and even hits some web servers (misultin, although it should be fixed soon after the writing of this blog) and likely many applications you (yes you! the reader! hi! Привет!) and I will have built.

Hopefully, you now know that the problem exists, if you didn't before.