Tuesday 8 January 2008
Strings are perhaps the most important data type. They're probably the second most used after fixed-width integers, but they consume far more memory. Measurements of large Java apps such as Eclipse and Websphere would often show 30% or more of the live heap being consumed by strings. Not only are strings important, there are also many degrees of freedom for designing string representations and APIs, and you see large projects making a variety of choices, even though the impact of those choices is not widely understood. I believe strings are hugely understudied and there's a lot of interesting PhD-level research that could be done on strings. Tragically, strings (like email and parsing) are considered "old hat" and unlikely to attract top-class research attention. Anyway, let me jot down a few thoughts that have been festering for a long time...
UTF-16 is the devil's work. Once upon a time the Unicode consortium promised that 16 bits would be enough space for all the characters of the world's languages. Everyone believed it and, since 16 bits is not too much more than the 8 bits people were already using for Latin text, everyone designed their "Unicode" APIs to work with 16-bit characters, because that was simpler than variable-width representations. Later, though, the consortium realized they'd made a wee mistake, 16 bits wasn't going to be enough. No-one wanted to change all their APIs and using 32 bits per character is a hefty penalty for Latin text so everyone redefined their 16-bit units to mean "UTF-16 code units", i.e., some characters are represented with two code units ("surrogate pairs"). The problem is, UTF-16 thus has basically the same complexity as a good variable-width format such as UTF-8, but uses double the memory of UTF-8 for Latin text.
No-one really needs charAt. One of the major touted advantages of a uniform character size is that it makes charAt(index) trivially efficient. However, almost all the code I've ever seen that indexes into strings is actually iterating through strings (or using APIs that return indexes into strings, which could just as easily use iterators). Implementing efficient bidirectional iterators for UTF-8 is trivial. You only really need fast charAt for random access into strings, and the only code I know of that needs it is Boyer-Moore string search --- but that can be easily implemented efficiently for UTF-8 just by searching over UTF-8 code units instead of Unicode characters.
UTF-8 rules. UTF-16 is nearly as complex as UTF-8. UTF-16 is actually worse to code for because surrogate pairs are extremely rare, much rarer than multibyte UTF-8 characters, so you get code paths that are not well tested or code where surrogate pairs are not handled at all. For Latin text UTF-8 uses only half the space (note that even with non-Latin languages, programs still manipulate lots of Latin text internally), and you have the excellent property that ASCII text is valid UTF-8. In performance-critical code (such as our DOM implementation) you often see optimizations that switch between UTF-16 or "UCS-1" (text that's all characters 0-255, 8 bits per character), with code duplication for the "8-bit path" and the "UTF-16 path". With UTF-8 none of this is necessary, which is really good --- code duplication is evil, it requires new families of string types, forces programmers to make choices about string representations, and requires conversions that are potentially costly in code and time. UTF-8 is also easier to migrate to from ASCII, you just keep your 8-bit characters and redefine their meaning. The great irony is that platforms that implemented Unicode support late, such as the Linux kernel and glib/GTK+, saw all this and correctly chose UTF-8, while the early Unicode adopters (and those who must interoperate with them) are saddled with UTF-16 baggage.
charAt isn't going away. Tragically a lot of APIs, including JS APIs required for Web compatibility, rely heavily on charAt (although fortunately regexps are getting used more and more). Because it's actually indexing a UTF-16 buffer, charAt has the horrible specification "returns the n'th code unit in the UTF-16 representation of the string", and of course most code doesn't even bother trying to handle surrogate pairs correctly. Nevertheless, we're stuck with it, so we need an efficient implementation of UTF-16 charAt over a UTF-8 representation. The most obvious approach is some simple index caching: cache the results of charAt operations in (string, UTF-16 index, UTF-8 index) triples (e.g. by storing a UTF-16 index and a UTF-8 index in each string, or using a hash table). The charAt implementation can then look up the cache; "near matches" in the UTF-16 index in either direction can be used to derive the new charAt result. For simple code that's just incrementing or decrementing the index, this will work pretty well. (You'll need a little bit of hacking to handle cases when the UTF-16 index is in the middle of a surrogate pair, but it doesn't look hard.)
You could make that work even better with some compiler wizardry. You want to identify integer variables that are being used as UTF-16 indices, and associate with each such variable a "shadow" UTF-8 index for each string that the variable indexes. These are a lot like regular loop induction variables. Then you add code to keep these variables in sync and use the UTF-8 indices for the actual string operations. If you do a good job you could get rid of the UTF-16 indices entirely in many cases. This approach extends to other APIs like "find" too.
There are a lot more choices to make in a string API beyond just the character representation. For example:
- Should strings be mutable or immutable?
- Should mutable strings be append-only or randomly mutable? (I don't see a need for random-access mutation although many APIs support it)
- Should strings be able to share buffers (for fast substring operations)?
- Should strings be able to span multiple buffers (for fast concatentation)?
- Should you intern strings (i.e., force strings with the same contents to share a single buffer)? If so, when and how?
- How should string buffer memory be managed?
- How should thread safety be handled?
- Should string buffers be null terminated, or have an explicit length field, or both?
I don't have strong opinions on those questions, although I think it would be very interesting to try just having a single string type that was UTF-8, mutable but append-only, no buffer sharing, spanning, or interning, compile-time (i.e. template) configurable with a certain number of characters directly in the string (nsAutoString style), falling back to the heap, no built-in thread safety, null terminated with an explicit length in bytes. Such a string would probably use a (length, buffer size, buffer pointer, 0-or-more-auto-bytes) format, and it would be fun to try optimizing the auto-buffer case by using the buffer size and buffer pointer memory as character data.
Experiments I'd like to see:
- What percentage of memory is used by strings in Gecko (or whatever your favourite program is)?
- How much of that memory is UTF-16 code units with a high zero byte?
- How much memory would be saved (or wasted) if we used UTF-8?
- What statistical patterns are there in API usage? Is there a clear need for different string types that optimize for different patterns?
- Measure the code, time and space impacts of varying the above decisions!
The "memory" measurements should probably be memory-time products. For Gecko, you'd want to look at a variety of workloads, e.g., including a CJK-heavy page set.
Update I've updated the post to use the term "code units" where I had incorrectly mentioned "code points". Also, I want to mention that most strings manipulated by Gecko are small and so constant factors matter: simplicity of implementation is important for performance and for programmer understanding. The ability to stack-allocate small strings is also important.