Eyes Above The Waves

Robert O'Callahan. Christian. Repatriate Kiwi. Hacker.

Tuesday 8 January 2008

String Theory

Strings are perhaps the most important data type. They're probably the second most used after fixed-width integers, but they consume far more memory. Measurements of large Java apps such as Eclipse and WebSphere often show 30% or more of the live heap being consumed by strings. Not only are strings important, but there are also many degrees of freedom in designing string representations and APIs, and you see large projects making a variety of choices, even though the impact of those choices is not widely understood. I believe strings are hugely understudied and there's a lot of interesting PhD-level research that could be done on strings. Tragically, strings (like email and parsing) are considered "old hat" and unlikely to attract top-class research attention. Anyway, let me jot down a few thoughts that have been festering for a long time...

UTF-16 is the devil's work. Once upon a time the Unicode consortium promised that 16 bits would be enough space for all the characters of the world's languages. Everyone believed it and, since 16 bits is not too much more than the 8 bits people were already using for Latin text, everyone designed their "Unicode" APIs to work with 16-bit characters, because that was simpler than variable-width representations. Later, though, the consortium realized they'd made a wee mistake: 16 bits wasn't going to be enough. No-one wanted to change all their APIs, and using 32 bits per character is a hefty penalty for Latin text, so everyone redefined their 16-bit units to mean "UTF-16 code units", i.e., some characters are represented with two code units ("surrogate pairs"). The problem is, UTF-16 thus has basically the same complexity as a good variable-width format such as UTF-8, but uses double the memory of UTF-8 for Latin text.

No-one really needs charAt. One of the major touted advantages of a uniform character size is that it makes charAt(index) trivially efficient. However, almost all the code I've ever seen that indexes into strings is actually iterating through strings (or using APIs that return indexes into strings, which could just as easily use iterators). Implementing efficient bidirectional iterators for UTF-8 is trivial. You only really need fast charAt for random access into strings, and the only code I know of that needs it is Boyer-Moore string search --- but that can be easily implemented efficiently for UTF-8 just by searching over UTF-8 code units instead of Unicode characters.
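For concreteness, here's a minimal sketch of why bidirectional UTF-8 iteration is trivial (Python over a bytes buffer; the helper names utf8_next/utf8_prev are mine, not any real API): continuation bytes always match the bit pattern 10xxxxxx, so stepping in either direction is just a matter of skipping them.

```python
def utf8_next(buf, i):
    """Advance byte index i past one UTF-8-encoded character."""
    i += 1
    # Skip continuation bytes (10xxxxxx).
    while i < len(buf) and (buf[i] & 0xC0) == 0x80:
        i += 1
    return i

def utf8_prev(buf, i):
    """Move byte index i back to the start of the previous character."""
    i -= 1
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i

buf = "naïve".encode("utf-8")   # 'ï' occupies bytes 2-3
# utf8_next(buf, 2) -> 4 (steps over the two-byte 'ï')
# utf8_prev(buf, 4) -> 2
```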

UTF-8 rules. UTF-16 is nearly as complex as UTF-8. UTF-16 is actually worse to code for because surrogate pairs are extremely rare, much rarer than multibyte UTF-8 characters, so you get code paths that are not well tested or code where surrogate pairs are not handled at all. For Latin text UTF-8 uses only half the space (note that even with non-Latin languages, programs still manipulate lots of Latin text internally), and you have the excellent property that ASCII text is valid UTF-8. In performance-critical code (such as our DOM implementation) you often see optimizations that switch between UTF-16 and "UCS-1" (text that's all characters 0-255, 8 bits per character), with code duplication for the "8-bit path" and the "UTF-16 path". With UTF-8 none of this is necessary, which is really good --- code duplication is evil: it requires new families of string types, forces programmers to make choices about string representations, and requires conversions that are potentially costly in code and time. UTF-8 is also easier to migrate to from ASCII: you just keep your 8-bit characters and redefine their meaning. The great irony is that platforms that implemented Unicode support late, such as the Linux kernel and glib/GTK+, saw all this and correctly chose UTF-8, while the early Unicode adopters (and those who must interoperate with them) are saddled with UTF-16 baggage.
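The space claims are easy to check directly (Python; the sample strings are arbitrary):

```python
def sizes(s):
    """Return (UTF-8 byte count, UTF-16 byte count) for a string."""
    return len(s.encode("utf-8")), len(s.encode("utf-16-le"))

latin = "The quick brown fox"
cjk = "明報新聞網"   # five BMP CJK characters

# sizes(latin) -> (19, 38): UTF-8 is half the size for Latin text.
# sizes(cjk)   -> (15, 10): 3 bytes/char in UTF-8 vs 2 in UTF-16,
# a 1.5x penalty for pure CJK text.
```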

charAt isn't going away. Tragically a lot of APIs, including JS APIs required for Web compatibility, rely heavily on charAt (although fortunately regexps are getting used more and more). Because it's actually indexing a UTF-16 buffer, charAt has the horrible specification "returns the n'th code unit in the UTF-16 representation of the string", and of course most code doesn't even bother trying to handle surrogate pairs correctly. Nevertheless, we're stuck with it, so we need an efficient implementation of UTF-16 charAt over a UTF-8 representation. The most obvious approach is some simple index caching: cache the results of charAt operations in (string, UTF-16 index, UTF-8 index) triples (e.g. by storing a UTF-16 index and a UTF-8 index in each string, or using a hash table). The charAt implementation can then look up the cache; "near matches" in the UTF-16 index in either direction can be used to derive the new charAt result. For simple code that's just incrementing or decrementing the index, this will work pretty well. (You'll need a little bit of hacking to handle cases when the UTF-16 index is in the middle of a surrogate pair, but it doesn't look hard.)
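A rough sketch of the index-caching idea (Python; the class and method names are invented for illustration): the string carries one cached (UTF-16 index, UTF-8 index) pair, and charAt walks forward from it, so sequential access costs O(1) amortized. The case where the requested index falls in the middle of a surrogate pair is deliberately ignored here.

```python
class CachedString:
    """UTF-8 buffer with one cached (utf16_index, utf8_index) pair,
    so sequential charAt calls don't rescan from the start."""
    def __init__(self, s):
        self.buf = s.encode("utf-8")
        self.u16, self.u8 = 0, 0   # cached correspondence

    def _char_len(self, i):
        """(UTF-8 bytes, UTF-16 units) of the character starting at byte i."""
        b = self.buf[i]
        if b < 0x80: return 1, 1
        if b < 0xE0: return 2, 1
        if b < 0xF0: return 3, 1
        return 4, 2                # outside the BMP: a surrogate pair

    def char_at(self, n):
        u16, u8 = self.u16, self.u8
        if n < u16:                # backward move: restart (simplest policy)
            u16, u8 = 0, 0
        while u16 < n:             # walk forward from the cached position
            nbytes, nunits = self._char_len(u8)
            u8 += nbytes; u16 += nunits
        self.u16, self.u8 = u16, u8   # update the cache
        nbytes, _ = self._char_len(u8)
        return self.buf[u8:u8 + nbytes].decode("utf-8")
```

A "near match" in the backward direction could walk backward instead of restarting; this sketch only shows the forward path.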

You could make that work even better with some compiler wizardry. You want to identify integer variables that are being used as UTF-16 indices, and associate with each such variable a "shadow" UTF-8 index for each string that the variable indexes. These are a lot like regular loop induction variables. Then you add code to keep these variables in sync and use the UTF-8 indices for the actual string operations. If you do a good job you could get rid of the UTF-16 indices entirely in many cases. This approach extends to other APIs like "find" too.
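Done by hand, the transformation looks something like this (Python sketch; the function and its task are invented for illustration): the UTF-16 induction variable i is kept in sync with a shadow UTF-8 byte index j, and only j actually touches the buffer.

```python
def find_utf16_index_of_space(buf):
    """Return the UTF-16 index of the first space in a UTF-8 buffer.
    i is the UTF-16 index the API must report; j is its shadow UTF-8
    byte index, which drives the scan. A compiler doing this
    transformation could delete i entirely when nothing observes it."""
    i = j = 0
    while j < len(buf):
        if buf[j] == 0x20:
            return i
        b = buf[j]
        if b < 0x80:   j += 1; i += 1
        elif b < 0xE0: j += 2; i += 1
        elif b < 0xF0: j += 3; i += 1
        else:          j += 4; i += 2   # astral character = surrogate pair
    return -1
```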

There are a lot more choices to make in a string API beyond just the character representation. For example:

  • Should strings be mutable or immutable?
  • Should mutable strings be append-only or randomly mutable? (I don't see a need for random-access mutation although many APIs support it)
  • Should strings be able to share buffers (for fast substring operations)?
  • Should strings be able to span multiple buffers (for fast concatenation)?
  • Should you intern strings (i.e., force strings with the same contents to share a single buffer)? If so, when and how?
  • How should string buffer memory be managed?
  • How should thread safety be handled?
  • Should string buffers be null terminated, or have an explicit length field, or both?

I don't have strong opinions on those questions, although I think it would be very interesting to try just having a single string type that was UTF-8, mutable but append-only, no buffer sharing, spanning, or interning, compile-time (i.e. template) configurable with a certain number of characters directly in the string (nsAutoString style), falling back to the heap, no built-in thread safety, null terminated with an explicit length in bytes. Such a string would probably use a (length, buffer size, buffer pointer, 0-or-more-auto-bytes) format, and it would be fun to try optimizing the auto-buffer case by using the buffer size and buffer pointer memory as character data.

Experiments I'd like to see:

  • What percentage of memory is used by strings in Gecko (or whatever your favourite program is)?
  • How much of that memory is UTF-16 code units with a high zero byte?
  • How much memory would be saved (or wasted) if we used UTF-8?
  • What statistical patterns are there in API usage? Is there a clear need for different string types that optimize for different patterns?
  • Measure the code, time and space impacts of varying the above decisions!

The "memory" measurements should probably be memory-time products. For Gecko, you'd want to look at a variety of workloads, e.g., including a CJK-heavy page set.

Update: I've updated the post to use the term "code units" where I had incorrectly mentioned "code points". Also, I want to mention that most strings manipulated by Gecko are small and so constant factors matter: simplicity of implementation is important for performance and for programmer understanding. The ability to stack-allocate small strings is also important.


Kenneth Waters
After spending a lot of time looking at various ways to represent strings, I'm a big fan of ropes for a non-interned string representation. Ropes are immutable, have O(1) substring and concatenation, and are thread-safe. They tend to fragment memory and work best with GC, though.
Interned strings are odd. I think X11 does a great job with interned strings, but that seems like a niche use. Someone should see how much contention there is on the mutexes around Java's string interning table in a big multi-threaded app. How often are you really comparing strings or using them as dictionary keys? I suppose you could just store a hash of the string in its representation; that would get you fast hashing and comparing (although not as fast as interning).
Robert O'Callahan
There have been studies of schemes that intern strings during GC. You can use a good concurrent hash table to get around the threading issues. Not a bad idea but interning in general requires an immutable string API, and I'm not sure that's the way to go.
Ropes are pretty heavyweight. In Gecko, most of the time we're working with short strings and I fear ropes are not very efficient for that (although I'd like to see some experiments). Constant factors matter.
Darin Fisher
Isn't UTF-16 a more compact representation for many scripts? CJK in particular.
Marko Macek
I like:
- immutable
- non-zero terminated
- shared/refcount (fast substring/split/iteration)
- UTF-8
- a separate append-only StringBuffer for incremental creation
I'm not sure if thread safety is needed.
This is a very nice model for C++ code, but I'm not sure if it's useful for higher layer code (javascript).
Robert O'Callahan
Darin: yeah, but look at the source for a Chinese Web page (e.g. mingpao.com). There's far more ASCII than Chinese (often mixed in the same DOM nodes, even). And even for the CJK characters, we're only talking about 3 bytes per character instead of 2. We should measure it but I doubt there's a big win for UTF-16.
Marko: it turns out that using the same representation for Javascript as for C++ would provide a nice speedup for DOM API calls.
Robert O'Callahan
PS Darin, you should write a retrospective on the design and evolution of XPCOM strings :-).
Would it be worthwhile having an ASCII/UTF-8 flag bit on your string so that you can do fast charAt, ToLowerCase etc. on ASCII strings? (As I recall ToLowerCase is slow on UTF-8, as you have to convert K to k for instance.)
Isn't a "code point" always the same, regardless of encoding? E.g. U+10000 "LINEAR B SYLLABLE B008 A": its code point is 10000. Its representation in UTF-16 is composed of two code *units*, 0xd800 and 0xdc00. Likewise its UTF-8 representation has four code units, 0xf0 0x90 0x80 0x80.
If the specification of charAt really refers to "code points" then that onerous hackery with UTF-16 indexing isn't necessary and you can just find the code point normally from the UTF-8 representation.
If, however, it refers to code units, then it's as bad as you say.
And lastly, if it refers to something as ambiguous as a "character" you can do whatever you wish. But I suppose you knew this already. ;-)
Robert O'Callahan
Deewiant: sorry, everywhere I said "code point" I meant "code unit". I'll fix that.
Neil: that might be useful, but it has the unfortunate effect that code that's O(N) on ASCII text could become O(N^2) as soon as a non-ASCII character is tested. That's a bit too surprising for the programmer for my taste. Not sure what the problem is with ToLowerCase in UTF-8, it's linear time at least and you could easily implement an approach that's in-place for ASCII text (just scan along doing ToLowerCase in place, and when you find the first non-ASCII character you start a copying approach at that point).
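A sketch of that scan-then-copy approach (Python over a bytearray; not Gecko code — full Unicode case mapping can change byte lengths, e.g. the three-byte Kelvin sign K lowercases to one-byte 'k', which is exactly why the non-ASCII tail needs the copying path):

```python
def to_lower_utf8(buf):
    """Lowercase a UTF-8 bytearray: mutate the ASCII prefix in place,
    then fall back to a copying decode/lower/encode for the rest."""
    for i, b in enumerate(buf):
        if b >= 0x80:
            # First non-ASCII byte: switch to the copying approach here.
            tail = bytes(buf[i:]).decode("utf-8").lower().encode("utf-8")
            return bytes(buf[:i]) + tail
        if 0x41 <= b <= 0x5A:          # 'A'..'Z'
            buf[i] = b + 0x20          # in-place ASCII lowercase
    return bytes(buf)
```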
roc, what are your thoughts re: interning on the setup Gecko actually uses on the C++ side? That is, the ability to intern strings as needed (nsIAtom)?
I admit that it kinda sucks that certain APIs require the use of interned strings while others require non-interned ones, and that interning often includes charset conversions. In some ways it would be very nice if our interned strings were simply a subclass of nsString or some such, with an overridden operator== and possibly lazy internment as needed or something.
Robert O'Callahan
The value of automatic interning is highly workload dependent --- my research on "Object Equality Profiling" showed that, although it's obvious anyway. The first incarnation of .NET did aggressive automatic interning but Microsoft backed off in later versions.
But yes, it would be super if manual interning gave you a regular string or something compatible with a regular string. That should be easy, just have nsAtom contain an nsString and offer a method returning a const nsString&. If our strings were UTF-8 then this would work great.
Leon Mergen
In my opinion, thread safety should be up to the programmer who uses the string class. Like you already stated, strings are used a LOT in programs, and thread safety is not an issue in a lot of cases -- a lot of unnecessary overhead would be added in places where it will not be used, and in the places where it *does* get used, other locking mechanisms will probably have to be used anyway.
Other than that, I *really* like the idea of shared buffers between strings, for fast copy, substring and concat. Memory can be saved, and the heap is used less often. Is anyone aware of existing implementations of that yet?
Robert O'Callahan
I agree that thread safety should not be part of a string API.
Shared buffers add complexity, especially when you go to strings that can span multiple buffers (for efficient concatenation). We used to support buffer-spanning strings in Gecko but Darin took them out, for a significant perf win. I would really like to see data that shows whether/when buffer sharing is worthwhile. GC and refcounting environments let you share strings trivially, and C++ const references let you share strings a lot too, so I'm not sure that the added benefit of buffer sharing (for fast substring, mostly) is worth it.
If strings can share buffers, then they have to provide thread-safety also: what happens when two different string variables, sharing the same buffer, are modified concurrently? User code can't know that they share the same buffer and need to be locked.
Robert O'Callahan
That's true and is a good argument against buffer sharing, IMHO. Unless you make strings immutable as Java and .NET do.
We really didn't need to know that.
Marko Macek
Here's an implementation that's useful if you like doing lots of substring/split/parsing:
It doesn't do cheap concat, but as said above, it's often not an optimization. It does cheap substring and split.
It's a ref-counted buffer, used by the string struct that handles substrings, ... Some performance optimization of refcounting is probably needed (for iteration, mostly).
It's not currently UTF-8/unicode oriented (uses plain bytes), but that could be fixed.
The strings are immutable (Since javascript has immutable strings, I see no downside to this).
It's probably sensible to always do a 'copy' when you have a substring and are assigning a new value to a class member (as opposed to stack assignment/passing; I wish C++ had a way to differentiate this).
The 'cstring' class is used for passing strings to C functions and uses a trick: while the base string class is not guaranteed to be zero terminated it actually does always allocate one byte extra for '0', removing the need to do a 'copy' + append '0' in most cases (for safety, you would probably need to check if there's an early zero-terminator).
For threading there are two issues:
- atomic reference counting (easily added, but costs performance).
- atomic assignment is not possible, since this is a struct (this may or may not be a problem).
For mutable strings, there should be a separate class that has nothing shared (but potentially allows the string class to steal its buffer without copying).
Well, that's my take on string theory.
"the only code I know of that needs it is Boyer-Moore string search --- but that can be easily implemented efficiently for UTF-8 just by searching over UTF-8 code units instead of Unicode characters"
Hmm... Isn't the whole point of Boyer-Moore to not scan, say, characters 0 to 17 when checking the 18th is sufficient? Scanning bytes to check for UTF-8 multi-byte units seems to be as expensive as checking bytes for equality, whose avoidance was the specific purpose of Boyer-Moore.
Why is it that Christian offends you, but Kiwi doesn't? That's part of his identity, and to him Christian is as important as Kiwi or hacker.
Go create a blog with "Atheist" or whatever you identify with. Don't bug others because you don't like what they identify with.
Robert O'Callahan
Jerome: what I'm suggesting is that you convert the search string and the searched string to UTF-8 and then run Boyer-Moore over those bytes. It should work reasonably well because if the input Unicode characters are fairly random the UTF-8 bytes are also reasonably well-distributed. (It wouldn't work if UTF-8 used, say, 0xFF a lot to indicate something special.)
Robert O'Callahan
Hmm, but I guess if you use a lot of characters from the same small Unicode range (pretty common), you'll get a single prefix byte that will occur very frequently and stuff up Boyer-Moore.
On the other hand though, if you did implement Boyer-Moore for Unicode characters, you'd need to build some huge tables. So I don't think Boyer-Moore really works well for Unicode no matter how you do it.
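For the record, a byte-level Boyer-Moore-Horspool (a simplified Boyer-Moore variant; Python sketch, my own code) shows why the skip table stays small over UTF-8 code units: 256 entries, versus the huge per-character tables a Unicode-character version would need. Note that UTF-8's self-synchronization means a valid needle can only ever match at a character boundary, since lead bytes and continuation bytes never overlap.

```python
def horspool_find(haystack, needle):
    """Boyer-Moore-Horspool over raw bytes (e.g. UTF-8 code units).
    Returns the byte offset of the first match, or -1."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Skip table: for each byte value, how far to shift on a mismatch.
    skip = [m] * 256
    for k in range(m - 1):
        skip[needle[k]] = m - 1 - k
    i = 0
    while i <= n - m:
        if haystack[i:i + m] == needle:
            return i
        i += skip[haystack[i + m - 1]]
    return -1
```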
Aside from charAt(), code units are exposed via the .length property, charCodeAt(), indexOf()/lastIndexOf(), slice()/substring(), RegExp.lastIndex, and the .index property of arrays returned by RegExp.exec().
Also, in ECMAScript Edition 4, it is proposed that strings will be indexable, sort of like in Python:
"hello world"[6] == "w"
Of course all these methods can use the same cache of string offsets.
But no matter what you do here, all this stuff will be slower than an implementation that just uses UTF-16. Just not O(N) slower.
Robert O'Callahan
> Also, in ECMAScript Edition 4, it is proposed
> that strings will be indexable, sort of like in
> Python:
> "hello world"[6] == "w"
That's evil! That basically forces implementations to use UCS-4 if you decree that JS characters are Unicode characters, or UTF-16 if you decree that they're UTF-16 code units. (The latter case is really bad though because it would let people create invalid surrogate pairs or else give you weird error cases to check and handle.) I really strongly encourage you to revisit that decision.
> But no matter what you do here, all this stuff
> will be slower than an implementation that just
> uses UTF-16. Just not O(N) slower.
Not necessarily. For ASCII text UTF-8 is more compact and therefore a performance win.
Newspaper-style text columns on webpages are the devil's work. And three or more is just taking the piss...
Jason Orendorff
> That's evil!
Maybe, but the cat was already out of the bag. ECMAScript Edition 3 explicitly allows you to create invalid surrogate pairs. '(\ud800)', for example, is allowed. ES3 strings are unrestricted sequences of 16-bit values, by spec. I'm pretty sure Edition 4 will not change this.
Anyway, you can encode unpaired surrogates in UTF-8, if you don't mind having total garbage in your strings.
> For ASCII text UTF-8 is more compact and therefore a performance win.
Right, plus you probably convert less. By "all this stuff" I just meant the operations mentioned -- the ones that are a fast array lookup in UTF-16 and a hash table lookup plus extra work in UTF-8.
Joshua Cranmer
Some points about strings from my experience with Java.
Java, I feel, has the best string implementation of the major languages. Much of this is probably due to it having one string type and not several (char *, wchar_t *, string, wstring), although some people have complained about this. I would like to point out that said complaints most often arise in the context of "Java cannot print out 100,000 lines of output as quickly as C", but I digress.
It is my opinion that strings should be immutable, like any other type of object whose primary usage lies in communication between different steps of the system while being retained in the state. However, an efficient builder should be provided (like Java's StringBuilder and not like StringBuffer--the builder should be optimized for speed and not thread-safety).
Random access is a bit of a thorny point. Coming from a Java background, I have a tendency to use code like the following:
int index = reference.indexOf(":");
String protocol = reference.substring(0,index);
String rest = reference.substring(index+1);
That sort of usage is one that API designers must expect people to use and abuse. I would think that caching the result of a lookup for a random access would improve performance, since most usages of a random-access would be immediately followed by an operation around said access point.
Robert O'Callahan
That's not random access. Think about it: first you scan from the start of the string looking for a ":". Then you make a substring from the start of the string up to the point you got to, and another substring from that point to the end. You could just as easily write
StringIterator index = reference.indexOf(":");
String protocol = reference.substring(0,index);
String rest = reference.substring(index.next());
This would work just fine with UTF8 and no caching, StringIterator just keeps its internal index in bytes.
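A sketch of that API shape (Python; U8String and its methods are invented for illustration — the "iterator" here is just a byte offset into the UTF-8 buffer):

```python
class U8String:
    """UTF-8 string whose search results are byte positions, so
    substring never needs a UTF-16 index or any caching."""
    def __init__(self, data):
        self.buf = data if isinstance(data, bytes) else data.encode("utf-8")

    def index_of(self, sep):
        """Byte offset of sep, or -1. Stands in for the StringIterator."""
        return self.buf.find(sep.encode("utf-8"))

    def substring(self, start, end=None):
        return U8String(self.buf[start:len(self.buf) if end is None else end])

    def __str__(self):
        return self.buf.decode("utf-8")

ref = U8String("http://example.org")
i = ref.index_of(":")
protocol = ref.substring(0, i)   # "http"
rest = ref.substring(i + 1)      # "//example.org"; i + 1 plays index.next()
```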
Encoding unpaired surrogates in UTF-8 probably deserves quite a few evilness points, but it is nonetheless a well-described usage pattern with a name, "CESU-8", and a dedicated Unicode technical report ;-)
The doctor's prescription is internal use only ;-)
It's certainly the best solution to keep API level UTF-16 compatibility without performance cost.
Phil Endecott
Hi Robert,
Thanks for an interesting post, which happens to cover exactly the subjects that I have been worrying about recently. I'm the author of Anyterm, which is a terminal on a web page; see the demos at http://anyterm.org/demos.html. It has a C++ server that does terminal emulation, converts the screen to HTML, and then diffs new screens against old; the "edit script" from the diff is sent to the Javascript in the browser which "patches" the HTML terminal that it's showing. There are lots of strings in there, and in the latest release I finally got around to making it work properly with different character sets (or so I thought).
So, first of all with my Javascript hat on, I'm really horrified to learn that JS string indices are UTF-16 code units! Ugh! My code certainly doesn't get that right. Is it really too late to go and fix that? So Firefox will let me chop a string in two:
s1 = s.substr(0,n); s2 = s.substr(n);
giving potentially invalid strings if the split was in the middle of a UTF-16 character, and then join them back together:
assert(s1+s2 == s);
and have what I started with??? And how many other browsers get that right??? I would guess that the amount of code that's broken in the face of that situation must be greater than the code that relies on it, so "fixing" the spec would probably make more sites work, not fewer.
With my C++ user's hat on: partly because of this project, I decided to put together a "string tagged with its character set" class, and I've discussed this on the Boost list. In my proposal, a UTF-8 string would have three types of iterator:
- A mutable random-access iterator over units (i.e. bytes).
- A const bidirectional iterator over characters.
- A mutable output ("append") iterator over characters.
This gives you everything that you can do with O(1) iterator de-referencing, and nothing that you can't.
You can read about this if you search the Boost list archive from last September and October, and look for my name.
Sorry this has got so long. Thanks for the thought-provoking article.
Robert O'Callahan
> Is it really too late to go and fix that?
Yes. Indexing into strings is interpreted as UTF-16 code units, Web compatibility depends on it.
> And how many other browsers get that right???
The major ones do, I guess.
Your proposal sounds good, except I'd avoid having strings lying around with different character sets. Other character sets should be converted into UTF-8 as early as possible.
Returning to Neil's idea, would you not assume each string is UTF-8 to begin with? But if and when code determines a string is "ASCII", e.g. to implement the functions Jason cites, then set the ASCII (more accurately, eachCodeUnitInHereIsOneByte?) flag.
I seem to recall there was a lot of cleanup of the strings in Gecko maybe around 1.7 (nsACString, nsAutoString, NS_NAMED_LITERAL, etc.), but I can't find the informative posts from back then. http://developer.mozilla.org/en/docs/XPCOM_string_guide is still hella complex.
Ahh, strings. My personal opinion is that there are only two representations that make any sense at all. UCS-4 and UTF-8.
UTF-16 is actually just what happened when everyone looked at the initial Unicode spec and thought that what is now the BMP was all anyone would even need. Then it turned out it wasn't, but all that software from the early to mid 90s was stuck with UCS-2, and so the move to UTF-16 (vs. a move to UCS-4) looked like the lesser of two evils. Plus if you had clients relying on your APIs, you were painted into a corner more still.
But if I look at the average amount of RAM that a computer had in 1991 (which is when Unicode 1.0 came out), it was 4M. I'm sure some people suggested that 32-bit characters would be OK back then, but were shouted down by the (then legitimate) concern that we didn't have that kind of memory to burn. So UCS-2 was a pragmatic choice. But today? The machines I'm ordering have 4G in them - three orders of magnitude larger. Is doubling the size of the basic character unit really a problem? I suspect it's just a vague feeling of wastefulness that has people still against it, whereas the reality is that UCS-4 would make a minimal impact on the vast majority of programs. And, as roc implies, UTF-16 is evil.
UTF-8 is an interesting direction as well. Well, internally anyway - the world is, for the most part, now UTF-8 externally, and that is as it should be. But to base your app on UTF-8 strings? Definitely a rarer move, although I remember when Bram brought Unicode to Vim by moving the internal representation to UTF-8. It turns out that it's technically a fine decision. Most people will recoil from the idea, but I think their opinions are based on feeling rather than fact, much as with UCS-4.
None of this alleviates the issues you have to deal with with Unicode. Array indexing for actual characters in the face of combining diacriticals, and other wackiness, is not going to be made easy like it is for ASCII no matter what string format you choose. There's still an enormous amount of work that Unicode forces you to do, but at least you shouldn't have to put up with UTF-16 at the same time. Fast and slow paths is one way to go to remove the performance penalty in 99% of strings, but there are many other ways to address the low level issues.
Robert O'Callahan
> UTF-16 is actually just what happened when
> everyone looked at the initial Unicode spec
You're restating what I wrote, right?
> the reality is that UCS-4 would make a minimal
> impact on the vast majority of programs
I don't agree. Like I said, measurement of large apps shows a very significant fraction of the live heap is strings. Doubling that would cause considerable pain.
> Array indexing for actual characters in the
> face of combining diacriticals
It turns out you never really have to do that, that is a corollary of my "No-one really needs charAt" paragraph.
> You're restating what I wrote, right?
I like to think of it as "agreeing with you".
Only with a bit of historical perspective to indicate that the initial UCS-2 decision was made at a time when memory was microscopic in comparison to today. 4M of RAM total in a 1991 machine. The Firefox 3 beta in front of me takes 10 times that or more just to start up today.
As Maximillian Cohen once said, "11:15, restate my assumptions."
String s = "nabraj";
Is this initialization static or dynamic? How does it work internally, given that String is a class and we are just assigning a literal to it?
Would you please clarify this?
I'm just waiting for the day it becomes clear that sixteen planes is not enough. UTF-8 and UCS-4 as originally specified wouldn't even blink, but we had to go break them for compatibility with UCS-2/UTF-16...
Just reject adding extra characters.
After they did it with Klingon, I prefer using only ASCII
Summary of this thread:
UTF-8 is nice, but due to broken JavaScript API (and other) specs, UTF-16 (née UCS-2) is required for performance.
I do prefer immutable strings.