Wednesday, 12 September 2007

Trust No One

When is an ASCII space (0x20) not a word separator?

When it's followed by a combining mark (e.g., COMBINING ACUTE ACCENT a.k.a. Unicode character 0x301).

According to ATSUI, anyway. Uniscribe disagrees and refuses to combine marks with space characters. It will allow combination if you stick a ZWJ (0x200D) in between. Gah!
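To make the sequences concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The helper names `describe` and `split_words` are made up for illustration. `split_words` only mimics the behaviour described above for ATSUI (a space followed by a combining mark is not treated as a separator); it is not ATSUI's or Uniscribe's actual algorithm.

```python
import unicodedata

SPACE = "\u0020"
ACUTE = "\u0301"   # COMBINING ACUTE ACCENT
ZWJ = "\u200d"     # ZERO WIDTH JOINER

def describe(text):
    """List (code point, name, canonical combining class) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in text
    ]

def split_words(text):
    """Hypothetical splitter: a space is a separator unless the next
    character is a combining mark (nonzero combining class)."""
    words, current = [], []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        followed_by_mark = bool(nxt) and unicodedata.combining(nxt) != 0
        if ch == SPACE and not followed_by_mark:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words

if __name__ == "__main__":
    print(describe(SPACE + ACUTE))            # the ATSUI case
    print(describe(SPACE + ZWJ + ACUTE))      # the Uniscribe workaround
    print(split_words("a b"))
    print(split_words("a" + SPACE + ACUTE + "b"))
```

With this rule, `"a b"` splits into two words, while `"a"`, space, U+0301, `"b"` stays as a single run because the space carries the accent.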

We've also discovered that ATSUI's font fallback machinery often likes to choose different fonts for the mark and the character it combines with. Madness!

This is life working on Web browsers: the environment is so complex, any assumptions you make will be violated sooner or later.


  1. How about performing Unicode normalization prior to display? (Sample code is available on the Unicode website.) This would result in a single pre-combined character. Or would this mess up text selection, copy and paste, etc.?
    Anyway, no-one uses combining characters, except perhaps in other character sets like ANSEL.

  2. It's not that mad for ATSUI to choose a different font for the base-glyph and the combining-glyph, is it? If neither a pre-combined glyph nor a suitable combining-glyph is available in the preferred font, the next-best alternative would be to pick a base-glyph from the preferred font, and find a combining-glyph that more or less matches it.
    Grabbing a pre-combined glyph (or a base-glyph and a combining-glyph) from some other font might help ensure that the resulting character looks good on its own, but it practically guarantees that it will look awkward and eye-catching among the surrounding text.

  3. RichB: not all combining pairs have precomposed forms, so normalization won't help.
    Saying "no-one will do this in real life" doesn't work; authors do all kinds of crazy stuff and we have to handle it as best we can.
    Screwtape: the thing is, if the glyphs are from different fonts, how can the mark be correctly positioned? I assume each font contains tables showing where marks should be attached to base glyphs.
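
The point about normalization in the comments above can be checked with a small Python sketch using the standard `unicodedata` module. The helper name `composes` is made up for illustration, and the two base letters are just examples: NFC turns "e" plus U+0301 into one precomposed character, but leaves "q" plus U+0301 as two code points because Unicode has no precomposed form for that pair.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

def composes(base, mark=ACUTE):
    """Return True if NFC merges base + mark into a single code point."""
    return len(unicodedata.normalize("NFC", base + mark)) == 1

if __name__ == "__main__":
    for base in ("e", "q"):
        result = unicodedata.normalize("NFC", base + ACUTE)
        print(base, [f"U+{ord(ch):04X}" for ch in result], composes(base))
```

So normalizing before display reduces the number of combining sequences a renderer sees, but it cannot remove them all.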