Hi folks,

There was a side conversation in Wednesday's telecon about the types of casing and segmentation/clusterization operations that we might consider standardising. Since I have implementation experience in this area, I wanted to offer a warning/suggestion about two operations that sound simple: word breaking and title casing (which depends on word breaking).

For alphabetic writing systems, word breaking is pretty much what you'd expect: keep going until you find whitespace, punctuation, or any other code point that doesn't belong in a word. Naturally, it's a bit more complicated than this, but not too bad.
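
As a rough sketch of that naive approach (Python, purely illustrative; `simple_words` is a made-up name, and "belongs in a word" is approximated here by `str.isalnum`, which is an assumption rather than what any real library does):

```python
def simple_words(text: str) -> list[str]:
    """Split text into runs of alphanumeric code points."""
    words = []
    current = []
    for ch in text:
        if ch.isalnum():
            # Still inside a word: keep going.
            current.append(ch)
        elif current:
            # Hit whitespace/punctuation: close the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


print(simple_words("This sentence is English."))
# ['This', 'sentence', 'is', 'English']
print(simple_words("don't stop"))
# ['don', 't', 'stop']
```

The second call shows one of the "bit more complicated" cases: an apostrophe inside a word needs a special rule, which this sketch does not have.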

However, some non-alphabetic writing systems (like Chinese) are often written with no word separators. The reader of the language is expected to understand where the word breaks would fall based on plausible readings.

To demonstrate, I entered "This sentence is Chinese." into Google Translate and it provided "这是的句话是中文。" (I have no idea if this is correct, but it demonstrates the principle). Unless you can read it, the word breaks are not obvious. The equivalent in English would be "thissentenceisenglish": knowing the language, you can figure it out.

The key here is that, unfortunately, you need to know the language to perform word breaking. For electronic implementations, this amounts to requiring a list of all possible words and then performing a fairly complex analysis on possible segmentations. This means that the segmentation is heuristic. An appropriate example for this mailing list would be "macroman" - is the correct segmentation a preprocessor-based hero or a legacy text encoding?
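
To make the ambiguity concrete, here is a minimal sketch (Python; the `segmentations` function and the tiny word list are hypothetical, and real implementations are considerably more sophisticated than this brute-force enumeration) that lists every way of splitting a string into known words:

```python
def segmentations(text: str, vocabulary: set[str]) -> list[list[str]]:
    """Return every split of text into words from vocabulary."""
    if not text:
        # One way to segment the empty string: no words at all.
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in vocabulary:
            # Try every segmentation of the remainder after this prefix.
            for rest in segmentations(text[end:], vocabulary):
                results.append([prefix] + rest)
    return results


vocab = {"mac", "roman", "macro", "man"}
print(segmentations("macroman", vocab))
# [['mac', 'roman'], ['macro', 'man']]
```

Both results are valid against the word list, so something beyond the list itself (frequency data, context, or a plain guess) has to pick a winner.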

The dictionaries to support this tend to be large. Last time I used ICU4C, these dictionaries made up about half of the data library. This is probably not a big deal for most cases, but it is a lot to ask of embedded applications that still have some desire for Unicode transformations.

There is some good news, though: the languages that require dictionaries for word breaking are also caseless. So a "word breaking but only for title casing" algorithm is feasible. Such a thing could be explicitly exposed or recommended as an implementation detail. At the very least, it shouldn't be prohibited as an implementation strategy, since it also has the side benefit of allowing unneeded dictionary-based analysis to be skipped. This is actually what TR29 describes for word boundaries: each ideographic character is treated as a separate word.
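
A minimal sketch of such a dictionary-free title caser (Python; `is_ideograph` and `title_case` are hypothetical names, and the ideograph test is a loud simplification that only checks the basic CJK Unified Ideographs block rather than using real Unicode property data):

```python
def is_ideograph(ch: str) -> bool:
    # Simplifying assumption: only U+4E00..U+9FFF counts as ideographic.
    return "\u4e00" <= ch <= "\u9fff"


def title_case(text: str) -> str:
    """Title-case text, treating each ideograph as its own word."""
    out = []
    at_word_start = True
    for ch in text:
        if is_ideograph(ch) or not ch.isalnum():
            # Ideographs and separators are copied unchanged, and
            # whatever follows them starts a new word.
            out.append(ch)
            at_word_start = True
        else:
            # str.title() on one character applies its titlecase mapping.
            out.append(ch.title() if at_word_start else ch.lower())
            at_word_start = False
    return "".join(out)


print(title_case("hello 中文 world"))
# Hello 中文 World
```

No word list is consulted anywhere: the caseless text passes through untouched, and only the cased runs need boundaries.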

IIRC, the TR29 word boundary rules are provided by the "title" break iterator in ICU4C, while the "word" break iterator uses language-specific dictionaries to better match user expectations of what a word is. This means that if/when we standardise word breaking, we will need to be clear which meaning is intended. My personal opinion would be to follow the ICU naming to reduce user and implementer confusion.

Fraser