sg16

Title casing & word breaking

From: Fraser Gordon <fraserjgordon+cpp_at_[hidden]>
Date: Fri, 24 Feb 2023 14:16:48 -0500

Hi folks,

There was a side conversation in Wednesday's telecon about the types of
casing and segmentation/clusterization operations that we might consider
standardising. Since I have implementation experience in this area, I
wanted to offer a warning/suggestion about two operations that sound
simple: word breaking and title casing (which depends on word breaking).

For alphabetic writing systems, word breaking is pretty much what you'd
expect: keep going until you find whitespace, punctuation, or any other
code point that doesn't belong in a word. Naturally, it's a bit more
complicated than this, but not too bad.
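
As a rough illustration of the "keep going until you hit a non-word
character" approach, here is a minimal ASCII-only sketch. The function name
naive_words is hypothetical and this is not the TR29 algorithm; it only
shows the basic idea and where it starts to fall short.

```cpp
#include <cctype>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: collects runs of ASCII letters and digits as words.
// Everything else (whitespace, punctuation, non-ASCII bytes) ends a word.
std::vector<std::string> naive_words(std::string_view text) {
    std::vector<std::string> words;
    std::string current;
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) {
            current.push_back(static_cast<char>(ch));
        } else if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        words.push_back(current);  // flush the trailing word
    }
    return words;
}
```

Calling naive_words("This sentence is English.") yields the four expected
words, but "don't" is split in two at the apostrophe, which is the kind of
complication the real rules have to handle.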

However, some non-alphabetic writing systems (like Chinese) are often
written with no word separators. The reader of the language is expected to
understand where the word breaks would fall based on plausible readings.

To demonstrate, I entered "This sentence is Chinese." into Google Translate
and it provided "这是的句话是中文。" (I have no idea if this is correct but it
demonstrates the principle). Unless you can read it, the word breaks are
not obvious. The equivalent in English would be "thissentenceisenglish" -
knowing the language, you can figure it out.

The key here is that, unfortunately, you need to know the language to
perform word breaking. For electronic implementations, this amounts to
requiring a list of all possible words and then performing a fairly complex
analysis on possible segmentations. This means that the segmentation is
heuristic. An appropriate example for this mailing list would be "macroman"
- is the correct segmentation a preprocessor-based hero or a legacy text
encoding?
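
To make the ambiguity concrete, the sketch below enumerates every way to
split a string into dictionary words. The names Segmentation, collect and
all_segmentations are hypothetical, and the three-word dictionary is made
up for the example; real implementations use far larger dictionaries and
then have to rank the candidates, which is where the heuristics come in.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

using Segmentation = std::vector<std::string>;

// Recursively tries every dictionary word that matches at position `start`.
void collect(const std::string& text, std::size_t start,
             const std::set<std::string>& dictionary,
             Segmentation& current, std::vector<Segmentation>& out) {
    if (start == text.size()) {
        out.push_back(current);  // consumed the whole input: one valid split
        return;
    }
    for (std::size_t len = 1; start + len <= text.size(); ++len) {
        const std::string candidate = text.substr(start, len);
        if (dictionary.count(candidate) != 0) {
            current.push_back(candidate);
            collect(text, start + len, dictionary, current, out);
            current.pop_back();
        }
    }
}

// Hypothetical helper: returns all segmentations of `text` into words
// from `dictionary`, shorter first words before longer ones.
std::vector<Segmentation> all_segmentations(
        const std::string& text, const std::set<std::string>& dictionary) {
    std::vector<Segmentation> out;
    Segmentation current;
    collect(text, 0, dictionary, current, out);
    return out;
}
```

With the dictionary {"macro", "man", "macroman"}, the input "macroman" has
two valid segmentations ("macro" + "man" and "macroman"), and nothing in
the dictionary alone says which one the writer meant.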

The dictionaries to support this tend to be large. Last time I used ICU4C,
these dictionaries made up about half of the data library. This is probably
not a big deal for most cases but is a lot to ask for embedded applications
that still have some desire for Unicode transformations.

There is some good news though - the languages that require dictionaries
for word breaking are also caseless. So a "word breaking but only for
title-casing" algorithm is feasible. Such a thing could be explicitly
exposed or recommended as an implementation detail. It shouldn't be
prohibited as an implementation strategy, at least (as it'd also have a
side benefit of allowing unneeded dictionary-based analysis to be skipped).
This is actually what TR29
<https://unicode.org/reports/tr29/#Word_Boundaries> describes for word
boundaries - each ideographic character is treated as a separate word.
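
A minimal sketch of why title casing can get away without a dictionary is
shown below. It is ASCII-only for the casing part, the function names are
hypothetical, and the boundary handling is a drastic simplification of
TR29 (letters form words, apostrophes and digits do not end a word, and
everything else, including each ideograph, is a boundary). Because the
ideographs have no case, they pass through unchanged no matter where the
"real" word breaks among them would fall.

```cpp
#include <string>

// ASCII-only helpers; a real implementation would use Unicode case data.
bool is_ascii_letter(char32_t c) {
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z');
}
char32_t ascii_upper(char32_t c) {
    return (c >= U'a' && c <= U'z') ? c - (U'a' - U'A') : c;
}
char32_t ascii_lower(char32_t c) {
    return (c >= U'A' && c <= U'Z') ? c + (U'a' - U'A') : c;
}

// Hypothetical sketch: uppercases the first letter of each word and
// lowercases the rest, with no dictionary lookups at all.
std::u32string title_case_sketch(std::u32string text) {
    bool at_word_start = true;
    for (char32_t& c : text) {
        if (is_ascii_letter(c)) {
            c = at_word_start ? ascii_upper(c) : ascii_lower(c);
            at_word_start = false;
        } else if (c == U'\'' || (c >= U'0' && c <= U'9')) {
            // Apostrophes and digits neither start nor end a word here.
        } else {
            // Whitespace, punctuation, ideographs and anything else act
            // as a boundary and are copied through untouched.
            at_word_start = true;
        }
    }
    return text;
}
```

For example, title_case_sketch(U"this is \u4E2D\u6587 text") gives
U"This Is \u4E2D\u6587 Text": the two ideographs are left alone and the
ASCII words around them are cased as usual.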

IIRC, the TR29 word boundary rules are provided by the "title" break
iterator in ICU4C, while the "word" break iterator uses language-specific
dictionaries to better match user expectations of what a word is. This
means that if/when we standardise word breaking, we will need
to be clear which meaning is intended. My personal opinion would be to
follow the ICU naming to reduce user & implementer confusion.

Fraser

Received on 2023-02-24 19:17:00