On 24/06/2025 13:56, Niall Douglas wrote:
> This is what I am asking now - would WG14 like me to write a paper exploring the overheads of UTF-8 compatible variable length prefixing of variable length octet arrays? I think that for short strings, the overhead would be exactly nil, but it will rise as a percentage of total as strings get longer before shrinking again.
Seeing as I am unemployed, I went ahead and wrote up that paper. A first draft is attached.
See what you all make of it.
You say that the leading octet in UTF-8 has invalid values 0xF5, 0xF6, 0xF7, i.e. 1111'0101, 1111'0110, 1111'0111. The 4-octet form of the Thompson encoding stores 21 bits, i.e. the range [0, 0x20'0000). It so happens that Unicode only assigns values in the range [0, 0x11'0000), which is just over 20 bits, and that is why those three lead octets are not currently valid UTF-8.

But that seems like a somewhat fragile assumption to bake into the suggested future of the ubiquitous modern string. Who is to say that the Unicode standard won't want to use values in [0x11'0000, 0x20'0000) at some point? It might not even be for more code points: some other application could claim the range, much as the current surrogates are not code points yet still occupy space in the allowed values. I'm not saying that is in principle an unacceptable situation, but since your paper explicitly wants these strings to remain distinguishable from UTF-8, it seems worth considering.
Best wishes,
Thomas