Date: Wed, 25 Jun 2025 22:26:16 +0100
On Tue, 24 Jun 2025 at 17:31, Niall Douglas via Liaison <
liaison_at_[hidden]> wrote:
> On 24/06/2025 13:56, Niall Douglas wrote:
>
> > This is what I am asking now - would WG14 like me to write a paper
> exploring the overheads of UTF-8 compatible variable length prefixing of
> variable length octet arrays? I think that for short strings, the overhead
> would be exactly nil, but it will rise as a percentage of total as strings
> get longer before shrinking again.
>
> Seeing as I am unemployed, I went ahead and wrote up that paper. A first
> draft is attached.
>
> See what you all make of it.
You say that the leading octet in UTF-8 has invalid values 0xF5, 0xF6,
0xF7, i.e. 1111'0101, 1111'0110, 1111'0111. The 4-octet form of the
Thompson encoding stores 21 bits, i.e. the range [0, 0x20'0000). It so
happens that Unicode only specifies values in the range [0, 0x11'0000),
which is just a little over 20 bits, and that's why the three values aren't
currently valid UTF-8. But that seems like a somewhat fragile assumption to
bake into the suggested future of the ubiquitous modern string. Who is to
say that the Unicode standard won't want to use values in [0x11'0000,
0x20'0000) at some point? It might not even be for more code points, but
for some other purpose (for example, the current surrogates are not code
points either, yet they take up space in the allowed range). I'm not saying
this is in principle an unacceptable situation, but since your paper
explicitly wants strings to remain distinguishable from UTF-8, this seems
worth considering.
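
For concreteness, here is a small, purely illustrative C sketch (not taken
from your paper; the helper name lead_octet is mine) of the 4-octet
lead-octet arithmetic I have in mind:

#include <stdio.h>

/* Illustrative only: the lead octet a classic 4-octet Thompson/Pike
   sequence would use for a given 21-bit value (11110xxx carries the
   top three bits of the value). */
static unsigned lead_octet(unsigned long cp)
{
    return 0xF0u | (unsigned)(cp >> 18);
}

int main(void)
{
    static const unsigned long samples[] = {
        0x10FFFFul, /* highest Unicode scalar value               */
        0x110000ul, /* first value beyond current Unicode         */
        0x140000ul, /* first value whose lead octet would be 0xF5 */
        0x180000ul, /* first value whose lead octet would be 0xF6 */
        0x1C0000ul, /* first value whose lead octet would be 0xF7 */
        0x1FFFFFul, /* largest 21-bit value                       */
    };
    for (int i = 0; i < (int)(sizeof samples / sizeof samples[0]); ++i)
        printf("U+%06lX -> lead octet 0x%02X\n",
               samples[i], lead_octet(samples[i]));
    return 0;
}

As the output suggests, 0xF5 through 0xF7 only remain free while Unicode
assigns nothing at or above 0x14'0000; values in [0x11'0000, 0x14'0000)
would still share the 0xF4 lead octet with today's last plane.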
Best wishes,
Thomas