Re: [isocpp-wg14/wg21-liaison] In preparation for the Brno meeting

From: Niall Douglas <s_sourceforge_at_[hidden]>
Date: Thu, 26 Jun 2025 10:42:33 +0000
On 25/06/2025 22:26, Thomas Köppe wrote:

> You say that the leading octet in UTF-8 has invalid values 0xF5, 0xF6,
> 0xF7, i.e. 1111'0101, 1111'0110, 1111'0111. The 4-octet form of the
> Thompson encoding stores 21 bits, i.e. the range [0, 0x20'0000). It so
> happens that Unicode only specifies values in the range [0, 0x11'0000),
> which is just a little over 20 bits, and that's why the three values
> aren't currently valid UTF-8. But that seems like a somewhat fragile
> assumption to bake into the suggested future of the ubiquitous modern
> string. Who is to say that the Unicode standard won't want to use values
> in [0x11'0000, 0x20'0000) at some point? Who knows, it might not be for
> more code points, but maybe some other application (e.g. like the
> current surrogates are also not codepoints, but take up space in the
> allowed range). I'm not saying that that's in principle an unacceptable
> situation, but since your paper explicitly wants strings to remain
> distinguishable from UTF-8, this seems worth considering.

I have improved the paper to discuss the potential future use of more
than seventeen planes by Unicode. Thank you for raising it, as it is
non-obvious.

As a tldr, UTF-16 can't encode more than seventeen planes. So I argue it
is safe to assume that UTF-8 never will either, and therefore that the
proposed encoding scheme will never be valid UTF-8.

I have gone into rather more detail in the paper, as I know we all like
detail on this committee.

Niall

Received on 2025-06-26 10:42:35