sg16: [SG16] P1885 and CHAR

From: Corentin <corentin.jabot_at_[hidden]>
Date: Wed, 20 Oct 2021 09:28:10 +0200

Hello Folks.
People raised concerns that P1885 does not address a specific scenario in
which a platform
* Has CHAR_BIT >= 16
* Uses UTF-16 as one of its encodings.

My initial reaction was to handle this case specifically, and to have it
return unknown.

However, I have realised that there is no non-freestanding platform
that defines CHAR_BIT to be 16 or more, and P1885 is not freestanding
currently.

Furthermore, there is no evidence that the supposedly problematic scenario
is anything but theoretical, even on these freestanding platforms.

As such, I would prefer to say nothing about CHAR_BIT and its relation to
UTF-16.
We may still want to define "encoding scheme", although we don't
define "encoding" afaik.

I am opposed to using the term "encoding form", as this term is neither
defined nor meaningful for non-Unicode encodings, and because it would walk
back on previous SG-16 guidance with no new information besides "We don't
know what to do about DSPs".

I have no idea if I'll be there tonight, being at NDC and all, but here is
what I am willing to consider:

   - Doing nothing, as I believe this is correct for all scenarios, including
   the hypothetical scenario in which a vendor would like to support this
   class on a CHAR_BIT >= 16 platform, where the vendor would be free to
   choose if and how they map to UTF encodings. I am not concerned about
   portability in this case. Please consider the use cases on these platforms,
   or lack thereof.
   - Make the static functions return unknown when CHAR_BIT >= 16.
   - Mandate CHAR_BIT < 16 (the way to implement that would be to make
   the functions deleted on platforms where CHAR_BIT >= 16).
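
For illustration only, the last two options above could be sketched roughly
as follows. This is not wording or code from P1885: the namespace `sketch`,
the enumeration `id`, and the functions `literal_id` and
`literal_id_mandated` are hypothetical names, and a `static_assert` stands
in for the "deleted function" approach so that the sketch compiles on an
ordinary 8-bit host.

```cpp
#include <climits>

namespace sketch {

// Hypothetical subset of encoding identifiers; NOT the P1885 enumeration.
enum class id { unknown, utf8, utf16 };

// Option "return unknown": report `unknown` whenever a byte is 16 bits or
// wider. `char_bit` is a parameter (an assumption of this sketch) only so
// that both branches can be exercised on a host where CHAR_BIT == 8;
// `native` stands for whatever the implementation would otherwise report.
constexpr id literal_id(int char_bit, id native) {
    return char_bit >= 16 ? id::unknown : native;
}

// Option "mandate CHAR_BIT < 16": using the function on a platform with
// wide bytes is a compile-time error. The assertion only fires when the
// template is instantiated, so merely declaring it is harmless.
template <int CharBit = CHAR_BIT>
constexpr id literal_id_mandated(id native) {
    static_assert(CharBit < 16, "not supported when CHAR_BIT >= 16");
    return native;
}

} // namespace sketch
```

Under these assumptions, `sketch::literal_id(16, sketch::id::utf16)` yields
`id::unknown`, while instantiating `sketch::literal_id_mandated<16>` fails
to compile. The "doing nothing" option corresponds to neither function
treating CHAR_BIT specially.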

In addition, we can:

   - Use the term "encoding" instead of "encoding scheme", which I think
   would probably be the sanest direction. Of course we should keep talking
   about "the encoding of the object representation".
   - Provide a definition of "encoding scheme", as long as it is consistent
   with existing definitions.

Things that I am unwilling to consider:

   - Using the term "encoding form", which is Unicode-specific.
   - Trying to pinpoint the semantics of UTF-16 on CHAR_BIT=16 platforms
   without proof of implementation experience.

The paper is here: https://isocpp.org/files/papers/P1885R8.pdf
Jens noted a few issues with the wording in another thread; those will be
addressed later.

In addition, Jens asked whether UCS-2 and UCS-4 should also be explicitly
noted to be in the native endianness instead of big endian (IANA defines
them as being big endian). Since these are deprecated, I do not think a
mention is necessary, but maybe you will want to poll that so that we can
put this to rest.
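
To make the endianness question concrete, here is a minimal sketch of how a
single UCS-2 code unit would be laid out as octets under each byte order.
It assumes 8-bit bytes; the struct `octets` and the function `ucs2_octets`
are hypothetical names for this illustration and are not part of P1885.

```cpp
#include <cstdint>

// Two octets in storage order. Hypothetical helper type for this sketch.
struct octets {
    unsigned char first;
    unsigned char second;
};

// Lay out one UCS-2 code unit either big endian (most significant octet
// first) or little endian. "Native endianness" would mean whichever of the
// two matches the platform.
constexpr octets ucs2_octets(std::uint16_t code_unit, bool big_endian) {
    return big_endian
        ? octets{static_cast<unsigned char>(code_unit >> 8),
                 static_cast<unsigned char>(code_unit & 0xFF)}
        : octets{static_cast<unsigned char>(code_unit & 0xFF),
                 static_cast<unsigned char>(code_unit >> 8)};
}
```

For U+0041, the big-endian layout is the octets 0x00 0x41, and the
little-endian layout is 0x41 0x00, so the two readings only agree on
big-endian platforms.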

Thanks,

Regards,
Corentin

Received on 2021-10-20 02:28:24