sg16: Re: [SG16] Agenda for the 2021-10-06 SG16 telecon

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Wed, 6 Oct 2021 15:36:10 -0400

On Wed, Oct 6, 2021 at 3:33 PM Tom Honermann <tom_at_[hidden]> wrote:

> On 10/6/21 3:15 PM, Hubert Tong wrote:
>
> Tom: UTF-16 as 8-byte code units is not a valid narrow or wide encoding
> due to 0-valued code units that are not NUL.
>
> I'm missing context for this. Perhaps you meant 8-bit code units?
>
Yes :facepalm:

> I think I see what you are getting at though. UTF-16BE/LE will encode
> bytes with a value of 0 that do not correspond to the null character.
> Cool, UTF-16BE/LE are right out for char and wchar_t when sizeof(wchar_t)
> is 1.
>
> Tom.
>
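To make the 0-valued code unit problem concrete, here is a minimal sketch (not from the original discussion). It stores "AB" as UTF-16BE and UTF-16LE bytes in plain char arrays and compares what a narrow-string function sees with the real encoded length. The names utf16be_AB, utf16le_AB, narrow_length and utf16_byte_length are made up for illustration, and the sketch assumes CHAR_BIT == 8.

```cpp
#include <cstddef>
#include <cstring>

// "AB" encoded as UTF-16 and stored in 8-bit code units (assumes CHAR_BIT == 8).
// The final two zero bytes are the encoded U+0000 terminator.
static const char utf16be_AB[] = {0x00, 0x41, 0x00, 0x42, 0x00, 0x00};
static const char utf16le_AB[] = {0x41, 0x00, 0x42, 0x00, 0x00, 0x00};

// Length as seen by a narrow-string function, which treats the first
// 0-valued code unit as the terminating null character.
inline std::size_t narrow_length(const char* s) {
    return std::strlen(s);
}

// Number of bytes before the actual UTF-16 terminator (a 16-bit zero).
inline std::size_t utf16_byte_length(const char* s) {
    std::size_t n = 0;
    while (s[n] != 0 || s[n + 1] != 0) {
        n += 2;  // advance one 16-bit code unit at a time
    }
    return n;
}
```

With these inputs, narrow_length stops at the first zero byte (immediately for the big-endian array, after one byte for the little-endian one), while utf16_byte_length reports 4 bytes for both.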
>
> Jens: There are three major classes of non-Unicode wchar_t types I know of:
> padded single-byte values, EBCDIC DBCS, and ISO 2022 with planes linearly
> encoded at offsets; these are all ambiguously named due to the historic
> confusion between coded character sets versus character encoding schemes.
>
> General comment: I think a reinterpret_cast + iconv model is the most
> consistent with the IANA definitions and some intended uses of the facility
> (even if iconv implementations aren't there yet).
>
> Corentin: I'm still not getting what the "byte-order agnostic" verbiage is
> trying to say.
>
> re: "UTF-16" endianness, if we can take the iconv use case into account,
> then we don't want iconv to do BOM-based endianness detection and various
> iconv implementations I tried respect BOMs when told to process "UTF-16"
> input (and output a native-endian BOM for "UTF-16" output).
>
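As a rough illustration of what "BOM-based endianness detection" means for input labelled "UTF-16", here is a minimal sketch. It is not how any particular iconv implementation works; ByteOrder and detect_utf16_bom are hypothetical names introduced only for this example.

```cpp
#include <cstddef>

// Result of inspecting the start of a byte stream labelled "UTF-16".
enum class ByteOrder { Big, Little, Unspecified };

// Hypothetical helper: look for U+FEFF (the byte order mark) in the first
// two bytes. FE FF indicates big-endian, FF FE indicates little-endian.
// Without a BOM the byte order is left to the caller's default.
inline ByteOrder detect_utf16_bom(const unsigned char* data, std::size_t size) {
    if (size >= 2) {
        if (data[0] == 0xFE && data[1] == 0xFF) return ByteOrder::Big;
        if (data[0] == 0xFF && data[1] == 0xFE) return ByteOrder::Little;
    }
    return ByteOrder::Unspecified;
}
```

A converter that does this kind of detection reads the same bytes differently depending on their first two bytes, which is the behaviour at issue when deciding what the name "UTF-16" should imply.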
> On Wed, Oct 6, 2021 at 12:07 PM Jens Maurer via SG16 <
> sg16_at_[hidden]> wrote:
>
>> On 06/10/2021 17.24, Tom Honermann wrote:
>> > I disagree with that, at least in general. A UTF-16 code unit fits in a
>> > single byte when CHAR_BIT is >= 16.
>>
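The CHAR_BIT condition above can be written down as compile-time constants. This is only a sketch; the names utf16_code_unit_fits_in_byte and bytes_per_utf16_code_unit are invented for illustration.

```cpp
#include <climits>

// True when one byte (one char) is wide enough to hold a 16-bit code unit.
constexpr bool utf16_code_unit_fits_in_byte = CHAR_BIT >= 16;

// Bytes needed to hold one UTF-16 code unit on this implementation,
// rounding up when 16 is not a multiple of CHAR_BIT.
constexpr int bytes_per_utf16_code_unit = (16 + CHAR_BIT - 1) / CHAR_BIT;
```

On an implementation with 8-bit bytes the first constant is false and the second is 2; with 16-bit or wider bytes they are true and 1.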
>> Agreed. We should keep that in mind, but I hope the result there will be
>> obvious once we sort out wchar_t.
>>
>> >> wchar_t is suitable to represent any encoding that represents a
>> >> character in N bytes (or a sequence of N bytes), for N =
>> >> sizeof(wchar_t)/CHAR_BITS
>> > Once we lift the restriction in [basic.fundamental]p8 <
>> > http://eel.is/c++draft/basic.fundamental#8>, yes.
>>
>> No, the statement about wchar_t is true today; it says "represent a
>> character", not "a code unit".
>>
>> Jens
>>
>
>

Received on 2021-10-06 14:36:39