Date: Wed, 6 Oct 2021 15:33:41 -0400
On 10/6/21 3:15 PM, Hubert Tong wrote:
> Tom: UTF-16 as 8-byte code units is not a valid narrow or wide
> encoding due to 0-valued code units that are not NUL.
I'm missing context for this. Perhaps you meant 8-bit code units?
I think I see what you are getting at though. UTF-16BE/LE will produce
bytes with a value of 0 that do not correspond to the null character.
Cool, UTF-16BE/LE are right out for char and wchar_t when
sizeof(wchar_t) is 1.
Tom.
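
To make the zero-byte point concrete, here is a minimal sketch, assuming 8-bit bytes. The helpers encode_utf16be and narrow_length are hypothetical names introduced only for illustration; they are not from any proposal or library. The sketch serializes one code point as UTF-16BE bytes and then measures it the way a NUL-terminated narrow string API would.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: serialize one Unicode scalar value as UTF-16BE,
// one 8-bit byte per char (assumes CHAR_BIT == 8).
std::string encode_utf16be(char32_t cp) {
    // Reject values that are not Unicode scalar values.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    auto put_unit = [&out](char32_t unit) {
        out.push_back(static_cast<char>((unit >> 8) & 0xFF)); // high byte first
        out.push_back(static_cast<char>(unit & 0xFF));        // then low byte
    };
    if (cp < 0x10000) {
        put_unit(cp); // BMP: a single 16-bit code unit
    } else {
        char32_t v = cp - 0x10000; // supplementary: surrogate pair
        put_unit(0xD800 + (v >> 10));
        put_unit(0xDC00 + (v & 0x3FF));
    }
    return out;
}

// Hypothetical helper: the length a NUL-terminated narrow API would report.
std::size_t narrow_length(const std::string& bytes) {
    return std::strlen(bytes.c_str());
}
```

With these helpers, encode_utf16be(U'A') yields the two bytes 0x00 0x41, and narrow_length reports 0 for it because the leading byte is zero even though no null character was encoded. The little-endian form has the same problem one byte later.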
>
> Jens: There are three major classes of non-Unicode wchar_t types I know
> of: padded single-byte values, EBCDIC DBCS, and ISO 2022 with planes
> linearly encoded at offsets; these are all ambiguously named due to
> the historic confusion between coded character sets versus character
> encoding schemes.
>
> General comment: I think a reinterpret_cast + iconv model is the most
> consistent with the IANA definitions and some intended uses of the
> facility (even if iconv implementations aren't there yet).
>
> Corentin: I'm still not getting what the "byte-order agnostic"
> verbiage is trying to say.
>
> re: "UTF-16" endianness, if we can take the iconv use case into
> account, then we don't want iconv to do BOM-based endianness detection
> and various iconv implementations I tried respect BOMs when told to
> process "UTF-16" input (and output a native-endian BOM for "UTF-16"
> output).
>
> On Wed, Oct 6, 2021 at 12:07 PM Jens Maurer via SG16
> <sg16_at_[hidden]> wrote:
>
> On 06/10/2021 17.24, Tom Honermann wrote:
> > I disagree with that, at least in general. A UTF-16 code unit
> > fits in a single byte when CHAR_BIT is >= 16.
>
> Agreed. We should keep that in mind, but I hope the result there
> will be obvious once we sort out wchar_t.
>
> >> wchar_t is suitable to represent any encoding that represents a
> >> character in N bytes (or a sequence of N bytes), for N =
> >> sizeof(wchar_t)/CHAR_BITS
> > Once we lift the restriction in [basic.fundamental]p8
> > <http://eel.is/c++draft/basic.fundamental#8>, yes.
>
> No, the statement about wchar_t is true today; it says "represent
> a character", not "a code unit".
>
> Jens
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
Received on 2021-10-06 14:33:44