Date: Wed, 6 Oct 2021 15:15:20 -0400
Tom: UTF-16 as 8-bit code units is not a valid narrow or wide encoding due
to 0-valued code units that are not NUL.
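
To illustrate the point about 0-valued code units, here is a minimal sketch, assuming CHAR_BIT == 8 and a big-endian serialization. The helper names to_utf16be_bytes and narrow_length are hypothetical and not part of any proposal. It shows that ASCII-range text serialized as UTF-16 bytes contains zero bytes that are not terminators, so a NUL-terminated narrow-string API truncates it.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: serialize UTF-16 code units to bytes, most
// significant byte first (UTF-16BE). Assumes 8-bit bytes.
std::vector<unsigned char> to_utf16be_bytes(const std::u16string& s) {
    std::vector<unsigned char> out;
    out.reserve(s.size() * 2);
    for (char16_t cu : s) {
        out.push_back(static_cast<unsigned char>(cu >> 8));
        out.push_back(static_cast<unsigned char>(cu & 0xFF));
    }
    return out;
}

// Hypothetical helper: the length a NUL-terminated narrow-string API
// would report for the byte sequence (a terminator is appended first).
std::size_t narrow_length(std::vector<unsigned char> bytes) {
    bytes.push_back(0);
    return std::strlen(reinterpret_cast<const char*>(bytes.data()));
}
```

For u"Hi" the serialized bytes are 00 48 00 69, so narrow_length reports 0 even though four bytes of text are present.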
Jens: There are three major classes of non-Unicode wchar_t types I know of:
padded single-byte values, EBCDIC DBCS, and ISO 2022 with planes linearly
encoded at offsets; these are all ambiguously named due to the historic
confusion between coded character sets and character encoding schemes.
General comment: I think a reinterpret_cast + iconv model is the most
consistent with the IANA definitions and some intended uses of the facility
(even if iconv implementations aren't there yet).
Corentin: I'm still not getting what the "byte-order agnostic" verbiage is
trying to say.
re: "UTF-16" endianness, if we can take the iconv use case into account,
then we don't want iconv to do BOM-based endianness detection, and various
iconv implementations I tried respect BOMs when told to process "UTF-16"
input (and output a native-endian BOM for "UTF-16" output).
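
For readers unfamiliar with what BOM-based endianness detection involves, here is a minimal sketch, assuming 8-bit bytes. The names Utf16Order and detect_utf16_bom are hypothetical, and this is not the logic of any particular iconv implementation. It only inspects the first two bytes for an encoded U+FEFF.

```cpp
#include <cstddef>

// Hypothetical result type for the sketch below.
enum class Utf16Order { Big, Little, Unmarked };

// Look for a byte order mark (U+FEFF) at the start of the input.
// FE FF indicates big-endian; FF FE indicates little-endian.
// Anything else, including input shorter than two bytes, is unmarked.
Utf16Order detect_utf16_bom(const unsigned char* data, std::size_t size) {
    if (size >= 2) {
        if (data[0] == 0xFE && data[1] == 0xFF) return Utf16Order::Big;
        if (data[0] == 0xFF && data[1] == 0xFE) return Utf16Order::Little;
    }
    return Utf16Order::Unmarked;
}
```

What a converter should do in the Unmarked case is exactly the kind of policy question raised above; the sketch deliberately leaves it to the caller.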
On Wed, Oct 6, 2021 at 12:07 PM Jens Maurer via SG16 <sg16_at_[hidden]>
wrote:
> On 06/10/2021 17.24, Tom Honermann wrote:
> > I disagree with that, at least in general. A UTF-16 code unit fits in a
> > single byte when CHAR_BIT is >= 16.
>
> Agreed. We should keep that in mind, but I hope the result there will be
> obvious once we sort out wchar_t.
>
> >> wchar_t is suitable to represent any encoding that represents a
> >> character in N bytes (or a sequence of N bytes), for N =
> >> sizeof(wchar_t)/CHAR_BIT
> > Once we lift the restriction in [basic.fundamental]p8
> > <http://eel.is/c++draft/basic.fundamental#8>, yes.
>
> No, the statement about wchar_t is true today; it says "represent a
> character", not "a code unit".
>
> Jens
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
Received on 2021-10-06 14:15:49