Tom: UTF-16 as 8-bit code units is not a valid narrow or wide encoding, because it contains 0-valued code units that are not NUL.
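
To illustrate the point about 0-valued code units, here is a minimal sketch, not a definitive implementation. It serializes UTF-16 code units into byte-sized units and then measures what a NUL-terminated narrow-string API would see. The helper names `utf16_to_bytes_le` and `visible_length` are hypothetical, and little-endian order is an arbitrary choice for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: serialize UTF-16 code units into byte-sized units,
// low byte first (little-endian chosen arbitrarily for this sketch).
std::string utf16_to_bytes_le(const std::u16string& s) {
    std::string out;
    for (char16_t cu : s) {
        out.push_back(static_cast<char>(cu & 0xFF));
        out.push_back(static_cast<char>((cu >> 8) & 0xFF));
    }
    assert(out.size() == 2 * s.size());  // two bytes per code unit
    return out;
}

// Hypothetical helper: the length a NUL-terminated narrow-string API sees.
std::size_t visible_length(const std::string& bytes) {
    return std::strlen(bytes.c_str());
}
```

For the input u"AB" the serialized form holds four bytes, but the second byte is 0, so `visible_length` stops after the first byte even though that 0 is not a NUL character.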

Jens: There are three major classes of non-Unicode wchar_t types I know of: padded single-byte values, EBCDIC DBCS, and ISO 2022 with planes linearly encoded at offsets. These are all ambiguously named due to the historic confusion between coded character sets and character encoding schemes.

General comment: I think a reinterpret_cast + iconv model is the most consistent with the IANA definitions and some intended uses of the facility (even if iconv implementations aren't there yet).

Corentin: I'm still not getting what the "byte-order agnostic" verbiage is trying to say.

re: "UTF-16" endianness: if we take the iconv use case into account, then we don't want iconv to do BOM-based endianness detection. However, the various iconv implementations I tried respect BOMs when told to process "UTF-16" input (and output a native-endian BOM for "UTF-16" output).
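
One way to check the output side of this on a given system is sketched below, assuming a POSIX iconv is available (some platforms also need to link against libiconv). The struct `Utf16Probe` and the function `probe_utf16_output` are hypothetical names for this example. The result depends on the implementation, so no particular output is claimed here.

```cpp
#include <iconv.h>
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch.
struct Utf16Probe {
    bool ok = false;                   // true if the conversion completed
    std::vector<unsigned char> bytes;  // raw bytes emitted for the "UTF-16" target
    bool starts_with_bom = false;      // true if output begins with FF FE or FE FF
};

// Convert the single character "A" from UTF-8 to the unsuffixed "UTF-16"
// encoding name and record what the local iconv implementation emitted.
Utf16Probe probe_utf16_output() {
    Utf16Probe r;
    iconv_t cd = iconv_open("UTF-16", "UTF-8");
    if (cd == (iconv_t)-1) return r;  // encoding name not supported here

    char in[] = "A";
    char out[16] = {};
    char* inp = in;
    char* outp = out;
    std::size_t inleft = 1;
    std::size_t outleft = sizeof out;

    std::size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1)) return r;  // conversion failed

    const unsigned char* u = reinterpret_cast<const unsigned char*>(out);
    r.bytes.assign(u, u + (sizeof out - outleft));
    r.starts_with_bom = r.bytes.size() >= 2 &&
        ((r.bytes[0] == 0xFF && r.bytes[1] == 0xFE) ||
         (r.bytes[0] == 0xFE && r.bytes[1] == 0xFF));
    r.ok = true;
    return r;
}
```

If `starts_with_bom` is true, the implementation prepended a BOM for the unsuffixed name, and the order of the two BOM bytes shows which endianness it picked.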

On Wed, Oct 6, 2021 at 12:07 PM Jens Maurer via SG16 <sg16@lists.isocpp.org> wrote:

On 06/10/2021 17.24, Tom Honermann wrote:
> I disagree with that, at least in general. A UTF-16 code unit fits in a single byte when CHAR_BIT is >= 16.

Agreed. We should keep that in mind, but I hope the result there will be obvious
once we sort out wchar_t.
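
As a small sketch of the CHAR_BIT condition in the quoted remark, the check can be written as a compile-time constant. The name `utf16_code_unit_fits_in_byte` is hypothetical; on the common platforms with 8-bit bytes it evaluates to false.

```cpp
#include <climits>
#include <limits>

// True only when a single byte (unsigned char) can hold every UTF-16
// code unit value, that is, every value up to 0xFFFF.
constexpr bool utf16_code_unit_fits_in_byte =
    std::numeric_limits<unsigned char>::max() >= 0xFFFF;

// The same condition expressed through CHAR_BIT, as in the quoted message.
static_assert(utf16_code_unit_fits_in_byte == (CHAR_BIT >= 16),
              "both formulations of the check should agree");
```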

>> wchar_t is suitable to represent any encoding that represents a character in N bytes (or a sequence of N bytes), for N = sizeof(wchar_t)/CHAR_BIT
> Once we lift the restriction in [basic.fundamental]p8 <http://eel.is/c++draft/basic.fundamental#8>, yes.

No, the statement about wchar_t is true today; it says "represent a character",
not "a code unit".

Jens
--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16