sg16: Re: [SG16] Agenda for the 2021-10-06 SG16 telecon

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Wed, 6 Oct 2021 09:13:28 +0200

On 06/10/2021 08.30, Hubert Tong via SG16 wrote:
> I have reviewed a version of D1885R8 that was fresh as of a few hours ago.
>
> I appreciate the prose additions. Thank you, Corentin and Jens. Much progress has been made.
>
> Some feedback below:
>
> The prose should not state that the "C" locale is associated with US-ASCII.
> Big5 and Extended_UNIX_Code_Fixed_Width_for_Japanese use fixed-width two-byte sequences. That is not equivalent to using 16-bit encoding units. In particular, correct implementations of the encoding would have lead bytes at lower addresses and the value of the same character would therefore appear as different 16-bit values depending on the endianness of the system.
> The statement that "the size of code units of individual encodings is not exposed by IANA" really ignores RFC 2978 (which still provides the framework for registration), which clearly indicates that all charsets operate on sequences of octets.

Aha. If that's true, then the IANA list wants to represent encoding schemes,
not encoding forms, and the presence of UCS-2, UCS-4 in that list is a category
error, because those encodings do not prescribe an order of octets, I believe.

And that also means the IANA list is not suitable for describing wchar_t
encodings, because wchar_t is agnostic to byte order on the specification
level. (It's just integers.)
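
To illustrate the distinction, here is a minimal C++ sketch, not taken from the paper. It shows that one fixed two-octet sequence turns into different 16-bit integer values depending on how the octets are combined, whereas a wchar_t value is specified as an integer with no octet order attached. The helper names (read_big_endian, read_little_endian, read_native, demo) are hypothetical, the octet pair is an arbitrary lead/trail byte example, and the sketch assumes 8-bit bytes.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

using Octets = std::array<unsigned char, 2>;

// Combine two octets with the first (lead) octet as the high half.
constexpr std::uint16_t read_big_endian(const Octets& b) {
    return static_cast<std::uint16_t>((b[0] << 8) | b[1]);
}

// Combine two octets with the first (lead) octet as the low half.
constexpr std::uint16_t read_little_endian(const Octets& b) {
    return static_cast<std::uint16_t>((b[1] << 8) | b[0]);
}

// Reinterpret the octets through the object representation, which is
// what reading the in-memory sequence as a 16-bit unit amounts to.
inline std::uint16_t read_native(const Octets& b) {
    static_assert(sizeof(std::uint16_t) == 2, "sketch assumes 8-bit bytes");
    std::uint16_t v;
    std::memcpy(&v, b.data(), sizeof v);
    return v;
}

// The same octet sequence yields two different integers; the native
// read matches whichever one corresponds to the host byte order.
inline void demo() {
    const Octets seq{0xA4, 0x40};
    assert(read_big_endian(seq) != read_little_endian(seq));
    const std::uint16_t n = read_native(seq);
    assert(n == read_big_endian(seq) || n == read_little_endian(seq));
}
```

An octet-sequence charset fixes the order of the two bytes, so only on a big-endian host would the 16-bit value have the lead byte in its high half.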

> I think I got my answer from various reflector responses, but: The question I had started a whole thread with was how the octet values were to be extracted from C++ types that are not 8 bits, and I don't think the paper answers that question (it gives the conclusion based on an answer that is assumed and not stated). I think it would be accurate to say that some sort of object-representation based model is the one the paper advocates, so the paper should say "under an object-representation based model, 0-padded forms are distinct from the unpadded encoding".

I don't think an object-representation model is compatible with how we specify
the initialization of wchar_t string literals in the core language.
Maybe that means wide_literal() needs to go, or we specify a source
different from the IANA list for the values of wide_literal().

Hubert, are there actually any wide literal encodings that are not
Unicode-based and that are used for wchar_t? What do implementations
do here?

> To be consistent, the wide EBCDIC example should also include the size in the associated name.
>
> Concern for SG 16 to evaluate:
> The recommended practice re: UTF-16 and UTF-32 is not consistent with getting the correct treatment out of interfaces that attempt to read the wide character data as a byte stream (e.g., iconv) when there are invalid characters in a position to be confused as reverse-from-native-endian BOMs.

We should be clear whether wide_literal() talks about something suitable
as input for iconv, or as a description of how wchar_t integer values
come to be. The former doesn't exist in the standard, but the latter
does, so if we want to support the former, we should avoid any reference
to "wide literal encoding", which is a well-defined core language term.
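
A minimal C++ sketch of the confusion Hubert describes, under stated assumptions: sniff_bom is a hypothetical stand-in for what a byte-stream consumer might do with the first two octets (it is not iconv's actual implementation), and store_little_endian mimics how a little-endian host lays out a 16-bit code unit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class Bom { none, utf16_be, utf16_le };

// Hypothetical byte-stream consumer: inspect the first two octets and
// decide which byte order mark, if any, they spell.
inline Bom sniff_bom(const unsigned char* bytes, std::size_t n) {
    assert(bytes != nullptr || n == 0);
    if (n >= 2) {
        if (bytes[0] == 0xFE && bytes[1] == 0xFF) return Bom::utf16_be;
        if (bytes[0] == 0xFF && bytes[1] == 0xFE) return Bom::utf16_le;
    }
    return Bom::none;
}

// Write a 16-bit code unit the way a little-endian host stores it:
// low octet first, high octet second.
inline void store_little_endian(std::uint16_t unit, unsigned char (&out)[2]) {
    out[0] = static_cast<unsigned char>(unit & 0xFF);
    out[1] = static_cast<unsigned char>(unit >> 8);
}
```

With these helpers, a leading wchar_t value of 0xFFFE (a noncharacter, not a BOM) on a little-endian host is stored as the octets FE FF, which the sniffing function reports as a big-endian BOM. The integer values are well defined on the core language level; the misreading only arises once the data is treated as an octet stream.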

> The recommended practice regarding (non-)use of registered encodings having single byte code units for the description of wide encodings ignores the fact that the UTF-16 encoding scheme has single byte code units (it is the encoding form that has 16-bit code units).

Unicode says that UTF-16 is (also) an encoding scheme where a BOM
defines the byte ordering (with a default if it's absent).
That makes the term "UTF-16" ambiguous, unfortunately.
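
The two meanings can be separated in a short C++ sketch (illustrative only; the function and enum names are hypothetical). The first function implements the encoding form, mapping a code point to 16-bit code units. The second implements an encoding scheme, serializing those code units to octets in a chosen byte order. BOM handling is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encoding form: Unicode scalar value -> sequence of 16-bit code units.
inline std::vector<std::uint16_t> to_utf16_units(std::uint32_t cp) {
    // Precondition: cp is a Unicode scalar value (no surrogates).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        return {static_cast<std::uint16_t>(cp)};
    }
    cp -= 0x10000;
    return {static_cast<std::uint16_t>(0xD800 + (cp >> 10)),
            static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF))};
}

enum class OctetOrder { big, little };

// Encoding scheme: 16-bit code units -> octets in an explicit order.
inline std::vector<unsigned char> serialize_units(
        const std::vector<std::uint16_t>& units, OctetOrder order) {
    std::vector<unsigned char> out;
    for (std::uint16_t u : units) {
        const unsigned char hi = static_cast<unsigned char>(u >> 8);
        const unsigned char lo = static_cast<unsigned char>(u & 0xFF);
        if (order == OctetOrder::big) {
            out.push_back(hi);
            out.push_back(lo);
        } else {
            out.push_back(lo);
            out.push_back(hi);
        }
    }
    return out;
}
```

For U+1F600 the encoding form gives the code units 0xD83D 0xDE00 on every platform. Only the serialization step distinguishes the octets D8 3D DE 00 (big endian) from 3D D8 00 DE (little endian).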

> The recommended practice regarding "byte-order agnostic encodings" presumably means that the appropriately sized C++ types are expected to have the same value for a character on different platforms (regardless of the platform endianness). See above in this note re: why the non-Unicode entries don't really fit.

Yes, we need to talk about that.

> Concern for SG 16 to evaluate:
> COMP_NAME is not agnostic to Unicode normalization differences. I strongly suggest making the name-accepting constructor impose a restriction on characters outside the basic character set.

Yes, please.

> With respect to the prohibition on using id::unknown with `wide_literal`, I ask that it be lifted. There is current implementation practice for a compiler to accept a (host system) locale name whereby `mbstowcs` is used to encode wide literals (and host locales can be user-defined).

Sounds good.
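
For reference, a minimal C++ sketch of the kind of conversion being described: wide values are obtained by running a multibyte string through `mbstowcs` under whatever locale is in effect. The wrapper name widen_with_current_locale is hypothetical, and how a compiler would actually hook this into literal translation is not shown.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Convert a multibyte string to wide characters using the current
// C locale. Returns false if the input contains an invalid sequence.
inline bool widen_with_current_locale(const char* narrow, std::wstring& out) {
    assert(narrow != nullptr);
    // The wide result never has more elements than the input has bytes,
    // so strlen + 1 (for the terminator) is a sufficient buffer size.
    std::vector<wchar_t> buf(std::strlen(narrow) + 1);
    const std::size_t n = std::mbstowcs(buf.data(), narrow, buf.size());
    if (n == static_cast<std::size_t>(-1)) {
        return false;
    }
    out.assign(buf.data(), n);
    return true;
}
```

The resulting wchar_t values depend entirely on the locale, which may be user-defined and need not correspond to any registered encoding; that is the case id::unknown would cover.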

Jens

> On Fri, Oct 1, 2021 at 1:40 PM Tom Honermann via SG16 <sg16_at_[hidden]> wrote:
>
> *Please note that there has been a schedule change.* The previously scheduled telecon for 2021-10-13 has been moved earlier to 2021-10-06. This change was made to accommodate schedule restrictions for the author of the two papers on the agenda below. The shared calendar has been updated (which triggered the sending of new meeting invitations).
>
> SG16 will hold a telecon on Wednesday, October *6th* (not the 13th) at 19:30 UTC (timezone conversion <https://www.timeanddate.com/worldclock/converter.html?iso=20211006T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>).
>
> The agenda is:
>
> * D2460R0: UTF-16 is standard practice <https://isocpp.org/files/papers/D2460R0.pdf>
> * D1885R8: Naming Text Encodings to Demystify Them <https://isocpp.org/files/papers/D1885R8.pdf>
>     o Discuss and poll issues recently raised on the LEWG and SG16 mailing lists.
>
> D2460 is first on the agenda because establishing consensus on it will reduce complications for P1885. We'll plan to spend 30 minutes on D2460 and the remainder of our time on P1885.
>
> D2460R0 seeks to address SG16 issue 9 <https://github.com/sg16-unicode/sg16/issues/9> (Requiring wchar_t to represent all members of the execution wide character set does not match existing practice). Please read through the comments in that issue.
>
> P1885 is back on the agenda to discuss issues raised on the LEWG and SG16 mailing lists. The relevant email threads are linked below; there have been a lot.
>
> * SG16: Feedback re: P1885R5: Naming Text Encodings <https://lists.isocpp.org/sg16/2021/07/2490.php>
>     o Naming issues (to be deferred to LEWG):
>         + "mib" vs "mib_enum" vs something else.
>         + Preservation of the "cs" prefix
> * SG16: P1885: Naming text encodings: Curation and provenance of aliases <https://lists.isocpp.org/sg16/2021/09/2564.php>
>     o Implementation lenience with regard to registered aliases.
>     o Ambiguities between encoding "standards".
> * SG16: P1885: Naming text encodings: Encodings in the environment versus registered character sets <https://lists.isocpp.org/sg16/2021/09/2579.php>
>     o Latitude for implementations to consider slightly divergent encodings a match for an IANA registered character set.
>     o Latitude for use of encodings such as UTF-8 with wchar_t elements.
>     o Whether the IANA registry constitutes a sufficient source of identified encodings.
> * SG16: P1885: Naming text encodings: problem+solution re: charsets, octets, and wide encodings <https://lists.isocpp.org/sg16/2021/09/2584.php>
>     o Encoding schemes vs encoding forms and how to map the IANA registry to encodings in C++.
>     o Whether the IANA registry is fit for all the purposes for which it is being employed.
> * SG16: P1885 polling <https://lists.isocpp.org/sg16/2021/09/2633.php>
>     o Relevance of IANA specified encodings to wide literal encoding.
>     o Tagging of big endian vs little endian.
> * LEWG: P1885: Text encoding aliases() wording suggestion <https://lists.isocpp.org/lib-ext/2021/08/19633.php>
>     o Wording recommendations courtesy of Tomasz.
> * LEWG: P1885: Naming text encodings: R7 wording feedback <https://lists.isocpp.org/lib-ext/2021/09/20198.php>
>     o Requirements on encoding names.
> * LEWG: New P1885 revision, LEWG feedback applied <https://lists.isocpp.org/lib-ext/2021/09/19963.php>
>     o Discussion largely captured in the threads linked above.
>
> The above threads probe fundamental concerns about the IANA registry and the goals that P1885 strives to fulfill. It probably isn't realistic to expect to resolve them all in a single telecon. Given the amount of discussion that has taken place and the possible perspectives offered, I'm no longer confident that we have a shared deep understanding of the design and intent. Specific points I want to cover include the following.
>
> * Is the IANA registry sufficient and appropriate for the identification of both the ordinary and wide literal encodings?
> * How is the IANA registry intended to be applied? Which IANA encoding would be considered a match for each of the following cases?
>     o Wide literal encoding is UTF-16, sizeof(wchar_t) is 2, CHAR_BIT is >= 8, little endian architecture.
>     o Wide literal encoding is UTF-16, sizeof(wchar_t) is 1, CHAR_BIT is >= 16, architecture endianness is irrelevant since code units are a single byte.
>     o Wide literal encoding is UTF-16LE, sizeof(wchar_t) is 1, CHAR_BIT is >= 8, architecture endianness is irrelevant since code units are a single byte.
> * How are conflicts between the IANA registered encoding names and other names recognized by implementations to be resolved?
>
> Please feel free to suggest other topics.
>
> Tom.

Received on 2021-10-06 02:13:36