sg16: Re: [SG16] Agenda for the 2021-10-06 SG16 telecon

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Wed, 6 Oct 2021 11:35:35 +0200

On Wed, Oct 6, 2021 at 8:31 AM Hubert Tong via SG16 <sg16_at_[hidden]>
wrote:

> I have reviewed a version of D1885R8 that was fresh as of a few hours ago.
>
> I appreciate the prose additions. Thank you, Corentin and Jens. Much
> progress has been made.
>
> Some feedback below:
>
> The prose should not state that the "C" locale is associated with US-ASCII.
>
> Big5 and Extended_UNIX_Code_Fixed_Width_for_Japanese use fixed-width
> two-byte sequences. That is not equivalent to using 16-bit encoding units.
> In particular, correct implementations of the encoding would have lead
> bytes at lower addresses and the value of the same character would
> therefore appear as different 16-bit values depending on the endianness of
> the system.
>
> The statement that "the size of code units of individual encodings is not
> exposed by IANA" ignores RFC 2978, which still provides the framework for
> registration and clearly indicates that all charsets operate on sequences
> of octets.
>
> I think I got my answer from various reflector responses, but the question
> I started a whole thread with was how the octet values are to be extracted
> from C++ types that are not 8 bits wide, and I don't think the paper
> answers that question (it gives a conclusion based on an answer that is
> assumed and not stated). I think it would be accurate to say that the paper
> advocates some sort of object-representation based model, so the paper
> should say "under an object-representation based model, 0-padded forms are
> distinct from the unpadded encoding".
>

I am happy to add that wording.
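
To make the two-byte-sequence point quoted above concrete, here is a minimal
sketch. The type and function names are hypothetical, the octet values
0xA4 0x40 are only illustrative, and 8-bit bytes are assumed. It shows the same
lead/trail octet pair producing different 16-bit values depending on the byte
order used to load it:

```cpp
#include <cstdint>

// One character of a fixed-width two-octet encoding, in address order:
// the lead octet sits at the lower address. (Hypothetical helper type.)
struct TwoOctets {
    unsigned char lead;
    unsigned char trail;
};

// Value observed when both octets are loaded as one 16-bit unit on a
// big-endian system: the lower address holds the high-order bits.
constexpr std::uint16_t load_big_endian(TwoOctets c) {
    return static_cast<std::uint16_t>((c.lead << 8) | c.trail);
}

// Value observed on a little-endian system: the lower address holds the
// low-order bits, so lead and trail swap roles.
constexpr std::uint16_t load_little_endian(TwoOctets c) {
    return static_cast<std::uint16_t>((c.trail << 8) | c.lead);
}

// Illustrative octet pair; not taken from the paper.
constexpr TwoOctets sample{0xA4, 0x40};

static_assert(load_big_endian(sample) == 0xA440, "lead octet is high-order");
static_assert(load_little_endian(sample) == 0x40A4, "lead octet is low-order");
```

The octet sequence is identical in both cases; only the 16-bit reading of it
changes, which is why a two-byte sequence is not the same thing as a 16-bit
code unit.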

> To be consistent, the wide EBCDIC example should also include the size in
> the associated name.
>

> Concern for SG 16 to evaluate:
> The recommended practice re: UTF-16 and UTF-32 is not consistent with
> getting the correct treatment out of interfaces that attempt to read the
> wide character data as a byte stream (e.g., iconv) when there are invalid
> characters in a position to be confused as reverse-from-native-endian BOMs.
>

UTF-16 is synonymous with either UTF-16BE or UTF-16LE, depending on the platform.
The endianness is implied by the platform, not by text_encoding.
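
For reference, a minimal sketch of the situation under discussion, assuming
8-bit bytes and a 16-bit code unit. ByteOrder, Octets, object_representation,
and looks_like_big_endian_bom are hypothetical names, not part of P1885. The
code unit value is the same on every platform; only its object representation
differs, and that is what a byte-stream consumer sees:

```cpp
enum class ByteOrder { little, big };

// The two octets of a 16-bit code unit, in increasing address order.
struct Octets {
    unsigned char first;
    unsigned char second;
};

// Octets that a platform with the given byte order stores for one code unit.
constexpr Octets object_representation(char16_t unit, ByteOrder order) {
    return order == ByteOrder::little
        ? Octets{static_cast<unsigned char>(unit & 0xFF),
                 static_cast<unsigned char>(unit >> 8)}
        : Octets{static_cast<unsigned char>(unit >> 8),
                 static_cast<unsigned char>(unit & 0xFF)};
}

// True if the octets, read as a byte stream, match the big-endian
// byte order mark FE FF.
constexpr bool looks_like_big_endian_bom(Octets o) {
    return o.first == 0xFE && o.second == 0xFF;
}

// U+FEFF stored natively on a big-endian platform is a genuine BOM.
static_assert(
    looks_like_big_endian_bom(object_representation(0xFEFF, ByteOrder::big)),
    "native big-endian BOM");

// The noncharacter U+FFFE stored natively on a little-endian platform has
// the same octets, so a byte-stream reader could mistake it for a
// reverse-from-native BOM.
static_assert(
    looks_like_big_endian_bom(object_representation(0xFFFE, ByteOrder::little)),
    "U+FFFE on little-endian reads as FE FF");
```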

>
> re: "antecedent", please use "precedent" or "antecedent example for such"
>
> The recommended practice regarding (non-)use of registered encodings
> having single byte code units for the description of wide encodings ignores
> the fact that the UTF-16 encoding scheme has single byte code units (it is
> the encoding form that has 16-bit code units).
>
> The recommended practice regarding "byte-order agnostic encodings"
> presumably means that the appropriately sized C++ types are expected to
> have the same value for a character on different platforms (regardless of
> the platform endianness). See above in this note re: why the non-Unicode
> entries don't really fit.
>

No, maybe my phrasing is poor, but text_encoding has no endianness
implication whatsoever.
The endianness implications come from the platform (the expectation being
that a little-endian platform will use little-endian byte order to encode
individual wchar_t values). As such, a text_encoding returned from
wide_literal will denote an endianness because wchar_t has implied
semantics, not because text_encoding does.

>
> Concern for SG 16 to evaluate:
> COMP_NAME is not agnostic to Unicode normalization differences. I strongly
> suggest making the name-accepting constructor impose a restriction on
> characters outside the basic character set.
>

I really do not think it's necessary, as the algorithm is well specified
regardless, but I can add a precondition.

> With respect to the prohibition on using id::unknown with `wide_literal`,
> I ask that it be lifted. There is current implementation practice for a
> compiler to accept a (host system) locale name whereby `mbstowcs` is used
> to encode wide literals (and host locales can be user-defined).
>

Sure, I'm okay with that.

>
> On Fri, Oct 1, 2021 at 1:40 PM Tom Honermann via SG16 <
> sg16_at_[hidden]> wrote:
>
>> *Please note that there has been a schedule change.* The previously
>> scheduled telecon for 2021-10-13 has been moved earlier to 2021-10-06. This
>> change was made to accommodate schedule restrictions for the author of the
>> two papers on the agenda below. The shared calendar has been updated (which
>> triggered the sending of new meeting invitations).
>>
>> SG16 will hold a telecon on Wednesday, October *6th* (not the 13th) at
>> 19:30 UTC (timezone conversion
>> <https://www.timeanddate.com/worldclock/converter.html?iso=20211006T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>
>> ).
>>
>> The agenda is:
>>
>> - D2460R0: UTF-16 is standard practice
>>   <https://isocpp.org/files/papers/D2460R0.pdf>
>> - D1885R8: Naming Text Encodings to Demystify Them
>>   <https://isocpp.org/files/papers/D1885R8.pdf>
>> - Discuss and poll issues recently raised on the LEWG and SG16
>>   mailing lists.
>>
>> D2460 is first on the agenda because establishing consensus on it will
>> reduce complications for P1885. We'll plan to spend 30 minutes on D2460 and
>> the remainder of our time on P1885.
>>
>> D2460R0 seeks to address SG16 issue 9
>> <https://github.com/sg16-unicode/sg16/issues/9> (Requiring wchar_t to
>> represent all members of the execution wide character set does not match
>> existing practice). Please read through the comments in that issue.
>>
>> P1885 is back on the agenda to discuss issues raised on the LEWG and SG16
>> mailing lists. The relevant email threads are linked below; there have been
>> a lot.
>>
>> - SG16: Feedback re: P1885R5: Naming Text Encodings
>>   <https://lists.isocpp.org/sg16/2021/07/2490.php>
>>   - Naming issues (to be deferred to LEWG):
>>     - "mib" vs "mib_enum" vs something else.
>>     - Preservation of the "cs" prefix
>> - SG16: P1885: Naming text encodings: Curation and provenance of
>>   aliases <https://lists.isocpp.org/sg16/2021/09/2564.php>
>>   - Implementation lenience with regard to registered aliases.
>>   - Ambiguities between encoding "standards".
>> - SG16: P1885: Naming text encodings: Encodings in the environment
>>   versus registered character sets
>>   <https://lists.isocpp.org/sg16/2021/09/2579.php>
>>   - Latitude for implementations to consider slightly divergent
>>     encodings a match for an IANA registered character set.
>>   - Latitude for use of encodings such as UTF-8 with wchar_t
>>     elements.
>>   - Whether the IANA registry constitutes a sufficient source of
>>     identified encodings.
>> - SG16: P1885: Naming text encodings: problem+solution re:
>>   charsets, octets, and wide encodings
>>   <https://lists.isocpp.org/sg16/2021/09/2584.php>
>>   - Encoding schemes vs encoding forms and how to map the IANA
>>     registry to encodings in C++.
>>   - Whether the IANA registry is fit for all the purposes for which
>>     it is being employed.
>> - SG16: P1885 polling
>>   <https://lists.isocpp.org/sg16/2021/09/2633.php>
>>   - Relevance of IANA specified encodings to wide literal encoding.
>>   - Tagging of big endian vs little endian.
>> - LEWG: P1885: Text encoding aliases() wording suggestion
>>   <https://lists.isocpp.org/lib-ext/2021/08/19633.php>
>>   - Wording recommendations courtesy of Tomasz.
>> - LEWG: P1885: Naming text encodings: R7 wording feedback
>>   <https://lists.isocpp.org/lib-ext/2021/09/20198.php>
>>   - Requirements on encoding names.
>> - LEWG: New P1885 revision, LEWG feedback applied
>>   <https://lists.isocpp.org/lib-ext/2021/09/19963.php>
>>   - Discussion largely captured in the threads linked above.
>>
>> The above threads probe fundamental concerns about the IANA registry and
>> the goals that P1885 strives to fulfill. It probably isn't realistic to
>> expect to resolve them all in a single telecon. Given the amount of
>> discussion that has taken place and the possible perspectives offered, I'm
>> no longer confident that we have a shared deep understanding of the design
>> and intent. Specific points I want to cover include the following.
>>
>> - Is the IANA registry sufficient and appropriate for the
>>   identification of both the ordinary and wide literal encodings?
>> - How is the IANA registry intended to be applied? Which IANA
>>   encoding would be considered a match for each of the following cases?
>>   - Wide literal encoding is UTF-16, sizeof(wchar_t) is 2, CHAR_BIT
>>     is >= 8, little endian architecture.
>>   - Wide literal encoding is UTF-16, sizeof(wchar_t) is 1, CHAR_BIT
>>     is >= 16, architecture endianness is irrelevant since code units are a
>>     single byte.
>>   - Wide literal encoding is UTF-16LE, sizeof(wchar_t) is 1, CHAR_BIT
>>     is >= 8, architecture endianness is irrelevant since code units are a
>>     single byte.
>> - How are conflicts between the IANA registered encoding names and
>>   other names recognized by implementations to be resolved?
>>
>> Please feel free to suggest other topics.
>>
>> Tom.
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>

Received on 2021-10-06 04:35:50