sg16: Re: [SG16] Agenda for the 2021-10-06 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 1 Oct 2021 15:18:45 -0400

Thanks, Corentin. I didn't actually intend these questions to be posed
to you; my goal is to confirm, during the telecon, that 1) we all share
a common understanding, and 2) that we have consensus on it.

I think we have to entertain the idea that the IANA registry may not be
sufficient for what we're trying to achieve with this paper. I would
be sad if that led you to abandon the paper, but there appears to be
some uncertainty regarding its suitability.

With regard to this question:

How is the IANA registry intended to be applied? Which IANA encoding
would be considered a match for each of the following cases?

  * Wide literal encoding is UTF-16, sizeof(wchar_t) is 2,
    CHAR_BIT is >= 8, little endian architecture.
  * Wide literal encoding is UTF-16, sizeof(wchar_t) is 1,
    CHAR_BIT is >= 16, architecture endianness is irrelevant
    since code units are a single byte.
  * Wide literal encoding is UTF-16LE, sizeof(wchar_t) is 1,
    CHAR_BIT is >= 8, architecture endianness is irrelevant
    since code units are a single byte.

This question was intended to demonstrate a conflict or ambiguity that
does not appear to have an obvious solution. The only sensible answers
for each case are UTF16LE and UTF16 (or other or unknown). Either the
first and the second are both UTF16, or the first and the third are both
UTF16LE, or all three are UTF16. Resolving the ambiguity in order to
determine how code points are actually encoded/decoded would require
branching on sizeof(wchar_t) and/or the value of CHAR_BIT, thereby
significantly reducing the utility of the feature.
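
To make the kind of branching concrete, here is a minimal sketch in C++.
The helper name elements_per_code_unit is hypothetical and is not part of
P1885. It assumes, for illustration only, that a wchar_t narrower than 16
bits holds one octet of a serialized UTF-16LE stream (the third case with
CHAR_BIT equal to 8), so that a code unit spans two elements.

```cpp
#include <cstddef>

// Hypothetical helper, not part of P1885: how many wchar_t elements make up
// one 16-bit UTF-16 code unit, judged only by the width of wchar_t in bits.
// Assumption: an element narrower than 16 bits holds a single octet of a
// serialized UTF-16LE stream, so one code unit spans two elements.
constexpr std::size_t elements_per_code_unit(std::size_t wchar_size,
                                             std::size_t char_bit) {
    return (wchar_size * char_bit >= 16) ? 1 : 2;
}

// Case 1: sizeof(wchar_t) == 2, CHAR_BIT == 8
static_assert(elements_per_code_unit(2, 8) == 1, "case 1");
// Case 2: sizeof(wchar_t) == 1, CHAR_BIT == 16
static_assert(elements_per_code_unit(1, 16) == 1, "case 2");
// Case 3: sizeof(wchar_t) == 1, CHAR_BIT == 8 (assumed for illustration)
static_assert(elements_per_code_unit(1, 8) == 2, "case 3");
```

On a real implementation the call would be
elements_per_code_unit(sizeof(wchar_t), CHAR_BIT), with CHAR_BIT taken from
<climits>. If the first and third cases were both reported as UTF16LE, the
reported name alone would not distinguish them and a check like this would
still be needed.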

Tom.

On 10/1/21 2:12 PM, Corentin Jabot wrote:
> Thanks Tom.
>
> The paper proposes in its current form the following
> recommended practices in the wording
>
> • Implementations should prefer returning UTF-16 over UTF-16BE or
> UTF-16LE.
> • Implementations should prefer returning UTF-32 over UTF-32BE or
> UTF-32LE.
> • Implementations should otherwise not consider registered encodings
> to be interchangeable [Example: Shift_JIS and Windows-31J denote
> different encodings].
> • Implementations should not refer to a registered encoding to
> describe another similar yet different non-registered encoding unless
> there is an antecedent on that implementation (Example: Big5).
> • Implementations should not refer to a registered encoding specified
> to have single byte code units to describe a wide encoding.
> • With the exceptions of UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE, wide
> registered character encodings (such as Big5,
> Extended_UNIX_Code_Fixed_Width_for_Japanese, UCS2,
> UCS4, UTF-32, UTF-16) represent byte-order agnostic encodings.
>
> Do we agree with these?
> This is the question I want to answer.
>
> Please note that I'm not willing to entertain whether IANA is or isn't
> the right design, and if you think it's not, please kill the paper.
> If you consider that wide encodings are problematic, I'm willing to
> entertain removing the wide functions, even if I do not think it's
> justified.
>
> Please further note that the limited number of wide encodings
> registered in IANA is merely illustrative of the limited existence of
> these things. Some platforms have unregistered wide encodings and the
> proposal accounts for that by allowing an implementation to return
> unregistered encodings or "unknown".
>
> Please further note that it was ALWAYS the intent of the environment
> functions to be unrelated to the execution encodings.
> Environment encodings are not related to the requirements of character
> types, but of course to fit utf-16 into utf-32 you have to add padding,
> which makes it a different encoding (iconv could not deal with these
> things). The intent is to be useful to users.
> Similarly, the paper does not concern itself with endianness because
> that's historically a Unicode specificity and it's not useful to users
> in the general case. Users want to know that the wide environment is
> "UTF-16" on Windows.
>
> And yes, all of this is loosely defined because the amount of
> legacy we have to deal with prevents precision. If we want a thing
> that does not deal with legacy, we can kill this paper and use utf-8
> everywhere. Kumbaya.
>
> I would also appreciate feedback before the meeting, rather than 4
> months after.
>
> Thanks a lot,
>
> Corentin.
>
>
> On Fri, Oct 1, 2021 at 7:40 PM Tom Honermann via SG16
> <sg16_at_[hidden]> wrote:
>
> *Please note that there has been a schedule change.* The
> previously scheduled telecon for 2021-10-13 has been moved earlier
> to 2021-10-06. This change was made to accommodate schedule
> restrictions for the author of the two papers on the agenda below.
> The shared calendar has been updated (which triggered the sending
> of new meeting invitations).
>
> SG16 will hold a telecon on Wednesday, October *6th* (not the
> 13th) at 19:30 UTC (timezone conversion
> <https://www.timeanddate.com/worldclock/converter.html?iso=20211006T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>).
>
> The agenda is:
>
> * D2460R0: UTF-16 is standard practice
> <https://isocpp.org/files/papers/D2460R0.pdf>
> * D1885R8: Naming Text Encodings to Demystify Them
> <https://isocpp.org/files/papers/D1885R8.pdf>
> o Discuss and poll issues recently raised on the LEWG and
> SG16 mailing lists.
>
> D2460 is first on the agenda because establishing consensus on it
> will reduce complications for P1885. We'll plan to spend 30
> minutes on D2460 and the remainder of our time on P1885.
>
> D2460R0 seeks to address SG16 issue 9
> <https://github.com/sg16-unicode/sg16/issues/9> (Requiring wchar_t
> to represent all members of the execution wide character set does
> not match existing practice). Please read through the comments in
> that issue.
>
> P1885 is back on the agenda to discuss issues raised on the LEWG
> and SG16 mailing lists. The relevant email threads are linked
> below; there have been a lot.
>
> * SG16: Feedback re: P1885R5: Naming Text Encodings
> <https://lists.isocpp.org/sg16/2021/07/2490.php>
> o Naming issues (to be deferred to LEWG):
> + "mib" vs "mib_enum" vs something else.
> + Preservation of the "cs" prefix
> * SG16: P1885: Naming text encodings: Curation and provenance of
> aliases <https://lists.isocpp.org/sg16/2021/09/2564.php>
> o Implementation lenience with regard to registered aliases.
> o Ambiguities between encoding "standards".
> * SG16: P1885: Naming text encodings: Encodings in the
> environment versus registered character sets
> <https://lists.isocpp.org/sg16/2021/09/2579.php>
> o Latitude for implementations to consider slightly
> divergent encodings a match for an IANA registered
> character set.
> o Latitude for use of encodings such as UTF-8 with wchar_t
> elements.
> o Whether the IANA registry constitutes a sufficient source
> of identified encodings.
> * SG16: P1885: Naming text encodings: problem+solution re:
> charsets, octets, and wide encodings
> <https://lists.isocpp.org/sg16/2021/09/2584.php>
> o Encoding schemes vs encoding forms and how to map the IANA
> registry to encodings in C++.
> o Whether the IANA registry is fit for all the purposes for
> which it is being employed.
> * SG16: P1885 polling
> <https://lists.isocpp.org/sg16/2021/09/2633.php>
> o Relevance of IANA specified encodings to wide literal
> encoding.
> o Tagging of big endian vs little endian.
> * LEWG: P1885: Text encoding aliases() wording suggestion
> <https://lists.isocpp.org/lib-ext/2021/08/19633.php>
> o Wording recommendations courtesy of Tomasz.
> * LEWG: P1885: Naming text encodings: R7 wording feedback
> <https://lists.isocpp.org/lib-ext/2021/09/20198.php>
> o Requirements on encoding names.
> * LEWG: New P1885 revision, LEWG feedback applied
> <https://lists.isocpp.org/lib-ext/2021/09/19963.php>
> o Discussion largely captured in the threads linked above.
>
> The above threads probe fundamental concerns about the IANA
> registry and the goals that P1885 strives to fulfill. It probably
> isn't realistic to expect to resolve them all in a single
> telecon. Given the amount of discussion that has taken place and
> the possible perspectives offered, I'm no longer confident that we
> have a shared deep understanding of the design and intent.
> Specific points I want to cover include the following.
>
> * Is the IANA registry sufficient and appropriate for the
> identification of both the ordinary and wide literal encodings?
> * How is the IANA registry intended to be applied? Which IANA
> encoding would be considered a match for each of the following
> cases?
> o Wide literal encoding is UTF-16, sizeof(wchar_t) is 2,
> CHAR_BIT is >= 8, little endian architecture.
> o Wide literal encoding is UTF-16, sizeof(wchar_t) is 1,
> CHAR_BIT is >= 16, architecture endianness is irrelevant
> since code units are a single byte.
> o Wide literal encoding is UTF-16LE, sizeof(wchar_t) is 1,
> CHAR_BIT is >= 8, architecture endianness is irrelevant
> since code units are a single byte.
> * How are conflicts between the IANA registered encoding names
> and other names recognized by implementations to be resolved?
>
> Please feel free to suggest other topics.
>
> Tom.
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2021-10-01 14:18:48