sg16: Re: [SG16] Agenda for the 2021-10-06 SG16 telecon

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Fri, 1 Oct 2021 20:12:34 +0200

Thanks Tom.

In its current form, the paper proposes the following recommended practices
in the wording:

• Implementations should prefer returning UTF-16 over UTF-16BE or UTF-16LE.
• Implementations should prefer returning UTF-32 over UTF-32BE or UTF-32LE.
• Implementations should otherwise not consider registered encodings to be
interchangeable [Example: Shift_JIS and Windows-31J denote different
encodings].
• Implementations should not refer to a registered encoding to describe
another similar yet different non-registered encoding unless there is an
antecedent on that implementation (Example: Big5).
• Implementations should not refer to a registered encoding specified to
have single byte code units to describe a wide encoding.
• With the exceptions of UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE, wide
registered character encodings (such as Big5,
Extended_UNIX_Code_Fixed_Width_for_Japanese, UCS2,
UCS4, UTF-32, UTF-16) represent byte-order agnostic encodings.
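
To make the last bullet concrete, here is a minimal sketch of why "UTF-16" can be
treated as byte-order agnostic while UTF-16LE and UTF-16BE cannot. The names
byte_order, to_octets and from_octets are hypothetical helpers invented for this
illustration; they are not part of the paper. The assumption is that a wide
string is a sequence of 16-bit code unit values, and that byte order only shows
up once those values are written out as octets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names for illustration; none of this is P1885 API.
enum class byte_order { little, big };

// Serialize 16-bit code units to octets in an explicit byte order.
// This is the step where "UTF-16LE" and "UTF-16BE" become distinguishable.
std::vector<std::uint8_t> to_octets(const std::vector<char16_t>& units,
                                    byte_order order) {
    std::vector<std::uint8_t> out;
    out.reserve(units.size() * 2);
    for (char16_t u : units) {
        const auto lo = static_cast<std::uint8_t>(u & 0xFF);
        const auto hi = static_cast<std::uint8_t>((u >> 8) & 0xFF);
        if (order == byte_order::little) {
            out.push_back(lo);
            out.push_back(hi);
        } else {
            out.push_back(hi);
            out.push_back(lo);
        }
    }
    return out;
}

// Rebuild the code units from octets written in the given byte order.
std::vector<char16_t> from_octets(const std::vector<std::uint8_t>& octets,
                                  byte_order order) {
    assert(octets.size() % 2 == 0 && "expects whole 16-bit code units");
    std::vector<char16_t> units;
    units.reserve(octets.size() / 2);
    for (std::size_t i = 0; i < octets.size(); i += 2) {
        const unsigned first = octets[i];
        const unsigned second = octets[i + 1];
        const unsigned hi = (order == byte_order::little) ? second : first;
        const unsigned lo = (order == byte_order::little) ? first : second;
        units.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return units;
}
```

For U+20AC the single code unit is 0x20AC in both cases; only the octet
sequences differ (AC 20 for little endian, 20 AC for big endian), and reading
either one back with the matching order yields the same code unit.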

Do we agree with these?
This is the question I want to answer.

Please note that I'm not willing to entertain whether IANA is or isn't the
right design, and if you think it's not, please kill the paper.
If you consider that wide encodings are problematic, I'm willing to
entertain removing the wide functions, even if I do not think it's
justified.

Please further note that the limited number of wide encodings registered in
IANA is merely illustrative of the limited existence of these things. Some
platforms have unregistered wide encodings and the
proposal accounts for that by allowing an implementation to return
unregistered encodings or "unknown".

Please further note that it was ALWAYS the intent of the environment
functions to be unrelated to the execution encodings.
Environment encodings are not related to the requirements of character
types, but of course to fit UTF-16 into UTF-32 you have to add padding,
which makes it a different encoding (iconv could not deal with these
things). The intent is to be useful to users.
Similarly, the paper does not concern itself with endianness because that's
historically a Unicode specificity and it's not useful to users in the
general case. Users want to know that the wide environment is "UTF-16" on
Windows.
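
One possible reading of the padding remark, sketched under stated assumptions:
if UTF-16 code units are zero-extended into 32-bit elements, the result is no
longer UTF-16 as a converter would expect it, and it is not valid UTF-32
either as soon as a surrogate pair is involved. The function names below
(encode_utf16, widen_units, is_valid_utf32) are hypothetical and exist only
for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one Unicode scalar value as UTF-16 code units.
std::vector<char16_t> encode_utf16(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        return {static_cast<char16_t>(cp)};
    }
    const char32_t v = cp - 0x10000;
    return {static_cast<char16_t>(0xD800 + (v >> 10)),
            static_cast<char16_t>(0xDC00 + (v & 0x3FF))};
}

// Zero-extend each 16-bit code unit into a 32-bit element (the "padding").
std::vector<std::uint32_t> widen_units(const std::vector<char16_t>& units) {
    std::vector<std::uint32_t> out;
    out.reserve(units.size());
    for (char16_t u : units) {
        out.push_back(static_cast<std::uint32_t>(u));
    }
    return out;
}

// A sequence is valid UTF-32 only if every element is a Unicode scalar
// value: at most 0x10FFFF and not in the surrogate range.
bool is_valid_utf32(const std::vector<std::uint32_t>& elems) {
    for (std::uint32_t e : elems) {
        if (e > 0x10FFFF || (e >= 0xD800 && e <= 0xDFFF)) {
            return false;
        }
    }
    return true;
}
```

With U+1F600, encode_utf16 yields D83D DE00, widen_units turns that into the
32-bit elements 0000D83D 0000DE00, and is_valid_utf32 rejects the result
because UTF-32 would require the single element 0001F600.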

And yes, all of this is loosely defined because the amount of legacy
we have to deal with prevents precision. If we want a thing that does not
deal with legacy, we can kill this paper and use UTF-8 everywhere. Kumbaya.

I would also appreciate feedback before the meeting, rather than 4 months
after.

Thanks a lot,

Corentin.

On Fri, Oct 1, 2021 at 7:40 PM Tom Honermann via SG16 <sg16_at_[hidden]>
wrote:

> *Please note that there has been a schedule change.* The previously
> scheduled telecon for 2021-10-13 has been moved earlier to 2021-10-06. This
> change was made to accommodate schedule restrictions for the author of the
> two papers on the agenda below. The shared calendar has been updated (which
> triggered the sending of new meeting invitations).
>
> SG16 will hold a telecon on Wednesday, October *6th* (not the 13th) at
> 19:30 UTC (timezone conversion
> <https://www.timeanddate.com/worldclock/converter.html?iso=20211006T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>
> ).
>
> The agenda is:
>
> - D2460R0: UTF-16 is standard practice
> <https://isocpp.org/files/papers/D2460R0.pdf>
> - D1885R8: Naming Text Encodings to Demystify Them
> <https://isocpp.org/files/papers/D1885R8.pdf>
> - Discuss and poll issues recently raised on the LEWG and SG16
> mailing lists.
>
> D2460 is first on the agenda because establishing consensus on it will
> reduce complications for P1885. We'll plan to spend 30 minutes on D2460 and
> the remainder of our time on P1885.
>
> D2460R0 seeks to address SG16 issue 9
> <https://github.com/sg16-unicode/sg16/issues/9> (Requiring wchar_t to
> represent all members of the execution wide character set does not match
> existing practice). Please read through the comments in that issue.
>
> P1885 is back on the agenda to discuss issues raised on the LEWG and SG16
> mailing lists. The relevant email threads are linked below; there have been
> a lot.
>
> - SG16: Feedback re: P1885R5: Naming Text Encodings
> <https://lists.isocpp.org/sg16/2021/07/2490.php>
> - Naming issues (to be deferred to LEWG):
> - "mib" vs "mib_enum" vs something else.
> - Preservation of the "cs" prefix
> - SG16: P1885: Naming text encodings: Curation and provenance of
> aliases <https://lists.isocpp.org/sg16/2021/09/2564.php>
> - Implementation lenience with regard to registered aliases.
> - Ambiguities between encoding "standards".
> - SG16: P1885: Naming text encodings: Encodings in the environment
> versus registered character sets
> <https://lists.isocpp.org/sg16/2021/09/2579.php>
> - Latitude for implementations to consider slightly divergent
> encodings a match for an IANA registered character set.
> - Latitude for use of encodings such as UTF-8 with wchar_t elements.
> - Whether the IANA registry constitutes a sufficient source of
> identified encodings.
> - SG16: P1885: Naming text encodings: problem+solution re:
> charsets, octets, and wide encodings
> <https://lists.isocpp.org/sg16/2021/09/2584.php>
> - Encoding schemes vs encoding forms and how to map the IANA
> registry to encodings in C++.
> - Whether the IANA registry is fit for all the purposes for which
> it is being employed.
> - SG16: P1885 polling
> <https://lists.isocpp.org/sg16/2021/09/2633.php>
> - Relevance of IANA specified encodings to wide literal encoding.
> - Tagging of big endian vs little endian.
> - LEWG: P1885: Text encoding aliases() wording suggestion
> <https://lists.isocpp.org/lib-ext/2021/08/19633.php>
> - Wording recommendations courtesy of Tomasz.
> - LEWG: P1885: Naming text encodings: R7 wording feedback
> <https://lists.isocpp.org/lib-ext/2021/09/20198.php>
> - Requirements on encoding names.
> - LEWG: New P1885 revision, LEWG feedback applied
> <https://lists.isocpp.org/lib-ext/2021/09/19963.php>
> - Discussion largely captured in the threads linked above.
>
> The above threads probe fundamental concerns about the IANA registry and
> the goals that P1885 strives to fulfill. It probably isn't realistic to
> expect to resolve them all in a single telecon. Given the amount of
> discussion that has taken place and the possible perspectives offered, I'm
> no longer confident that we have a shared deep understanding of the design
> and intent. Specific points I want to cover include the following.
>
> - Is the IANA registry sufficient and appropriate for the
> identification of both the ordinary and wide literal encodings?
> - How is the IANA registry intended to be applied? Which IANA encoding
> would be considered a match for each of the following cases?
> - Wide literal encoding is UTF-16, sizeof(wchar_t) is 2, CHAR_BIT
> is >= 8, little endian architecture.
> - Wide literal encoding is UTF-16, sizeof(wchar_t) is 1, CHAR_BIT
> is >= 16, architecture endianness is irrelevant since code units are a
> single byte.
> - Wide literal encoding is UTF-16LE, sizeof(wchar_t) is 1, CHAR_BIT
> is >= 8, architecture endianness is irrelevant since code units are a
> single byte.
> - How are conflicts between the IANA registered encoding names and
> other names recognized by implementations to be resolved?
>
> Please feel free to suggest other topics.
>
> Tom.
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2021-10-01 13:12:48