Thanks Tom.

In its current form, the paper proposes the following recommended practices in the wording:

• Implementations should prefer returning UTF-16 over UTF-16BE or UTF-16LE.
• Implementations should prefer returning UTF-32 over UTF-32BE or UTF-32LE.
• Implementations should otherwise not consider registered encodings to be interchangeable [Example: Shift_JIS and Windows-31J denote different encodings].
• Implementations should not refer to a registered encoding to describe another similar yet different non-registered encoding unless there is an antecedent on that implementation (Example: Big5).
• Implementations should not refer to a registered encoding specified to have single byte code units to describe a wide encoding.
• With the exceptions of UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE, wide registered character encodings (such as Big5, Extended_UNIX_Code_Fixed_Width_for_Japanese, UCS2, UCS4, UTF-32, UTF-16) represent byte-order agnostic encodings.
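To make the first three bullets concrete, here is a minimal sketch of how an implementation might pick the name it reports. The function preferred_name is hypothetical and is not part of the paper's wording; it only illustrates the intent as I read it.

```cpp
#include <string_view>

// Hypothetical helper (an assumption for illustration, not P1885 wording):
// given the registered name an implementation is about to report for a wide
// encoding, apply the first two recommended practices.
constexpr std::string_view preferred_name(std::string_view name) {
    // Prefer the byte-order agnostic name over the BE/LE variants.
    if (name == "UTF-16LE" || name == "UTF-16BE") return "UTF-16";
    if (name == "UTF-32LE" || name == "UTF-32BE") return "UTF-32";
    // Every other registered name is reported unchanged: registered
    // encodings are otherwise not treated as interchangeable.
    return name;
}

// Compile-time checks of the behaviour described above.
static_assert(preferred_name("UTF-16LE") == "UTF-16");
static_assert(preferred_name("UTF-32BE") == "UTF-32");
static_assert(preferred_name("Shift_JIS") != preferred_name("Windows-31J"));
```

Note that the sketch deliberately has no table mapping Shift_JIS to Windows-31J or the reverse; that absence is the point of the third bullet.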

Do we agree with these?

This is the question I want to answer.

Please note that I'm not willing to entertain whether IANA is or isn't the right design, and if you think it's not, please kill the paper.

If you consider that wide encodings are problematic, I'm willing to entertain removing the wide functions, even if I do not think it's justified.

Please further note that the limited number of wide encodings registered in IANA is merely illustrative of the limited existence of these things. Some platforms have unregistered wide encodings and the proposal accounts for that by allowing an implementation to return unregistered encodings or "unknown".

Please further note that it was ALWAYS the intent of the environment functions to be unrelated to the execution encodings.

Environment encodings are not related to the requirements of character types, but of course to fit UTF-16 into UTF-32 you have to add padding, which makes it a different encoding (iconv could not deal with these things). The intent is to be useful to users.
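One way to read the padding remark, as a rough sketch: if UTF-16 code units are stored in 32-bit elements, each unit is zero-extended, and the result is neither UTF-16 nor UTF-32. The helper names to_utf16 and widen are made up for this illustration, and to_utf16 assumes its input is a valid Unicode scalar value.

```cpp
#include <cstdint>
#include <vector>

// Encode one Unicode scalar value as UTF-16 code units.
// Assumes cp is a valid scalar value (no surrogate or range checks).
std::vector<std::uint16_t> to_utf16(char32_t cp) {
    if (cp < 0x10000) return {static_cast<std::uint16_t>(cp)};
    const char32_t v = cp - 0x10000;
    return {static_cast<std::uint16_t>(0xD800 + (v >> 10)),     // high surrogate
            static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))};  // low surrogate
}

// Hypothetical "UTF-16 in 32-bit elements": every 16-bit code unit is
// zero-extended (padded) to 32 bits.
std::vector<std::uint32_t> widen(const std::vector<std::uint16_t>& units) {
    return std::vector<std::uint32_t>(units.begin(), units.end());
}
```

For U+1F600 this gives the two elements 0x0000D83D and 0x0000DE00, whereas UTF-32 would have the single element 0x0001F600. A converter that only knows "UTF-16" and "UTF-32" would misinterpret the padded form.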

Similarly, the paper does not concern itself with endianness because that's historically a Unicode specificity and it's not useful to users in the general case. Users want to know that the wide environment is "UTF-16" on Windows.

And yes, all of this is loosely defined because the amount of legacy we have to deal with prevents precision. If we want a thing that does not deal with legacy, we can kill this paper and use UTF-8 everywhere. Kumbaya.

I would also appreciate feedback before the meeting, rather than 4 months after.

Thanks a lot,

Corentin.

On Fri, Oct 1, 2021 at 7:40 PM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

Please note that there has been a schedule change. The previously scheduled telecon for 2021-10-13 has been moved earlier to 2021-10-06. This change was made to accommodate schedule restrictions for the author of the two papers on the agenda below. The shared calendar has been updated (which triggered the sending of new meeting invitations).

SG16 will hold a telecon on Wednesday, October 6th (not the 13th) at 19:30 UTC (timezone conversion).

The agenda is:

D2460R0: UTF-16 is standard practice

D1885R8: Naming Text Encodings to Demystify Them

Discuss and poll issues recently raised on the LEWG and SG16 mailing lists.

D2460 is first on the agenda because establishing consensus on it will reduce complications for P1885. We'll plan to spend 30 minutes on D2460 and the remainder of our time on P1885.

D2460R0 seeks to address SG16 issue 9 (Requiring wchar_t to represent all members of the execution wide character set does not match existing practice). Please read through the comments in that issue.

P1885 is back on the agenda to discuss issues raised on the LEWG and SG16 mailing lists. The relevant email threads are linked below; there have been a lot.

SG16: Feedback re: P1885R5: Naming Text Encodings

Naming issues (to be deferred to LEWG):

"mib" vs "mib_enum" vs something else.

Preservation of the "cs" prefix

SG16: P1885: Naming text encodings: Curation and provenance of aliases

Implementation lenience with regard to registered aliases.

Ambiguities between encoding "standards".

SG16: P1885: Naming text encodings: Encodings in the environment versus registered character sets

Latitude for implementations to consider slightly divergent encodings a match for an IANA registered character set.

Latitude for use of encodings such as UTF-8 with wchar_t elements.

Whether the IANA registry constitutes a sufficient source of identified encodings.

SG16: P1885: Naming text encodings: problem+solution re: charsets, octets, and wide encodings

Encoding schemes vs encoding forms and how to map the IANA registry to encodings in C++.

Whether the IANA registry is fit for all the purposes for which it is being employed.

SG16: P1885 polling

Relevance of IANA specified encodings to wide literal encoding.

Tagging of big endian vs little endian.

LEWG: P1885: Text encoding aliases() wording suggestion

Wording recommendations courtesy of Tomasz.

LEWG: P1885: Naming text encodings: R7 wording feedback

Requirements on encoding names.

LEWG: New P1885 revision, LEWG feedback applied

Discussion largely captured in the threads linked above.

The above threads probe fundamental concerns about the IANA registry and the goals that P1885 strives to fulfill. It probably isn't realistic to expect to resolve them all in a single telecon. Given the amount of discussion that has taken place and the possible perspectives offered, I'm no longer confident that we have a shared deep understanding of the design and intent. Specific points I want to cover include the following.

Is the IANA registry sufficient and appropriate for the identification of both the ordinary and wide literal encodings?

How is the IANA registry intended to be applied? Which IANA encoding would be considered a match for each of the following cases?

Wide literal encoding is UTF-16, sizeof(wchar_t) is 2, CHAR_BIT is >= 8, little endian architecture.

Wide literal encoding is UTF-16, sizeof(wchar_t) is 1, CHAR_BIT is >= 16, architecture endianness is irrelevant since code units are a single byte.

Wide literal encoding is UTF-16LE, sizeof(wchar_t) is 1, CHAR_BIT is >= 8, architecture endianness is irrelevant since code units are a single byte.

How are conflicts between the IANA registered encoding names and other names recognized by implementations to be resolved?

Please feel free to suggest other topics.

Tom.

--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16