On Fri, Sep 24, 2021 at 2:53 PM Corentin <corentin.jabot@gmail.com> wrote:

On Fri, Sep 24, 2021 at 2:05 PM Jens Maurer <Jens.Maurer@gmx.net> wrote:
On 24/09/2021 10.17, Corentin wrote:
> Jens, Hubert.
> Are you satisfied with the added recommended practice sections, and other changes?

No.

Looking at https://isocpp.org/files/papers/D1885R8.pdf

"[ Note: The name of each enumerator of the enumeration text_encoding::id is derived from
the alias of each primary name that begins with "cs", as follows"

"that begins with" refers to the "primary name".

Also, the entity we're talking about here is "encoding", not really primary name.
Maybe "... derived from the corresponding alias that begins with "cs", ..."

csUnicode is renamed text_encoding::id::UCS2

"is renamed to"

or maybe better "is mapped to"

Sure
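
To make the mapping under discussion concrete, here is a minimal sketch of how an enumerator spelling could be derived from a "cs" alias. The helper name enumerator_name is hypothetical and not part of the paper; the general rule (drop the leading "cs") is an assumption about what "as follows" expands to, and only the csUnicode special case is taken from the text quoted above.

```cpp
#include <string_view>

// Hypothetical helper, not part of D1885: returns the spelling of the
// text_encoding::id enumerator for a registry alias that begins with "cs".
constexpr std::string_view enumerator_name(std::string_view cs_alias) {
    // Special case quoted in this thread: csUnicode is mapped to UCS2.
    if (cs_alias == "csUnicode") return "UCS2";
    // Assumed general rule: strip the leading "cs".
    if (cs_alias.substr(0, 2) == "cs") return cs_alias.substr(2);
    // Aliases without the prefix are returned unchanged in this sketch.
    return cs_alias;
}

static_assert(enumerator_name("csUTF8") == "UTF8", "prefix is stripped");
static_assert(enumerator_name("csUnicode") == "UCS2", "special case");
```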

I still feel the wording contains insufficient guidance for implementers to do
the right thing.

Consider a little-endian platform with UTF-16 wchar_t. What should wide_literal()
return? UTF16 or UTF16LE ?

Now consider a big-endian platform with UCS-2 wchar_t (because they never caught
up to recent Unicode extensions). There's only UCS-2, although maybe something
like UCS2BE might be the much more appropriate choice.

Same question for UTF-32 = UCS-4 wchar_t.
Should this be UCS4 or UTF32 or UTF32BE/LE?

UTF-32 and UCS-4 are not exactly the same thing (UTF-32 makes code points above 0x10FFFF invalid), even if they are treated as the same in practice,
and in practice everybody uses and expects UTF-32.
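
A small sketch of that difference, assuming the original ISO 10646 definition of UCS-4 as a 31-bit code space; the function names are made up for illustration:

```cpp
// Hypothetical helpers illustrating the UTF-32 / UCS-4 distinction.

// UTF-32 code units must be Unicode scalar values: at most 0x10FFFF
// and outside the surrogate range 0xD800..0xDFFF.
constexpr bool is_valid_utf32(char32_t c) {
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

// Assumption: UCS-4 as originally specified allowed any 31-bit value.
constexpr bool is_valid_ucs4_31bit(char32_t c) {
    return c <= 0x7FFFFFFF;
}

static_assert(is_valid_utf32(0x10FFFF), "last valid UTF-32 value");
static_assert(!is_valid_utf32(0x110000), "beyond the UTF-32 range");
static_assert(is_valid_ucs4_31bit(0x110000), "still inside 31 bits");
```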

UTF32 is an alias for either UTF32BE or UTF32LE; both are correct.
The same goes for UTF16/UCS2/UTF-16LE/UTF-16BE.

UCS2BE is completely made up, so that helps neither implementers nor users.
We could add a recommendation that UTF16/UTF32 be preferred over the names that specify an endianness, as endianness-specific names are a Unicode peculiarity and users will expect UTF-16.
I'm certainly willing to do so, but I'm not sure we want to describe every implementation in the standard.

If I summarize, I think people are asking for front-matter recommended practices.

We have a sentence that says

"How a text_encoding object is determined to be representative of a character encoding implemented in the translation or execution environment is implementation-defined."

We could add beneath

Recommended Practices

Implementations should prefer returning UTF-16 over UTF-16BE or UTF-16LE
Implementations should prefer returning UTF-32 over UTF-32BE or UTF-32LE
Implementations should otherwise not consider registered encodings interchangeable (Example: Shift_JIS and Windows-31J denote different encodings)
Implementations should not refer to a registered encoding to describe another similar yet different non-registered encoding, unless there is precedent for doing so on that implementation (Example: Big 5)
Implementations should not use an encoding specified as single-byte to describe a wide encoding
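
As a rough sketch of what the first three recommendations would mean for an implementation, assuming a cut-down stand-in for text_encoding::id (the enum below and the function preferred_id are hypothetical, not the proposed API):

```cpp
// Hypothetical, reduced stand-in for text_encoding::id.
enum class id {
    UTF16, UTF16BE, UTF16LE,
    UTF32, UTF32BE, UTF32LE,
    ShiftJIS, Windows31J
};

// Maps a detected encoding to the one the recommended practice
// would have the implementation report.
constexpr id preferred_id(id detected) {
    switch (detected) {
        // Prefer the endianness-neutral names for UTF-16 and UTF-32.
        case id::UTF16BE:
        case id::UTF16LE:
            return id::UTF16;
        case id::UTF32BE:
        case id::UTF32LE:
            return id::UTF32;
        // Other registered encodings are not interchangeable:
        // Shift_JIS stays Shift_JIS, Windows-31J stays Windows-31J.
        default:
            return detected;
    }
}

static_assert(preferred_id(id::UTF16LE) == id::UTF16, "endianness dropped");
static_assert(preferred_id(id::Windows31J) == id::Windows31J, "kept as is");
```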

Is that reasonable?

Jens