On Fri, Sep 24, 2021 at 3:24 PM Jens Maurer <Jens.Maurer@gmx.net> wrote:
On 24/09/2021 15.16, Corentin wrote:
>
>
> On Fri, Sep 24, 2021 at 2:53 PM Corentin <corentin.jabot@gmail.com> wrote:
>
>
>
>     On Fri, Sep 24, 2021 at 2:05 PM Jens Maurer <Jens.Maurer@gmx.net> wrote:
>
>         On 24/09/2021 10.17, Corentin wrote:
>         > Jens, Hubert.
>         > Are you satisfied with the added recommended practice sections, and other changes?
>
>         No.
>
>         Looking at https://isocpp.org/files/papers/D1885R8.pdf
>
>         "[ Note: The name of each enumerator of the enumeration text_encoding::id is derived from
>         the alias of each primary name that begins with “cs”, as follows"
>
>         "that begins with" refers to the "primary name".
>
>         Also, the entity we're talking about here is "encoding", not really primary name.
>         Maybe "... derived from the corresponding alias that begins with "cs", ..."
>
>
>         csUnicode is renamed text_encoding::id::UCS2
>
>         "is renamed to"
>
>         or maybe better "is mapped to"
>
>
>     Sure
>      
>
>
>
>         I still feel the wording contains insufficient guidance for implementers to do
>         the right thing.
>
>
>         Consider a little-endian platform with UTF-16 wchar_t.  What should wide_literal()
>         return?  UTF16 or UTF16LE ?
>
>         Now consider a big-endian platform with UCS-2 wchar_t (because they never caught
>         up to recent Unicode extensions).  There's only UCS-2, although maybe something
>         like UCS2BE might be the much more appropriate choice. 
>
>
>         Same question for UTF-32 = UCS-4 wchar_t.
>         Should this be UCS4 or UTF32 or UTF32BE/LE?
>
>
>
>     UTF-32 and UCS-4 are not exactly the same thing, even if they coincide in practice (UTF-32 makes code points above 0x10FFFF invalid),
>     and everybody uses and expects UTF-32.
>
>     UTF32 is an alias for either UTF32BE or UTF32LE; both are correct.
>     Same for UTF16/UCS2/UTF16LE/UTF16BE.
>
>     UCS2BE is completely made up, so that helps neither implementers nor users.
>     We could add a recommendation that UTF16/UTF32 be preferred over the names that specify an endianness, as those are a Unicode peculiarity and users will expect UTF-16,
>     and I'm certainly willing to do so, but... I'm not sure we want to describe every implementation in the standard.
>
>
>
> If I summarize, I think people are asking for front-matter recommended practices.
>
> We have a sentence that says 
>
> "How a text_encoding object is determined to be representative of a character encoding implemented in the translation or execution environment is implementation-defined."
>
> We could add beneath
>
> Recommended Practices
>
>   * Implementations should prefer returning UTF-16 over UTF-16BE or UTF-16LE
>   * Implementations should prefer returning UTF-32 over UTF-32BE or UTF-32LE
>   * Implementations should otherwise not consider registered encodings interchangeable (Example: Shift_JIS and Windows-31J denote different encodings)
>   * Implementations should not use a registered encoding to describe a similar yet different non-registered encoding, unless there is precedent for doing so on that implementation (Example: Big 5)
>   * Implementations should not use an encoding specified as single-byte to describe a wide encoding
>
> Is that reasonable?

Yes, that sounds like progress to me in the clarity of specification.
(People might disagree on whether that particular set of recommendations
is what they want.)
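
For concreteness, the first three recommendations quoted above could be
sketched roughly as follows. This is only an illustration: encoding_id
is a hypothetical stand-in for a handful of text_encoding::id
enumerators, and names_byte_order and preferred are made-up helper
names, not anything from the paper.

```cpp
// Hypothetical stand-in for a few enumerators of text_encoding::id.
// The set of names is illustrative only.
enum class encoding_id {
    other,
    UTF16, UTF16BE, UTF16LE,
    UTF32, UTF32BE, UTF32LE,
    ShiftJIS, Windows31J
};

// True for the ids whose name pins down a byte order.
constexpr bool names_byte_order(encoding_id e) {
    switch (e) {
    case encoding_id::UTF16BE:
    case encoding_id::UTF16LE:
    case encoding_id::UTF32BE:
    case encoding_id::UTF32LE:
        return true;
    default:
        return false;
    }
}

// Map a byte-order-specific Unicode id to the generic one.
// Every other id is returned unchanged: registered encodings are
// not treated as interchangeable.
constexpr encoding_id preferred(encoding_id e) {
    switch (e) {
    case encoding_id::UTF16BE:
    case encoding_id::UTF16LE:
        return encoding_id::UTF16;
    case encoding_id::UTF32BE:
    case encoding_id::UTF32LE:
        return encoding_id::UTF32;
    default:
        return e;
    }
}

static_assert(preferred(encoding_id::UTF16LE) == encoding_id::UTF16, "");
static_assert(preferred(encoding_id::Windows31J) == encoding_id::Windows31J, "");
```

Under this sketch, a little-endian platform with a UTF-16 wchar_t would
report UTF16 rather than UTF16LE, while Shift_JIS and Windows-31J stay
distinct.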

Further questions:  UCS2 says "network byte order".
Do we want to recommend that "network byte order" be ignored
here and for UCS4, consistent with the preference of UTF-16
over the byte-order dependent variants?


I think you mean that?

>  the 2-octet Basic Multilingual Plane, aka Unicode this needs to specify network byte order: the standard does not specify (it is a 16-bit integer space)

To the extent that I can parse that sentence, I am not sure it has any weight.
 

We should review the encoding list again to see whether there are any other
wide encodings that have (possibly implied) byte order assumptions.

From what I understand, only the BE/LE versions of Unicode encodings specify a byte order. We can add that to the wording too.
 

Jens