On Fri, Sep 24, 2021 at 3:24 PM Jens Maurer <Jens.Maurer@gmx.net> wrote:

On 24/09/2021 15.16, Corentin wrote:
>
>
> On Fri, Sep 24, 2021 at 2:53 PM Corentin <corentin.jabot@gmail.com> wrote:
>
>
>
> On Fri, Sep 24, 2021 at 2:05 PM Jens Maurer <Jens.Maurer@gmx.net> wrote:
>
> On 24/09/2021 10.17, Corentin wrote:
> > Jens, Hubert.
> > Are you satisfied with the added recommended practice sections, and other changes?
>
> No.
>
> Looking at https://isocpp.org/files/papers/D1885R8.pdf
>
> "[ Note: The name of each enumerator of the enumeration text_encoding::id is derived from
> the alias of each primary name that begins with "cs", as follows"
>
> "that begins with" refers to the "primary name".
>
> Also, the entity we're talking about here is "encoding", not really primary name.
> Maybe "... derived from the corresponding alias that begins with "cs", ..."
>
>
> csUnicode is renamed text_encoding::id::UCS2
>
> "is renamed to"
>
> or maybe better "is mapped to"
>
>
> Sure
>
>
>
>
> I still feel the wording contains insufficient guidance for implementers to do
> the right thing.
>
>
> Consider a little-endian platform with UTF-16 wchar_t. What should wide_literal()
> return? UTF16 or UTF16LE?
>
> Now consider a big-endian platform with UCS-2 wchar_t (because they never caught
> up to recent Unicode extensions). There's only UCS-2, although maybe something
> like UCS2BE might be the much more appropriate choice.
>
>
> Same question for UTF-32 = UCS-4 wchar_t.
> Should this be UCS4 or UTF32 or UTF32BE/LE?
>
>
>
> UTF-32 and UCS4 are not exactly the same thing, even if in practice they are (UTF-32 makes codepoints over 0x10FFFF invalid),
> and in practice everybody uses and expects UTF-32.
>
> UTF32 is an alias for either UTF32BE or UTF32LE; both are correct.
> Same for UTF16/UCS2/UTF-16LE/UTF-16BE.
>
> UCS2BE is completely made up, so it helps neither implementers nor users.
> We could add a recommendation that UTF16/UTF32 be preferred over the names that specify an endianness, as those names are specific to Unicode and users will expect UTF-16.
> I'm certainly willing to do so, but I'm not sure we want to describe every implementation in the standard.
>
>
>
> To summarize, I think people are asking for front-matter recommended practices.
>
> We have a sentence that says
>
> "How a text_encoding object is determined to be representative of a character encoding implemented in the translation or execution environment is implementation-defined."
>
> We could add beneath
>
> Recommended Practices
>
> * Implementations should prefer returning UTF-16 over UTF-16BE or UTF-16LE
> * Implementations should prefer returning UTF-32 over UTF-32BE or UTF-32LE
> * Implementations should otherwise not consider registered encodings interchangeable (Example: Shift_JIS and Windows-31J denote different encodings)
> * Implementations should not use a registered encoding to describe a similar yet different non-registered encoding, unless there is precedent for doing so on that implementation (Example: Big 5)
> * Implementations should not use an encoding specified as single-byte to describe a wide encoding
>
> Is that reasonable?

Yes, that sounds like progress to me in the clarity of specification.
(People might disagree on whether that particular set of recommendations
is what they want.)
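
For concreteness, here is a minimal sketch of how an implementation might
apply the first three recommendations quoted above. The enum `encoding_id`
and the function `preferred_id` are hypothetical names made up for this
illustration; they are not part of the paper's wording, and the enum lists
only a handful of the registered encodings.

```cpp
#include <cassert>

// Hypothetical stand-in for text_encoding::id; only a few entries are shown.
enum class encoding_id {
    UTF16, UTF16BE, UTF16LE,
    UTF32, UTF32BE, UTF32LE,
    ShiftJIS, Windows31J
};

// Collapse the byte-order-specific Unicode ids into the generic ones.
// Every other id is returned unchanged, so registered encodings such as
// Shift_JIS and Windows-31J are never treated as interchangeable.
constexpr encoding_id preferred_id(encoding_id detected) {
    switch (detected) {
        case encoding_id::UTF16BE:
        case encoding_id::UTF16LE:
            return encoding_id::UTF16;
        case encoding_id::UTF32BE:
        case encoding_id::UTF32LE:
            return encoding_id::UTF32;
        default:
            return detected;
    }
}

// Compile-time spot checks of the mapping.
static_assert(preferred_id(encoding_id::UTF16LE) == encoding_id::UTF16, "");
static_assert(preferred_id(encoding_id::Windows31J) == encoding_id::Windows31J, "");

// Runtime spot checks; call from main() or from a test.
inline void check_preferred_id() {
    assert(preferred_id(encoding_id::UTF32BE) == encoding_id::UTF32);
    assert(preferred_id(encoding_id::ShiftJIS) == encoding_id::ShiftJIS);
}
```

Whether UCS2 and UCS4 should be folded in the same way is exactly the open
question below, so the sketch leaves them out.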

Further questions: UCS2 says "network byte order".
Do we want to recommend that "network byte order" be ignored
here and for UCS4, consistent with the preference of UTF-16
over the byte-order dependent variants?

I think you mean that?

> the 2-octet Basic Multilingual Plane, aka Unicode this needs to specify network byte order: the standard does not specify (it is a 16-bit integer space)

To the extent that I can parse that sentence, I am not sure it has any weight.

We should review the encoding list again whether there are any other
wide encodings that have (possibly implied) byte order assumptions.

From what I understand, only the BE/LE versions of the Unicode encodings specify a byte order. We can add that to the wording too.
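
To illustrate why byte order only matters for the BE/LE names, here is a
small sketch. A UTF-16 code unit held in a char16_t is just a 16-bit
integer; a byte order appears only once it is written out as bytes. The
helper names (`to_utf16be`, `to_utf16le`, `native_bytes`) are made up for
this example, and the sketch assumes a 2-byte char16_t.

```cpp
#include <array>
#include <cassert>
#include <cstring>

using bytes2 = std::array<unsigned char, 2>;

// UTF-16BE serialization: most significant byte first.
constexpr bytes2 to_utf16be(char16_t unit) {
    return {{static_cast<unsigned char>(unit >> 8),
             static_cast<unsigned char>(unit & 0xFF)}};
}

// UTF-16LE serialization: least significant byte first.
constexpr bytes2 to_utf16le(char16_t unit) {
    return {{static_cast<unsigned char>(unit & 0xFF),
             static_cast<unsigned char>(unit >> 8)}};
}

// The object representation of the same code unit on this platform.
// It matches one of the two serializations above, depending on the
// platform's endianness, but the value of the code unit is the same.
inline bytes2 native_bytes(char16_t unit) {
    static_assert(sizeof(char16_t) == 2, "sketch assumes a 2-byte char16_t");
    bytes2 out{};
    std::memcpy(out.data(), &unit, sizeof unit);
    return out;
}

// Runtime spot check using U+20AC (EURO SIGN), code unit 0x20AC.
inline void check_byte_orders() {
    const char16_t euro = char16_t(0x20AC);
    assert((to_utf16be(euro) == bytes2{{0x20, 0xAC}}));
    assert((to_utf16le(euro) == bytes2{{0xAC, 0x20}}));
}
```

Under this reading, a wide literal encoding reported as plain UTF16 says
nothing about byte layout, which seems consistent with preferring it over
the BE/LE variants.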

Jens