Date: Fri, 24 Sep 2021 15:58:44 +0200
On Fri, Sep 24, 2021 at 3:24 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> On 24/09/2021 15.16, Corentin wrote:
> >
> >
> > On Fri, Sep 24, 2021 at 2:53 PM Corentin <corentin.jabot_at_[hidden]
> <mailto:corentin.jabot_at_[hidden]>> wrote:
> >
> >
> >
> > On Fri, Sep 24, 2021 at 2:05 PM Jens Maurer <Jens.Maurer_at_[hidden]
> <mailto:Jens.Maurer_at_[hidden]>> wrote:
> >
> > On 24/09/2021 10.17, Corentin wrote:
> > > Jens, Hubert.
> > > Are you satisfied with the added recommended practice
> sections, and other changes?
> >
> > No.
> >
> > Looking at https://isocpp.org/files/papers/D1885R8.pdf <
> https://isocpp.org/files/papers/D1885R8.pdf>
> >
> > "[ Note: The name of each enumerator of the enumeration
> text_encoding::id is derived from
> > the alias of each primary name that begins with ”cs”, as follows"
> >
> > "that begins with" refers to the "primary name".
> >
> > Also, the entity we're talking about here is "encoding", not
> really primary name.
> > Maybe "... derived from the corresponding alias that begins with
> "cs", ..."
> >
> >
> > csUnicode is renamed text_encoding::id::UCS2
> >
> > "is renamed to"
> >
> > or maybe better "is mapped to"
> >
> >
> > Sure
> >
> >
> >
> >
> > I still feel the wording contains insufficient guidance for
> implementers to do
> > the right thing.
> >
> >
> > Consider a little-endian platform with UTF-16 wchar_t. What
> should wide_literal()
> > return? UTF16 or UTF16LE ?
> >
> > Now consider a big-endian platform with UCS-2 wchar_t (because
> they never caught
> > up to recent Unicode extensions). There's only UCS-2, although
> maybe something
> > like UCS2BE might be the much more appropriate choice.
> >
> >
> > Same question for UTF-32 = UCS-4 wchar_t.
> > Should this be UCS4 or UTF32 or UTF32BE/LE?
> >
> >
> >
> > UTF-32 and UCS4 are not exactly the same thing, even if in practice
> they are (UTF-32 makes codepoints over 0x10FFFF invalid),
> > and in practice everybody uses and expects UTF-32.
> >
> > UTF32 is an alias for either UTF32BE or UTF32LE, both are correct.
> > Same for UTF16/UCS2/UTF-16LE/UTF16-BE
> >
> > UCS2BE is completely made up, so it helps neither implementers nor
> > users.
> > We could add a recommendation that UTF16/UTF32 are preferred over
> > the names that specify an endianness, as this is a Unicode
> > specificity and users will expect UTF-16,
> > and I'm certainly willing to do so, but... I'm not sure we want to
> > describe every implementation in the standard.
> >
> >
> >
> > If I summarize, I think people are asking for a front-matter
> recommended practices
> >
> > We have a sentence that says
> >
> > "How a text_encoding object is determined to be representative of a
> character encoding implemented in the translation or execution environment
> is implementation-defined."
> >
> > We could add beneath
> >
> > Recommended Practices
> >
> > * Implementations should prefer returning UTF-16 over UTF-16BE or
> >   UTF-16LE
> > * Implementations should prefer returning UTF-32 over UTF-32BE or
> >   UTF-32LE
> > * Implementations should otherwise not consider registered encodings
> >   interchangeable (Example: Shift_JIS and Windows-31J denote different
> >   encodings)
> > * Implementations should not refer to a registered encoding to
> >   describe another similar yet different non-registered encoding, unless
> >   there is precedent for doing so on that implementation (Example: Big 5)
> > * Implementations should not use an encoding specified as
> >   single-byte to describe a wide encoding
> >
> > Is that reasonable?
>
> Yes, that sounds like progress to me in the clarity of specification.
> (People might disagree on whether that particular set of recommendations
> is what they want.)
>
> Further questions: UCS2 says "network byte order".
> Do we want to recommend that "network byte order" be ignored
> here and for UCS4, consistent with the preference of UTF-16
> over the byte-order dependent variants?
>
I think you mean this?
> the 2-octet Basic Multilingual Plane, aka Unicode this needs to specify
> network byte order: the standard does not specify (it is a 16-bit integer
> space)
To the extent that I can parse that sentence, I am not sure it carries any
weight.
>
> We should review the encoding list again whether there are any other
> wide encodings that have (possibly implied) byte order assumptions.
>
From what I understand, only the BE/LE versions of Unicode specify a byte
order. We can add that to the wording too.
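
As a rough illustration of the first two recommended practices, here is a
minimal sketch. The enum and the function name are hypothetical stand-ins
made up for this example; they are not the interface proposed in the paper,
and a real implementation may reach the same result differently.

```cpp
// Hypothetical stand-in for a handful of text_encoding::id enumerators.
// This is an assumption for illustration, not the proposed interface.
enum class enc_id { UTF16, UTF16BE, UTF16LE, UTF32, UTF32BE, UTF32LE, other };

// Map a byte-order-specific Unicode encoding to the byte-order-neutral
// name that the recommended practice prefers. Every other encoding is
// returned unchanged, since registered encodings are otherwise not
// treated as interchangeable.
constexpr enc_id preferred(enc_id detected) {
    switch (detected) {
    case enc_id::UTF16BE:
    case enc_id::UTF16LE:
        return enc_id::UTF16;
    case enc_id::UTF32BE:
    case enc_id::UTF32LE:
        return enc_id::UTF32;
    default:
        return detected;
    }
}

static_assert(preferred(enc_id::UTF16LE) == enc_id::UTF16,
              "UTF-16LE should be reported as UTF-16");
static_assert(preferred(enc_id::UTF32BE) == enc_id::UTF32,
              "UTF-32BE should be reported as UTF-32");
```

Under this sketch, a little-endian platform with a UTF-16 wchar_t would
report UTF16 rather than UTF16LE.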
>
> Jens
>
Received on 2021-09-24 08:58:59