Date: Fri, 24 Sep 2021 16:28:01 +0200
On Fri, Sep 24, 2021 at 3:58 PM Corentin <corentin.jabot_at_[hidden]> wrote:
>
>
> On Fri, Sep 24, 2021 at 3:24 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
>> On 24/09/2021 15.16, Corentin wrote:
>> >
>> >
>> > On Fri, Sep 24, 2021 at 2:53 PM Corentin <corentin.jabot_at_[hidden]
>> <mailto:corentin.jabot_at_[hidden]>> wrote:
>> >
>> >
>> >
>> > On Fri, Sep 24, 2021 at 2:05 PM Jens Maurer <Jens.Maurer_at_[hidden]
>> <mailto:Jens.Maurer_at_[hidden]>> wrote:
>> >
>> > On 24/09/2021 10.17, Corentin wrote:
>> > > Jens, Hubert.
>> > > Are you satisfied with the added recommended practice
>> sections, and other changes?
>> >
>> > No.
>> >
>> > Looking at https://isocpp.org/files/papers/D1885R8.pdf
>> >
>> > "[ Note: The name of each enumerator of the enumeration
>> text_encoding::id is derived from
>> > the alias of each primary name that begins with ”cs”, as
>> follows"
>> >
>> > "that begins with" refers to the "primary name".
>> >
>> > Also, the entity we're talking about here is "encoding", not
>> really primary name.
>> > Maybe "... derived from the corresponding alias that begins
>> with "cs", ..."
>> >
>> >
>> > csUnicode is renamed text_encoding::id::UCS2
>> >
>> > "is renamed to"
>> >
>> > or maybe better "is mapped to"
>> >
>> >
>> > Sure
>> >
>> >
>> >
>> >
>> > I still feel the wording contains insufficient guidance for
>> implementers to do
>> > the right thing.
>> >
>> >
>> > Consider a little-endian platform with UTF-16 wchar_t. What
>> should wide_literal()
>> > return? UTF16 or UTF16LE ?
>> >
>> > Now consider a big-endian platform with UCS-2 wchar_t (because
>> they never caught
>> > up to recent Unicode extensions). There's only UCS-2, although
>> maybe something
>> > like UCS2BE might be the much more appropriate choice.
>> >
>> >
>> > Same question for UTF-32 = UCS-4 wchar_t.
>> > Should this be UCS4 or UTF32 or UTF32BE/LE?
>> >
>> >
>> >
>> > UTF-32 and UCS4 are not exactly the same thing, even if in practice
>> they are (UTF-32 makes codepoints over 0x10FFFF invalid),
>> > and in practice everybody uses and expects UTF-32.
>> >
>> > UTF32 is an alias for either UTF32BE or UTF32LE, both are correct.
>> > Same for UTF16/UCS2/UTF-16LE/UTF-16BE
>> >
>> > UCS2BE is completely made up so that helps neither implementer nor
>> users
>> > We could add some recommendation that UTF16/UTF32 are preferred over
>> the names that specify an endianness specifically as this is a Unicode
>> specificity, and users will expect UTF-16
>> > and I'm certainly willing to do so but... I'm not sure we want to
>> describe in the standard every implementation.
>> >
>> >
>> >
>> > If I summarize, I think people are asking for a front-matter
>> recommended practices
>> >
>> > We have a sentence that says
>> >
>> > "How a text_encoding object is determined to be representative of a
>> character encoding implemented in the translation or execution environment
>> is implementation-defined."
>> >
>> > We could add beneath
>> >
>> > Recommended Practices
>> >
>> > * Implementations should prefer returning UTF-16 over UTF-16BE or
>> UTF-16LE
>> > * Implementations should prefer returning UTF-32 over UTF-32BE or
>> UTF-32LE
>> > * Implementations should otherwise not consider registered encodings
>> interchangeable (Example: Shift_JIS and Windows-31J denote different
>> encodings)
>> > * Implementations should not refer to a registered encoding to
>> describe another similar yet different non-registered encoding, unless
>> there is precedent for doing so on that implementation (Example: Big 5)
>> > * Implementations should not use an encoding specified as
>> single-byte to describe a wide encoding
>> >
>> > Is that reasonable?
>>
>> Yes, that sounds like progress to me in the clarity of specification.
>> (People might disagree on whether that particular set of recommendations
>> is what they want.)
>>
>> Further questions: UCS2 says "network byte order".
>> Do we want to recommend that "network byte order" be ignored
>> here and for UCS4, consistent with the preference of UTF-16
>> over the byte-order dependent variants?
>>
>
>
> I think you mean that?
>
> > the 2-octet Basic Multilingual Plane, aka Unicode this needs to specify
> network byte order: the standard does not specify (it is a 16-bit integer
> space)
>
> To the extent that I can parse that sentence, I am not sure it has any
> weight
>
>
>>
>> We should review the encoding list again whether there are any other
>> wide encodings that have (possibly implied) byte order assumptions.
>>
>
> From what I understand, only the BE/LE versions of Unicode specify a byte
> order. We can add that to the wording too
>
Done https://isocpp.org/files/papers/D1885R8.pdf
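
For illustration only, here is a minimal sketch of how the first two
recommended practices quoted above could play out for the wide encoding.
All of the names below (encoding_id, wide_form, byte_order,
recommended_wide_id) are hypothetical stand-ins invented for this example;
they are not the wording or the API of the paper.

```cpp
// Hypothetical stand-in for a handful of text_encoding::id enumerators.
enum class encoding_id {
    UTF16, UTF16BE, UTF16LE,
    UTF32, UTF32BE, UTF32LE
};

// Hypothetical description of what an implementation knows about wchar_t.
enum class wide_form { utf16, utf32 };
enum class byte_order { little, big };

// Sketch of the recommendation: report the byte-order-agnostic name
// (UTF16 or UTF32) and never the BE/LE variants, whatever the platform's
// byte order is. The byte order parameter is accepted but deliberately
// ignored to make that choice explicit.
constexpr encoding_id recommended_wide_id(wide_form form, byte_order /*unused*/) {
    return form == wide_form::utf16 ? encoding_id::UTF16
                                    : encoding_id::UTF32;
}
```

Under these assumptions, a little-endian platform with a UTF-16 wchar_t
would report UTF16 rather than UTF16LE, which is the answer to the
wide_literal() question raised earlier in the thread.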
>
>
>>
>> Jens
>>
>
>
>
Received on 2021-09-24 09:28:14