Date: Thu, 23 Sep 2021 11:48:31 +0200
Thank you for your feedback Jens,
https://isocpp.org/files/papers/D1885R8.pdf
I hope the addition of "recommended practice" sections will resolve the
questions both Hubert and you still have.
On Thu, Sep 23, 2021 at 8:16 AM Jens Maurer via SG16 <sg16_at_[hidden]>
wrote:
> On 23/09/2021 05.16, Hubert Tong via SG16 wrote:
> > On Wed, Sep 22, 2021 at 5:33 PM Tom Honermann via SG16 <
> sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
> >
> > On 9/22/21 5:18 PM, Corentin via SG16 wrote:
> >> Hello,
> >>
> >> I forgot to mention that Bryce asked on the reflector whether we
> want to forward P1885 to electronic polling
> https://lists.isocpp.org/lib-ext/2021/09/20049.php <
> https://lists.isocpp.org/lib-ext/2021/09/20049.php>
> >>
> >> Hubert had some great feedback, which I hope to have addressed
> >> https://isocpp.org/files/papers/D1885R8.pdf <
> https://isocpp.org/files/papers/D1885R8.pdf>
> >>
> >> I'd like to see this paper forwarded, I would love it if either you
> could indicate your support or let me know how I could increase consensus.
> >
> > I assume by forwarded, you mean forwarded from LEWG to electronic
> polling.
> >
> > I'm not aware of anything that would change the SG16 consensus for
> it, but I haven't caught up on the recent discussions on the LEWG mailing
> list yet either. I'll get caught up and follow up as necessary.
> >
> > I think there have been a few surprises:
> > Under the author's preferred interpretation of the charsets represented
> by the IANA Character Set Registry, few of the registered encodings are
> wide encodings. The wording is designed towards high implementation
> freedom, so I am not sure how much of the author's intent is going to be
> apparent to implementers (especially if individuals not directly
> participating in the threads of discussion happen to be the people who end
> up doing the implementation).
>
>
> > Also, (this is new information to me and I expect to most people as
> well) the paper's prose points to GCC's -fwide-exec-charset option, which
> really only works if the option specifies a correctly-sized wide encoding
> that iconv recognizes.
> >
> > Observe:
> > $ gcc -fwide-exec-charset=ISO8859-1 -fsyntax-only -xc++ -<<<$'extern
> char x[L\'0\'], x[0x30];'
> > <stdin>:1:28: error: conflicting declaration 'char x [48]'
> > <stdin>:1:13: note: previous declaration as 'char x [805306368]'
> >
> > So, insofar as the example in the prose is concerned, there would need
> to be an iconv name for the appropriately-sized wide EBCDIC encoding.
>
> I am confused.
>
> The prose text says:
>
> "Note: Because they have different code units sizes, narrow and wide
> strings have
> different encodings."
>
> I'd thus expect different enum ids for wide and narrow strings in the list,
> but the use of
>
> g++ -fwide-exec-charset=EBCDIC-US
> [...]
> Wide Literal Encoding: EBCDIC-US (iana mib: 2078)
>
> in the example is contrary to that statement, assuming that
> EBCDIC-US is generally an 8-bit encoding, not a wide encoding.
>
>
Yes, this example, while conforming, would not be recommended practice;
I modified it.
>
> Then, we have
>
> "Identifying Encodings
>
> [...]
>
> Fortunately there exist a database of registered encoding
> covering almost all encodings supported by operating systems and
> compilers. This database
> is maintained by IANA through a process described by [rfc2978].
> This database lists over 250 registered character sets and for each:"
>
> This sentence moves from the goal of talking about "encodings" to
> the term "character set" without any further explanation.
> This should describe that IANA / the RFC calls a "character set" what
> we believe is an "encoding".
>
An encoding maps directly to a character set; the reverse is not true.
See further down.
>
>
> Regarding the wording:
>
> "registered-character-set" is not a grammar term (fine), thus should
> not be hyphenated (we can have defined English phrases, not just defined
> single
> words, in the standard). Make sure to italicize a word only on its
> definition, not everywhere.
>
Done.
In addition, I renamed it to registered character set to avoid confusion
(I agree the term is confusing).
> The choice of words "registered-character-set" proliferates the
> misunderstanding
> that the IANA registry talks about character sets; it talks about
> encodings.
> The C++ wording should thus use "encoding" in its description; possibly
> with
> a note explaining that IANA / the RFC mistakenly calls them character sets.
>
In addition to renaming the term, a note was added.
>
> "IANA Character Sets registry" needs a reference to the RFC
> establishing that registry. I think we can get away with adding
> that reference to the bibliography (not the normative references).
>
There was a reference already; I added a date.
The reference is what I believe to be the primary reference of interest; it
itself refers to a few more documents.
https://www.iana.org/assignments/character-sets/character-sets.xhtml
Do you think the standard needs to refer to everything directly?
Hubert observed a few months ago that IANA took over the original RFCs.
>
> "The set of known registered-character-set"
> Make the "registered character set" plural: "the set of oranges", not
> "the set of orange"
>
Done
> registery -> registry
>
Done
> "implementation-defined snapshot" conflicts with
> "Each known registered-character-set is identified by an enumerator in
> text_encoding::id"
>
> It's unclear whether an implementation is supposed to add enumerators on
> its own,
> or not. (Personally, I think due to the low change frequency of the list,
> we should just maintain the master copy of the enumerators in the standard,
> which would also allow us to fix the typos and inconsistencies.
>
We have been over this a few times; Hubert was adamant that the snapshot was
useful. I did remove it, but added a date to the bibliography reference.
> Oh, we do fix some of the typos. Can we consistently spell "Windows"
> with an uppercase "W", please?)
>
This has been discussed.
Do you want SG16/LEWG to reopen those discussions?
I would rather not.
> "primary-name": not a grammar term, remove hyphen
> (I don't particularly care whether the RFC hyphenates it; we seem
> to have a stand-alone definition here.)
>
Done
>
> 0 -> zero
>
Done
>
> "Its primary name" has ambiguous antecedent.
> same for "Its set of aliases"
>
Fixed
>
> "No two registered-character-set": change to plural: "no two oranges are
> the same"
>
Fixed
>
> "How a text_encoding object": needs monospace font for text_encoding
>
> "if two strings [...] are equal"
> -> "if the two strings a and b [...] are equal"
>
> add comma after "left-to-right"
>
> "0 characters" -> presumably, you mean '0' (digit zero) characters?
> Then use ''. Otherwise, use "null character" or "U+0000".
>
> "optionally followed by code units outside of the
> ranges [a-z], [A-Z], [0-9]"
>
> What? code units are not in a range designated by regex character ranges.
> You were talking about characters (I presume code points) all along, so
> maybe you should avoid "code unit" entirely here.
>
Now reads:
Let bool COMP_NAME(string_view a, string_view b) be a function that returns
true if the two strings a and b encoded in the literal encoding are equal
ignoring, from left-to-right,
• all elements not in the basic character set,
• all elements which are not digits or letters [character.seq.general],
• character case, and
• any sequence of one or more '0' character not immediately preceded by a
  sequence consisting of a digit in the range [1-9] optionally followed by
  one or more element which are not digits or letters.
[ Note: This comparison is identical to the "Charset Alias Matching"
algorithm described in the Unicode Technical Standard 22. — end note ]
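As a rough illustration of the comparison quoted above, here is a minimal
C++ sketch. It assumes the names are plain ASCII (the wording itself speaks
of the literal encoding), and the helper names normalize_name and comp_name
are hypothetical; they are not taken from the paper.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// ASCII-only classification helpers (assumption: names are ASCII).
constexpr bool is_digit(char c) { return c >= '0' && c <= '9'; }
constexpr bool is_alpha(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
constexpr char to_lower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Reduce a name to the characters that take part in the comparison:
// drop everything that is not a letter or digit, fold case, and drop
// each '0' that does not follow a digit (ignored characters in between
// do not break the "follows a digit" relationship).
std::string normalize_name(std::string_view s) {
    std::string out;
    bool after_digit = false;  // last letter-or-digit seen was a digit
    for (char c : s) {
        if (is_digit(c)) {
            if (c == '0' && !after_digit) continue;  // leading zero
            after_digit = true;
            out.push_back(c);
        } else if (is_alpha(c)) {
            after_digit = false;
            out.push_back(to_lower(c));
        }
        // Any other character is ignored and leaves after_digit as is.
    }
    return out;
}

// Hypothetical stand-in for COMP_NAME: equal after normalization.
bool comp_name(std::string_view a, std::string_view b) {
    return normalize_name(a) == normalize_name(b);
}
```

With this sketch, "ISO-8859-1" matches "iso88591" and "IBM037" matches
"ibm-37", while "ISO-8859-1" and "ISO-8859-10" stay distinct. Note that
"8859-01" normalizes to "885901": the zero follows a digit once the hyphen
is ignored, so it is kept.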
> Who came up with that comparison algorithm? Probably needs a
> cross-reference
> so nobody blames us.
> "string literal character encoding"
>
See above
> That doesn't exist. Do you want to talk about a literal encoding?
> Or something else?
>
Sure
>
> Do we expect implementations to accept text_encoding("csIBBM904") and
> interpret it as "IBM904"?
>
Yes, that has been discussed with Hubert; renaming aliases would defeat the
purpose of aliases.
> The postcondition for "text_encoding(id mib)" seems to imply that a
> name lookup must be done here. I thought we didn't want that.
>
No, we added the is_ template functions to avoid the lookup.
>
> "is a ntbs" -> "is an NTBS"
>
> "range [name(), strlen(name())+1]"
>
> use monospace font
>
> "implementation-defined (wide) character encoding of the environment"
>
> No italics for "character encoding". Also, I don't know what that
> term means. We can talk about "execution (wide) character set"
> and its encoding, but everything else seems undefined for now.
This entire paragraph now reads:
Returns an implementation-defined value representing the wide encoding of
the environment.
On a POSIX implementation, this is the wide encoding associated with the
POSIX locale denoted by the empty string "".
[ Note: This function is not affected by calls to setlocale. It is
unspecified whether this function is affected by changes to environment
variables during the lifetime of the program. The encoding represented by
the returned value of this function, if any, is not required to meet the
preconditions of all the standard wide character functions. — end note ]
Recommended practice: Implementations should return a value that represents
an encoding whose code unit size matches the size of a single wchar_t.
> [text.encoding.aliases]
> "model" is in monospace font once; it shouldn't be.
>
Fixed
>
> Regarding the editor's note: We should show a reasonably complete
> class definition in the standard.
>
Tomasz had a different opinion; I'm sure LWG will tell me what to do.
>
> static consteval text_encoding literal();
> static consteval text_encoding wide_literal();
>
> Use "literal encoding" terms and add a cross-reference
> to the core language section where they're defined (by my paper).
>
Done
>
> The normative text needs more notes/examples to show how the
> seemingly narrow-only IANA "charsets" are intended to map to
> the return values for wide encodings.
>
See above
>
>
> [text.encoding.comp]
> constexpr bool operator==(const text_encoding & a, const text_encoding &
> b) const noexcept;
>
> The two "return" mentions can go.
>
Done
>
>
> "Remarks: This operator induces an equivalence relation on its arguments
> if and only if i != id::other is true."
>
> So, I'm required NOT to offer an equivalence relation if i == id::other?
> That doesn't work with the specific "Returns" clause.
> Oh, and I don't know what an equivalence relation is on things
> of different types. I think the "remarks" should just go.
>
Tomasz was adamant this was useful. I'm sure LWG will tell me what to do.
>
> "narrow strings" don't exist in the standard. Do you mean
> "strings whose elements are of type char"?
> Same for "wide strings".
>
Now reads "The wide execution encoding associated with the locale..." which
I think matches library wording.
>
> Jens
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
Received on 2021-09-23 04:48:46