Date: Thu, 23 Sep 2021 08:16:09 +0200
On 23/09/2021 05.16, Hubert Tong via SG16 wrote:
> On Wed, Sep 22, 2021 at 5:33 PM Tom Honermann via SG16 <sg16_at_[hidden]> wrote:
>
> On 9/22/21 5:18 PM, Corentin via SG16 wrote:
>> Hello,
>>
>> I forgot to mention that Bryce asked on the reflector whether we want to forward P1885 to electronic polling: https://lists.isocpp.org/lib-ext/2021/09/20049.php
>>
>> Hubert had some great feedback, which I hope to have addressed:
>> https://isocpp.org/files/papers/D1885R8.pdf
>>
>> I'd like to see this paper forwarded, I would love it if either you could indicate your support or let me know how I could increase consensus.
>
> I assume by forwarded, you mean forwarded from LEWG to electronic polling.
>
> I'm not aware of anything that would change the SG16 consensus for it, but I haven't caught up on the recent discussions on the LEWG mailing list yet either. I'll get caught up and follow up as necessary.
>
> I think there have been a few surprises:
> Under the author's preferred interpretation of the charsets represented by the IANA Character Set Registry, few of the registered encodings are wide encodings. The wording is designed towards high implementation freedom, so I am not sure how much of the author's intent is going to be apparent to implementers (especially if individuals not directly participating in the threads of discussion happen to be the people who end up doing the implementation).
> Also, (this is new information to me and I expect to most people as well) the paper's prose points to GCC's -fwide-exec-charset option, which really only works if the option specifies a correctly-sized wide encoding that iconv recognizes.
>
> Observe:
> $ gcc -fwide-exec-charset=ISO8859-1 -fsyntax-only -xc++ -<<<$'extern char x[L\'0\'], x[0x30];'
> <stdin>:1:28: error: conflicting declaration 'char x [48]'
> <stdin>:1:13: note: previous declaration as 'char x [805306368]'
>
> So, insofar as the example in the prose is concerned, there would need to be an iconv name for the appropriately-sized wide EBCDIC encoding.
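
For reference, the array bound in the "previous declaration" note is 0x30000000, i.e. the single ISO 8859-1 byte 0x30 for '0' ending up in the most significant byte of a 32-bit value. The snippet below only checks that arithmetic. The helper name is hypothetical, the 32-bit width of wchar_t is an assumption, and nothing here claims to describe how GCC actually arrives at the value.

```cpp
#include <cstdint>

// Hypothetical helper: place one 8-bit code unit in the most significant
// byte of a 32-bit value (assumes a 32-bit wchar_t, as on typical Linux targets).
constexpr std::uint32_t high_byte_value(std::uint8_t code_unit) {
    return static_cast<std::uint32_t>(code_unit) << 24;
}

// 0x30 is the ISO 8859-1 code unit for the digit '0'.
static_assert(high_byte_value(0x30) == 0x30000000u, "hex form");
static_assert(high_byte_value(0x30) == 805306368u, "decimal form from the diagnostic");
```
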
I am confused.

The prose text says:

"Note: Because they have different code units sizes, narrow and wide strings have
different encodings."

I'd thus expect different enum ids for wide and narrow strings in the list,
but the use of

  g++ -fwide-exec-charset=EBCDIC-US
  [...]
  Wide Literal Encoding: EBCDIC-US (iana mib: 2078)

in the example is contrary to that statement, assuming that
EBCDIC-US is generally an 8-bit encoding, not a wide encoding.

Then, we have

"Identifying Encodings
[...]
Fortunately there exist a database of registered encoding
covering almost all encodings supported by operating systems and compilers. This database
is maintained by IANA through a process described by [rfc2978].
This database lists over 250 registered character sets and for each:"

This sentence moves from the goal of talking about "encodings" to
the term "character set" without any further explanation.
This should describe that IANA / the RFC calls a "character set" what
we believe is an "encoding".

Regarding the wording:

"registered-character-set" is not a grammar term (fine), thus should
not be hyphenated (we can have defined English phrases, not just defined single
words, in the standard). Make sure to italicize a word only on its
definition, not everywhere.

The choice of words "registered-character-set" proliferates the misunderstanding
that the IANA registry talks about character sets; it talks about encodings.
The C++ wording should thus use "encoding" in its description; possibly with
a note explaining that IANA / the RFC mistakenly calls them character sets.

"IANA Character Sets registry" needs a reference to the RFC
establishing that registry. I think we can get away with adding
that reference to the bibliography (not the normative references).

"The set of known registered-character-set"
Make the "registered character set" plural: "the set of oranges", not
"the set of orange"

registery -> registry

"implementation-defined snapshot" conflicts with
"Each known registered-character-set is identified by an enumerator in text_encoding::id"
It's unclear whether an implementation is supposed to add enumerators on its own,
or not. (Personally, I think due to the low change frequency of the list,
we should just maintain the master copy of the enumerators in the standard,
which would also allow us to fix the typos and inconsistencies.
Oh, we do fix some of the typos. Can we consistently spell "Windows"
with an uppercase "W", please?)

"primary-name": not a grammar term, remove hyphen
(I don't particularly care whether the RFC hyphenates it; we seem
to have a stand-alone definition here.)

0 -> zero

"Its primary name" has ambiguous antecedent.
same for "Its set of aliases"

"No two registered-character-set": change to plural: "no two oranges are the same"

"How a text_encoding object": needs monospace font for text_encoding

"if two strings [...] are equal"
-> "if the two strings a and b [...] are equal"

add comma after "left-to-right"

"0 characters" -> presumably, you mean '0' (digit zero) characters?
Then use ''. Otherwise, use "null character" or "U+0000".

"optionally followed by code units outside of the
ranges [a-z], [A-Z], [0-9]"
What? code units are not in a range designated by regex character ranges.
You were talking about characters (I presume code points) all along, so
maybe you should avoid "code unit" entirely here.

Who came up with that comparison algorithm? Probably needs a cross-reference
so nobody blames us.

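
To make the comparison under discussion concrete, here is a minimal sketch. It assumes the rule is the charset alias matching of Unicode Technical Standard #22 (keep only a-z, A-Z, 0-9; ignore case; going left to right, drop each '0' that is not preceded by a digit). The quoted wording does not name its source, so that attribution is an assumption, and the names is_digit, is_alpha, to_lower, cursor and alias_match are hypothetical. The character tests assume an ASCII-compatible execution character set.

```cpp
#include <cstddef>
#include <string_view>

// ASCII-only helpers (assumption: letters and digits are contiguous).
constexpr bool is_digit(char c) { return c >= '0' && c <= '9'; }
constexpr bool is_alpha(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
constexpr char to_lower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Walks a name and yields only the characters that survive normalization.
struct cursor {
    std::string_view s;
    std::size_t pos = 0;
    bool prev_digit = false;  // was the last *kept* character a digit?

    // Returns the next kept character (lowercased), or '\0' at the end.
    constexpr char next() {
        while (pos < s.size()) {
            const char c = s[pos++];
            if (!is_alpha(c) && !is_digit(c)) continue;  // drop non-alphanumerics
            if (c == '0' && !prev_digit) continue;       // drop leading zeros
            prev_digit = is_digit(c);
            return to_lower(c);                          // compare case-insensitively
        }
        return '\0';
    }
};

// True if both names normalize to the same character sequence.
constexpr bool alias_match(std::string_view a, std::string_view b) {
    cursor ca{a};
    cursor cb{b};
    for (;;) {
        const char x = ca.next();
        const char y = cb.next();
        if (x != y) return false;
        if (x == '\0') return true;
    }
}

static_assert(alias_match("UTF-8", "utf8"), "punctuation and case are ignored");
static_assert(alias_match("u.t.f-008", "utf8"), "zeros not preceded by a digit are dropped");
static_assert(!alias_match("utf-80", "utf8"), "a zero after a digit is significant");
```

Under this rule, two names only match if they differ in case, punctuation, or leading zeros; any further equivalence has to come from the registry's alias list.
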
"string literal character encoding"
That doesn't exist. Do you want to talk about a literal encoding?
Or something else?

Do we expect implementations to accept text_encoding("csIBBM904") and
interpret it as "IBM904"?

The postcondition for "text_encoding(id mib)" seems to imply that a
name lookup must be done here. I thought we didn't want that.

"is a ntbs" -> "is an NTBS"

"range [name(), strlen(name())+1]"
use monospace font

"implementation-defined (wide) character encoding of the environment"
No italics for "character encoding". Also, I don't know what that
term means. We can talk about "execution (wide) character set"
and its encoding, but everything else seems undefined for now.

[text.encoding.aliases]

"model" is in monospace font once; it shouldn't be.

Regarding the editor's note: We should show a reasonably complete
class definition in the standard.

  static consteval text_encoding literal();
  static consteval text_encoding wide_literal();

Use "literal encoding" terms and add a cross-reference
to the core language section where they're defined (by my paper).

The normative text needs more notes/examples to show how the
seemingly narrow-only IANA "charsets" are intended to map to
the return values for wide encodings.

[text.encoding.comp]

  constexpr bool operator==(const text_encoding & a, const text_encoding & b) const noexcept;

The two "return" mentions can go.

"Remarks: This operator induces an equivalence relation on its arguments
if and only if i != id::other is true."
So, I'm required NOT to offer an equivalence relation if i == id::other?
That doesn't work with the specific "Returns" clause.
Oh, and I don't know what an equivalence relation is on things
of different types. I think the "remarks" should just go.

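
One possible reading of what the Remarks paragraph is trying to convey, sketched with a hypothetical toy_encoding type. The "Returns" clauses are not quoted above, so both operator== bodies below are assumptions, and the name comparison is simplified to exact equality. The enumerator values 1, 3 and 106 are the IANA MIBenum values for "other", US-ASCII and UTF-8.

```cpp
#include <string_view>

// Hypothetical stand-ins; only what the example needs.
enum class enc_id { other = 1, ASCII = 3, UTF8 = 106 };

struct toy_encoding {
    enc_id mib;
    std::string_view name;
};

// Assumed behavior: unregistered encodings are distinguished by name,
// registered ones by their MIB value.
constexpr bool operator==(const toy_encoding& a, const toy_encoding& b) {
    if (a.mib == enc_id::other && b.mib == enc_id::other)
        return a.name == b.name;  // simplification: exact comparison
    return a.mib == b.mib;
}

// Assumed behavior: compare only the MIB value.
constexpr bool operator==(const toy_encoding& e, enc_id i) {
    return e.mib == i;
}

constexpr toy_encoding enc_a{enc_id::other, "vendor-a"};
constexpr toy_encoding enc_b{enc_id::other, "vendor-b"};

// Both compare equal to enc_id::other, yet they are not equal to each other.
static_assert(enc_a == enc_id::other, "enc_a is unregistered");
static_assert(enc_b == enc_id::other, "enc_b is unregistered");
static_assert(!(enc_a == enc_b), "different unregistered encodings");
```

If that is the intent, it is a statement about what comparing against id::other can tell the caller, not a property the implementation has to provide or withhold.
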
"narrow strings" don't exist in the standard. Do you mean
"strings whose elements are of type char"?
Same for "wide strings".
Jens
Received on 2021-09-23 01:16:15