Thank you for your feedback, Jens.
https://isocpp.org/files/papers/D1885R8.pdf

I hope the addition of "recommended practice" sections will resolve the questions both Hubert and you still have.


On Thu, Sep 23, 2021 at 8:16 AM Jens Maurer via SG16 <sg16@lists.isocpp.org> wrote:
On 23/09/2021 05.16, Hubert Tong via SG16 wrote:
> On Wed, Sep 22, 2021 at 5:33 PM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:
>
>     On 9/22/21 5:18 PM, Corentin via SG16 wrote:
>>     Hello,
>>
>>     I forgot to mention that Bryce asked on the reflector whether we want to forward P1885 to electronic polling: https://lists.isocpp.org/lib-ext/2021/09/20049.php
>>
>>     Hubert had some great feedback, which I hope to have addressed
>>     https://isocpp.org/files/papers/D1885R8.pdf
>>
>>     I'd like to see this paper forwarded. I would love it if you could either indicate your support or let me know how I could increase consensus.
>
>     I assume by forwarded, you mean forwarded from LEWG to electronic polling.
>
>     I'm not aware of anything that would change the SG16 consensus for it, but I haven't caught up on the recent discussions on the LEWG mailing list yet either. I'll get caught up and follow up as necessary.
>
> I think there have been a few surprises:
> Under the author's preferred interpretation of the charsets represented by the IANA Character Set Registry, few of the registered encodings are wide encodings. The wording is designed towards high implementation freedom, so I am not sure how much of the author's intent is going to be apparent to implementers (especially if individuals not directly participating in the threads of discussion happen to be the people who end up doing the implementation).


> Also, (this is new information to me and I expect to most people as well) the paper's prose points to GCC's -fwide-exec-charset option, which really only works if the option specifies a correctly-sized wide encoding that iconv recognizes.
>
> Observe:
> $ gcc -fwide-exec-charset=ISO8859-1 -fsyntax-only -xc++ -<<<$'extern char x[L\'0\'], x[0x30];'
> <stdin>:1:28: error: conflicting declaration 'char x [48]'
> <stdin>:1:13: note: previous declaration as 'char x [805306368]'
>
> So, insofar as the example in the prose is concerned, there would need to be an iconv name for the appropriately-sized wide EBCDIC encoding.
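A quick check of the number in the diagnostic above: 805306368 is 0x30000000, i.e. the code unit 0x30 ('0' in ISO 8859-1) shifted into the top byte of a 32-bit value. The sketch below only verifies that arithmetic. The helper name widen_msb is hypothetical, and the reading that the single converted byte ends up in the most significant byte of a 32-bit wchar_t is an assumption, not something stated in this thread.

```cpp
#include <cstdint>

// Hypothetical helper (not part of GCC or the paper): place one 8-bit code
// unit in the most significant byte of a 32-bit value.
constexpr std::uint32_t widen_msb(std::uint8_t unit) {
    return static_cast<std::uint32_t>(unit) << 24;
}

// 0x30 is the digit '0' in ISO 8859-1 (and ASCII).
static_assert(widen_msb(0x30) == 0x30000000u, "0x30 in the top byte");
static_assert(widen_msb(0x30) == 805306368u, "matches 'char x [805306368]'");
```

The second bound in the declaration, x[0x30], is 48, which is the 'char x [48]' in the error message.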

I am confused.

The prose text says:

"Note: Because they have different code units sizes, narrow and wide strings have
different encodings."

I'd thus expect different enum ids for wide and narrow strings in the list,
but the use of

g++ -fwide-exec-charset=EBCDIC-US
[...]
Wide Literal Encoding: EBCDIC-US (iana mib: 2078)

in the example is contrary to that statement, assuming that
EBCDIC-US is generally an 8-bit encoding, not a wide encoding.


Yes, this example, while conforming, would not be recommended practice,
so I modified it.
 

Then, we have

"Identifying Encodings

[...]

Fortunately there exist a database of registered encoding
covering almost all encodings supported by operating systems and compilers. This database
is maintained by IANA through a process described by [rfc2978].
This database lists over 250 registered character sets and for each:"

This sentence moves from the goal of talking about "encodings" to
the term "character set" without any further explanation.
This should describe that IANA / the RFC calls a "character set" what
we believe is an "encoding".


An encoding maps directly to a character set; the reverse is not true.
See further down.


Regarding the wording:

"registered-character-set" is not a grammar term (fine), thus should
not be hyphenated (we can have defined English phrases, not just defined single
words, in the standard).  Make sure to italicize a word only on its
definition, not everywhere.

Done.
In addition, I renamed it to registered character set to avoid confusion (I agree the term is confusing).


The choice of words "registered-character-set" proliferates the misunderstanding
that the IANA registry talks about character sets; it talks about encodings.
The C++ wording should thus use "encoding" in its description; possibly with
a note explaining that IANA / the RFC mistakenly calls them character sets.

In addition to renaming the term, a note was added.
 

"IANA Character Sets registry" needs a reference to the RFC
establishing that registry.  I think we can get away with adding
that reference to the bibliography (not the normative references).

There was a reference already; I did add a date.
The reference is what I believe to be the primary reference of interest. It refers, in turn, to a few more documents.

https://www.iana.org/assignments/character-sets/character-sets.xhtml
Do you think the standard needs to refer to everything directly?
Hubert observed a few months ago that IANA took over the original RFCs.
 

"The set of known registered-character-set"
Make the "registered character set" plural: "the set of oranges", not
"the set of orange"

Done 
registery -> registry
 
Done
 
"implementation-defined snapshot" conflicts with
"Each known registered-character-set is identified by an enumerator in text_encoding::id"

It's unclear whether an implementation is supposed to add enumerators on its own,
or not.  (Personally, I think due to the low change frequency of the list,
we should just maintain the master copy of the enumerators in the standard,
which would also allow us to fix the typos and inconsistencies.

We have been over this a few times; Hubert was adamant that "snapshot" was useful.
I did remove it, but added a date to the bibliography reference.
 
Oh, we do fix some of the typos. Can we consistently spell "Windows"
with an uppercase "W", please?)

This has been discussed.
Do you want SG16/LEWG to reopen those discussions?
I would rather not.


"primary-name": not a grammar term, remove hyphen
(I don't particularly care whether the RFC hyphenates it; we seem
to have a stand-alone definition here.)

Done
 

0 -> zero

Done
 

"Its primary name" has ambiguous antecedent.
same for "Its set of aliases"

Fixed
 

"No two registered-character-set": change to plural: "no two oranges are the same"

Fixed 

"How a text_encoding object": needs monospace font for text_encoding

"if two strings [...] are equal"
-> "if the two strings a and b [...] are equal"

add comma after "left-to-right"

"0 characters" -> presumably, you mean '0' (digit zero) characters?
Then use ''. Otherwise, use "null character" or "U+0000".

"optionally followed by code units outside of the
ranges [a-z], [A-Z], [0-9]"

What? code units are not in a range designated by regex character ranges.
You were talking about characters (I presume code points) all along, so
maybe you should avoid "code unit" entirely here.

Now reads

Let bool COMP_NAME(string_view a, string_view b) be a function that returns true if the two
strings a and b encoded in the literal encoding are equal ignoring, from left-to-right,
• all elements not in the basic character set,
• all elements which are not digits or letters [character.seq.general],
• character case, and
• any sequence of one or more '0' characters not immediately preceded by a sequence consisting of a digit in the range [1-9] optionally followed by one or more elements which
are not digits or letters.

[ Note: This comparison is identical to the "Charset Alias Matching" algorithm described in the Unicode Technical Standard 22. — end note ]
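
To make the rules above concrete, here is a minimal sketch of such a comparison. It is not the paper's wording or any standard library's implementation: the names comp_name and cursor are hypothetical, and the character classification assumes an ASCII-compatible literal encoding (contiguous a-z and A-Z), which would not hold for an EBCDIC literal encoding.

```cpp
#include <cstddef>
#include <string_view>

namespace sketch {

// Assumption: ASCII-compatible literal encoding, so letters are contiguous.
constexpr bool is_digit(char c) { return c >= '0' && c <= '9'; }
constexpr bool is_upper(char c) { return c >= 'A' && c <= 'Z'; }
constexpr bool is_lower(char c) { return c >= 'a' && c <= 'z'; }
constexpr char to_lower(char c) {
    return is_upper(c) ? static_cast<char>(c - 'A' + 'a') : c;
}

// Hypothetical helper: yields the significant characters of a name one at a
// time, lower-cased, skipping everything the comparison ignores.
struct cursor {
    std::string_view s;
    std::size_t pos = 0;
    bool prev_digit = false;  // was the last *kept* character a digit?

    // Returns the next significant character, or '\0' when exhausted.
    constexpr char next() {
        while (pos < s.size()) {
            const char c = s[pos++];
            if (is_digit(c)) {
                // A '0' is dropped unless a kept digit precedes it; ignored
                // punctuation in between does not reset prev_digit.
                if (c == '0' && !prev_digit) continue;
                prev_digit = true;
                return c;
            }
            if (is_upper(c) || is_lower(c)) {
                prev_digit = false;
                return to_lower(c);
            }
            // Anything that is not a digit or a letter is ignored.
        }
        return '\0';
    }
};

// Compares two encoding names under the rules listed above.
constexpr bool comp_name(std::string_view a, std::string_view b) {
    cursor ca{a}, cb{b};
    for (;;) {
        const char x = ca.next();
        const char y = cb.next();
        if (x != y) return false;
        if (x == '\0') return true;  // both names exhausted together
    }
}

}  // namespace sketch

static_assert(sketch::comp_name("UTF-8", "utf8"), "case and '-' ignored");
static_assert(sketch::comp_name("IBM037", "ibm-37"), "leading zero ignored");
static_assert(!sketch::comp_name("utf-80", "utf-8"), "trailing zero kept");
```

Streaming both names through a cursor avoids allocating normalized copies, which keeps the comparison usable in constant expressions under C++17.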


Who came up with that comparison algorithm? Probably needs a cross-reference
so nobody blames us.

See above.


"string literal character encoding"

That doesn't exist.  Do you want to talk about a literal encoding?
Or something else?

Sure.

Do we expect implementations to accept text_encoding("csIBBM904") and
interpret it as "IBM904"?

Yes, that has been discussed with Hubert; renaming aliases would defeat the purpose of aliases.
 
The postcondition for "text_encoding(id mib)" seems to imply that a
name lookup must be done here. I thought we didn't want that.

No, we added the is_ template functions to avoid the lookup.
 

"is a ntbs" -> "is an NTBS"

"range [name(), strlen(name())+1]"

use monospace font

"implementation-defined (wide) character encoding of the environment"

No italics for "character encoding".  Also, I don't know what that
term means.  We can talk about "execution (wide) character set"
and its encoding, but everything else seems undefined for now.

This entire paragraph now reads

Returns an implementation-defined value representing the wide encoding of the environment.
On a POSIX implementation, this is the wide encoding associated with the POSIX locale
denoted by the empty string "".
[ Note: This function is not affected by calls to setlocale. It is unspecified whether this
function is affected by changes to environment variables during the lifetime of the
program. The encoding represented by the returned value of this function, if any, is not
required to meet the preconditions of all the standard wide character functions. — end
note ]
Recommended practice: Implementations should return a value that represents an encoding
whose code unit size matches the size of a single wchar_t.

 


[text.encoding.aliases]
"model" is in monospace font once; it shouldn't be.

Fixed
 

Regarding the editor's note: We should show a reasonably complete
class definition in the standard.

Tomasz had a different opinion; I'm sure LWG will tell me what to do.
 

static consteval text_encoding literal();
static consteval text_encoding wide_literal();

Use "literal encoding" terms and add a cross-reference
to the core language section where they're defined (by my paper).

Done
 

The normative text needs more notes/examples to show how the
seemingly narrow-only IANA "charsets" are intended to map to
the return values for wide encodings.

See above
 


[text.encoding.comp]
constexpr bool operator==(const text_encoding & a, const text_encoding & b) const noexcept;

The two "return" mentions can go.

Done 


"Remarks: This operator induces an equivalence relation on its arguments
if and only if i != id::other is true."

So, I'm required NOT to offer an equivalence relation if i == id::other?
That doesn't work with the specific "Returns" clause.
Oh, and I don't know what an equivalence relation is on things
of different types.  I think the "remarks" should just go.

Tomasz was adamant this was useful. I'm sure LWG will tell me what to do.
 

"narrow strings" don't exist in the standard. Do you mean
"strings whose elements are of type char"?
Same for "wide strings".

Now reads "The wide execution encoding associated with the locale..." which I think matches library wording. 

Jens