sg16: Re: [SG16] P1885 polling

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Thu, 23 Sep 2021 13:06:46 +0200

On 23/09/2021 11.48, Corentin Jabot wrote:
> Thank you for your feedback Jens,
> https://isocpp.org/files/papers/D1885R8.pdf

More comments:

A note in the paper says:

[ Note: The name and value of each enumerator in the text_encoding::id enum is identical to
those specified in [rfc3808] except for the following modifications:
• the ”cs” prefix is removed from each name
• csUnicode is renamed text_encoding::id::UCS2
• csIBBM904 is renamed text_encoding::id::IBM904 ]

If we have a note referring to RFC 3808, that needs to be in bibliography.

I think what confuses me here big time is the following:

- the primary name and the aliases are taken from the IANA table

- the enumerator names are taken from RFC 3808, which has
meanwhile been superseded by later changes to the IANA table.
In particular, CP50220 doesn't exist in RFC 3808, so what's
the rationale for having it (under this name) in the enumerators?

Shouldn't we just use the IANA table as the source
of the enumerators, considering this sentence from RFC 3808:

"Enum names are derived from the IANA Charset Registry 'Alias'
fields that begin with 'cs' (for character set)."

Suggestion for the note:

"[ Note: The name and value of each enumerator in the text_encoding::id enum is identical to
those specified in [rfc3808] except for the following modifications:"

->

"[ Note: The name of each enumerator of the enumeration text_encoding::id
is derived from the alias of each primary name that begins with "cs",
as follows:"
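
To make the suggested rule concrete, here is a minimal sketch of what
"derive the enumerator name from the cs alias" could look like. The helper
name enumerator_from_alias is hypothetical and not part of the paper; the
two special cases are the ones listed in the paper's note, and everything
else just drops the "cs" prefix.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper (not from P1885): derive an enumerator name from an
// IANA alias that begins with "cs". Returns nullopt for any other alias.
std::optional<std::string> enumerator_from_alias(std::string_view alias) {
    // The two renames called out in the paper's note.
    if (alias == "csUnicode") return std::string("UCS2");
    if (alias == "csIBBM904") return std::string("IBM904");

    // General rule: remove the "cs" prefix and keep the rest verbatim.
    if (alias.size() > 2 && alias.substr(0, 2) == "cs")
        return std::string(alias.substr(2));

    return std::nullopt;  // not a "cs" alias, so no enumerator is derived
}
```

With such a rule, any deviation from the alias (a fixed typo, a changed
capitalization) has to show up as an explicit special case, which is the
kind of rule I am asking the paper to state.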

I'm not seeing a rationale in the prose section of the paper for why
NATS-DANO and NATS-DANO-ADD are excluded from the ids, if everything
else is included.

Typo: "in the the"

"[ Note: This comparison is identical to the ”Charset Alias Matching” algorithm described in the
Unicode Technical Standard 22. — end note ]"

Please add an entry in the bibliography for Unicode Technical Standard 22.

"followed by one or more element"

plural "elements"

> I hope the addition of "recommended practice" sections will resolve the questions both Hubert and you still have.
>
>
> On Thu, Sep 23, 2021 at 8:16 AM Jens Maurer via SG16 <sg16_at_[hidden]> wrote:

>
> > Also, (this is new information to me and I expect to most people as well) the paper's prose points to GCC's -fwide-exec-charset option, which really only works if the option specifies a correctly-sized wide encoding that iconv recognizes.
> >
> > Observe:
> > $ gcc -fwide-exec-charset=ISO8859-1 -fsyntax-only -xc++ -<<<$'extern char x[L\'0\'], x[0x30];'
> > <stdin>:1:28: error: conflicting declaration 'char x [48]'
> > <stdin>:1:13: note: previous declaration as 'char x [805306368]'
> >
> > So, insofar as the example in the prose is concerned, there would need to be an iconv name for the appropriately-sized wide EBCDIC encoding.
>
> I am confused.
>
> The prose text says:
>
> "Note: Because they have different code units sizes, narrow and wide strings have
> different encodings."
>
> I'd thus expect different enum ids for wide and narrow strings in the list,
> but the use of
>
> g++ -fwide-exec-charset=EBCDIC-US
> [...]
> Wide Literal Encoding: EBCDIC-US (iana mib: 2078)
>
> in the example is contrary to that statement, assuming that
> EBCDIC-US is generally an 8-bit encoding, not a wide encoding.
>
>
> Yes, this example, while conforming would not be a recommended practice,
> I did modify it.

The net result (i.e. essentially recommending the funny x- prefixes)
doesn't feel like an improvement to me.

> Then, we have
>
> "Identifying Encodings
>
> [...]
>
> Fortunately there exist a database of registered encoding
> covering almost all encodings supported by operating systems and compilers. This database
> is maintained by IANA through a process described by [rfc2978].
> This database lists over 250 registered character sets and for each:"
>
> This sentence moves from the goal of talking about "encodings" to
> the term "character set" without any further explanation.
> This should describe that IANA / the RFC calls a "character set" what
> we believe is an "encoding".
>
>
>
> An encoding maps directly to a character set, the reverse is not true.

Fine, but I'd still suggest tweaking the prose such that the
mental transition from encoding to character set is explicit.
I suspect we want to say here that the IANA tables claim to
discuss character sets, but actually present encodings,
according to Unicode parlance.

> "IANA Character Sets registry" needs a reference to the RFC
> establishing that registry. I think we can get away with adding
> that reference to the bibliography (not the normative references).
>
>
> There was a reference already - I did add a date.
> The reference is what I believe to be the primary reference of interest, it itself refers to a few more documents.

Ok, fine.

> https://www.iana.org/assignments/character-sets/character-sets.xhtml
> Do you think the standard needs to refer to everything directly?
> Hubert observed a few months ago that IANA took over the original RFCs

So, maybe the right approach is to not refer to RFC 3808 at all.

> "implementation-defined snapshot" conflicts with
> "Each known registered-character-set is identified by an enumerator in text_encoding::id"
>
> It's unclear whether an implementation is supposed to add enumerators on its own,
> or not. (Personally, I think due to the low change frequency of the list,
> we should just maintain the master copy of the enumerators in the standard,
> which would also allow us to fix the typos and inconsistencies.
>
>
> We have been over this a few times, Hubert was adamant snapshot was useful.

Ok, so this needs more discussion, then.
(Maybe we want the enumerators to be specified by the standard,
but mib() can return values not represented by a specified
enumerator, if the implementation uses a later version of the IANA
table.)
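
A rough sketch of that alternative, under stated assumptions: the
enumeration below lists only a handful of entries for illustration (the
values are the MIBenum numbers; 2078 for EBCDIC-US is the one quoted in the
paper's example), and the helpers from_registry and has_enumerator are
hypothetical names, not anything proposed in the paper.

```cpp
#include <cstdint>

// Illustrative excerpt only; the real list would be specified in full.
enum class id : std::int_least32_t {
    other = 1,
    unknown = 2,
    ASCII = 3,
    UTF8 = 106,
    EBCDICUS = 2078
};

// Hypothetical: an implementation with a newer IANA snapshot may yield a
// MIBenum value for which the standard specifies no enumerator. Because the
// underlying type is fixed, any such value is still a valid value of id.
constexpr id from_registry(std::int_least32_t mibenum) {
    return static_cast<id>(mibenum);
}

// True only for values that correspond to a specified enumerator.
constexpr bool has_enumerator(id v) {
    switch (v) {
        case id::other:
        case id::unknown:
        case id::ASCII:
        case id::UTF8:
        case id::EBCDICUS:
            return true;
    }
    return false;  // value from a later registry version
}
```

Here 9999 would stand in for a made-up future registration:
has_enumerator(from_registry(9999)) is false, yet the value still
round-trips through id unchanged.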

> Oh, we do fix some of the typos. Can we consistently spell "Windows"
> with an uppercase "W", please?)
>
>
> This has been discussed.

I thought the discussion said we'd use the RFC 3808-specified names
verbatim. Yet, we fix the IBBM typo. Why can't we fix the
"windows" typo, too?

> Do you want SG-16/LEWG to reopen those discussions?

Yes. I'd like to know what the rule is for deriving enumerator names
from the sources.

> Now reads
>
> Let bool COMP_NAME(string_view a, string_view b) be a function that returns true if the two
> strings a and b encoded in the literal encoding are equal ignoring, from left-to-right,
> • all elements not in the basic character set,

cross-reference to the core language section, please

What are "elements" here? Do we mean code points?
(This means reconstructing a code point from a multi-byte
encoding, e.g. one where @ is mapped to two code units.
I think that's what we want, absent declaring a general
restriction on encoding names to be from the basic character
set only, which seems eminently reasonable.)
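
For illustration, here is a minimal sketch of the comparison under exactly
that restriction: names consist of basic-character-set characters only, one
code unit each, in an ASCII-compatible literal encoding. It follows my
reading of the "Charset Alias Matching" steps in UTS 22 (drop everything
that is not a letter or digit, fold case, drop each 0 not preceded by a
digit), not the paper's exact wording. The names normalize and comp_name
are mine.

```cpp
#include <string>
#include <string_view>

// Assumption: ASCII-compatible encoding, so letters and digits are
// contiguous ranges and each character is a single code unit.
constexpr bool is_digit(char c) { return c >= '0' && c <= '9'; }
constexpr bool is_alpha(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
constexpr char to_lower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Reduce a name to the form that is compared.
std::string normalize(std::string_view name) {
    std::string out;
    for (char c : name) {
        // Step 1: ignore everything that is not alphanumeric.
        if (!is_alpha(c) && !is_digit(c)) continue;
        // Step 3: ignore a 0 unless the previously kept character is a digit.
        const bool after_digit = !out.empty() && is_digit(out.back());
        if (c == '0' && !after_digit) continue;
        // Step 2: compare case-insensitively.
        out.push_back(to_lower(c));
    }
    return out;
}

bool comp_name(std::string_view a, std::string_view b) {
    return normalize(a) == normalize(b);
}
```

Under this sketch "UTF-8", "utf8" and "utf-08" all compare equal, while
"UTF-80" does not match them. Without the restriction to the basic
character set, the loop over code units would have to become a loop over
decoded code points, which is the complication I am asking about.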

> Do we expect implementations to accept text_encoding("csIBBM904") and
> interpret it as "IBM904"?
>
>
> Yes, that has been discussed with Hubert, renaming aliases would defeat the purposes of aliases

Fine.

>
> The postcondition for "text_encoding(id mib)" seems to imply that a
> name lookup must be done here. I thought we didn't want that.
>
>
> No, we added the is_ templates functions to avoid the lookup

For the environment, agreed.
For the literals, it's consteval, which should cover that.

Do we have implementation experience showing that we can avoid the string
table in the executable if we call literal()?

Jens

Received on 2021-09-23 06:06:51