Date: Thu, 23 Sep 2021 13:34:59 +0200
On Thu, Sep 23, 2021 at 1:06 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> On 23/09/2021 11.48, Corentin Jabot wrote:
> > Thank you for your feedback Jens,
> > https://isocpp.org/files/papers/D1885R8.pdf
>
> More comments:
>
> A note in the paper says:
>
> [ Note: The name and value of each enumerator in the text_encoding::id
> enum is identical to
> those specified in [rfc3808] except for the following modifications:
> • the ”cs” prefix is removed from each name
> • csUnicode is renamed text_encoding::id::UCS2
> • csIBBM904 is renamed text_encoding::id::IBM904 ]
>
> If we have a note referring to RFC 3808, that needs to be in bibliography.
>
You are right, I'll update that reference; it should point to IANA now.
>
> I think what confuses me here big time is the following:
>
> - the primary name and the aliases are taken from the IANA table
>
> - the enumerator names are taken from RFC 3808, which is
> meanwhile obsoleted by the progress in the IANA table.
> In particular, CP50220 doesn't exist in RFC 3808, so what's
> the rationale for having it (under this name) in the enumerators?
>
> Shouldn't we just use the IANA table as the source
> of the enumerators, considering this sentence from RFC 3808:
> "Enum names are derived from the IANA Charset Registry 'Alias'
> fields that begin with 'cs' (for character set)."
>
> Suggestion for the note:
>
> "[ Note: The name and value of each enumerator in the text_encoding::id
> enum is identical to
> those specified in [rfc3808] except for the following modifications:"
>
> ->
>
> "[ Note: The name of each enumerator of the enumeration text_encoding::id
> is derived from the alias of each primary name that begins with "cs",
> as follows:"
>
>
Agreed
>
> I'm not seeing rationale in the prose section of the paper why
> NATS-DANO and NATS-DANO-ADD are excluded from the ids, if everything
> else is included.
>
I'll add some prose
>
>
> Typo: "in the the"
>
>
> "[ Note: This comparison is identical to the ”Charset Alias Matching”
> algorithm described in the
> Unicode Technical Standard 22. — end note ]"
>
> Please add an entry in the bibliography for that Technical Standard 22.
>
Done
>
>
> "followed by one or more element"
>
> plural "elements"
>
>
>
> > I hope the addition of "recommended practice" sections will resolve the
> > questions both Hubert and you still have.
> >
> >
> > On Thu, Sep 23, 2021 at 8:16 AM Jens Maurer via SG16 <sg16_at_[hidden]> wrote:
>
> >
> > > Also (this is new information to me and, I expect, to most people
> > > as well), the paper's prose points to GCC's -fwide-exec-charset option,
> > > which really only works if the option specifies a correctly-sized wide
> > > encoding that iconv recognizes.
> > >
> > > Observe:
> > > $ gcc -fwide-exec-charset=ISO8859-1 -fsyntax-only -xc++ -<<<$'extern char x[L\'0\'], x[0x30];'
> > > <stdin>:1:28: error: conflicting declaration 'char x [48]'
> > > <stdin>:1:13: note: previous declaration as 'char x [805306368]'
> > >
> > > So, insofar as the example in the prose is concerned, there would
> > > need to be an iconv name for the appropriately-sized wide EBCDIC encoding.
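
The array bounds in the GCC diagnostics above can be checked by hand. The sketch below only verifies the arithmetic: the bound reported for x[L'0'] is the ISO 8859-1 code unit for '0' (0x30) shifted left by 24 bits. That is consistent with the single narrow byte ending up in the most significant byte of a 32-bit wchar_t, but that explanation is an assumption, as the thread does not state why GCC produces this value. The constant names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Code unit for '0' in ISO 8859-1 (same as ASCII); this is the
// bound GCC reported for x[0x30], i.e. 'char x [48]'.
constexpr std::uint32_t narrow_zero = 0x30;

// Bound GCC reported for x[L'0'] in the transcript above.
constexpr std::uint32_t reported_bound = 805306368;

// The reported bound is exactly 0x30 placed in the top byte of a
// 32-bit value; the two declarations therefore conflict.
static_assert(narrow_zero == 48);
static_assert(reported_bound == 0x30000000u);
static_assert(reported_bound == (narrow_zero << 24));
```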
> >
> > I am confused.
> >
> > The prose text says:
> >
> > "Note: Because they have different code units sizes, narrow and wide
> strings have
> > different encodings."
> >
> > I'd thus expect different enum ids for wide and narrow strings in the list,
> > but the use of
> >
> > g++ -fwide-exec-charset=EBCDIC-US
> > [...]
> > Wide Literal Encoding: EBCDIC-US (iana mib: 2078)
> >
> > in the example is contrary to that statement, assuming that
> > EBCDIC-US is generally an 8-bit encoding, not a wide encoding.
> >
> >
> > Yes, this example, while conforming, would not be recommended practice;
> > I did modify it.
>
> The net result (i.e. essentially recommending the funny x- prefixes)
> doesn't feel like an improvement to me.
>
> > Then, we have
> >
> > "Identifying Encodings
> >
> > [...]
> >
> > Fortunately there exist a database of registered encoding
> > covering almost all encodings supported by operating systems and compilers.
> > This database
> > is maintained by IANA through a process described by [rfc2978].
> > This database lists over 250 registered character sets and for each:"
> >
> > This sentence moves from the goal of talking about "encodings" to
> > the term "character set" without any further explanation.
> > This should describe that IANA / the RFC calls a "character set" what
> > we believe is an "encoding".
> >
> >
> >
> > An encoding maps directly to a character set, the reverse is not true.
>
> Fine, but I'd still suggest tweaking the prose such that the
> mental transition from encoding to character set is explicit.
> I suspect we want to say here that the IANA tables claim to
> discuss character sets, but actually present encodings,
> according to Unicode parlance.
>
Yup
>
> > "IANA Character Sets registry" needs a reference to the RFC
> > establishing that registry. I think we can get away with adding
> > that reference to the bibliography (not the normative references).
> >
> >
> > There was a reference already - I did add a date.
> > The reference is what I believe to be the primary reference of interest;
> > it itself refers to a few more documents.
>
> Ok, fine.
>
> > https://www.iana.org/assignments/character-sets/character-sets.xhtml
> > Do you think the standard needs to refer to everything directly?
> > Hubert observed a few months ago that IANA took over the original RFCs.
>
> So, maybe the right approach is to not refer to RFC 3808 at all.
>
Yup
>
> > "implementation-defined snapshot" conflicts with
> > "Each known registered-character-set is identified by an enumerator
> in text_encoding::id"
> >
> > It's unclear whether an implementation is supposed to add enumerators
> > on its own, or not. (Personally, I think due to the low change frequency
> > of the list, we should just maintain the master copy of the enumerators
> > in the standard, which would also allow us to fix the typos and
> > inconsistencies.
> >
> >
> > We have been over this a few times; Hubert was adamant that a snapshot
> > was useful.
>
> Ok, so this needs more discussion, then.
> (Maybe we want the enumerators to be specified by the standard,
> but mib() can return values not represented by a specified
> enumerator, if the implementation uses a later version of the IANA
> table.)
>
> > Oh, we do fix some of the typos. Can we consistently spell "Windows"
> > with an uppercase "W", please?)
> >
> >
> > This has been discussed.
>
> I thought the discussion said we'd use the RFC 3808-specified names
> verbatim. Yet, we fix the IBBM typo. Why can't we fix the
> "windows" typo, too?
>
> > Do you want SG-16/LEWG to reopen those discussions?
>
> Yes. I'd like to know what the rule is for deriving enumerator names
> from the sources.
> > Now reads
> >
> > Let bool COMP_NAME(string_view a, string_view b) be a function that returns true if the two
> > strings a and b encoded in the literal encoding are equal ignoring, from left-to-right,
> > • all elements not in the basic character set,
>
> cross-reference to the core language section, please
>
> What are "elements" here? Do we mean code points?
> (This means reconstructing a code point from a multi-byte
> encoding, e.g. one where @ is mapped to two code units.
> I think that's what we want, absent declaring a general
> restriction on encoding names to be from the basic character
> set only, which seems eminently reasonable.)
>
I don't think it matters either way.
The characters we care about are always a single code unit.
I'd rather implementers not have to do fancy transcoding here.
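
For reference, a minimal sketch of such a comparison working purely on code units, with no transcoding. It follows the three steps of the "Charset Alias Matching" algorithm in UTS #22 (drop everything except letters and digits, ignore case, drop each 0 not preceded by a digit) rather than the exact wording in the paper, which is only partially quoted above. It assumes an ASCII-compatible literal encoding (the range checks would be wrong for EBCDIC), and the names comp_name and next_significant are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// ASSUMPTION: ASCII-compatible literal encoding, so letters and digits
// form contiguous ranges and each is a single code unit.
constexpr bool is_ascii_digit(char c) { return c >= '0' && c <= '9'; }
constexpr bool is_ascii_alpha(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
constexpr char to_ascii_lower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Returns the next code unit of s that takes part in the comparison,
// lowercased, advancing i past it. Returns '\0' once s is exhausted.
// prev_digit records whether the last kept code unit was a digit.
constexpr char next_significant(std::string_view s, std::size_t& i,
                                bool& prev_digit) {
    while (i < s.size()) {
        const char c = s[i++];
        if (is_ascii_digit(c)) {
            if (c == '0' && !prev_digit) continue;  // leading zero: skip
            prev_digit = true;
            return c;
        }
        if (is_ascii_alpha(c)) {
            prev_digit = false;
            return to_ascii_lower(c);
        }
        // Any other code unit is ignored and does not affect prev_digit.
    }
    return '\0';
}

// Hypothetical stand-in for COMP_NAME: true if a and b match under the
// UTS #22 alias-matching rules.
constexpr bool comp_name(std::string_view a, std::string_view b) {
    std::size_t i = 0, j = 0;
    bool digit_a = false, digit_b = false;
    for (;;) {
        const char ca = next_significant(a, i, digit_a);
        const char cb = next_significant(b, j, digit_b);
        if (ca != cb) return false;
        if (ca == '\0') return true;  // both exhausted together
    }
}

static_assert(comp_name("UTF-8", "utf8"));
static_assert(comp_name("u.t.f-008", "UTF-8"));
static_assert(!comp_name("utf-80", "utf8"));
static_assert(!comp_name("ut8", "utf8"));
```

Walking both strings in lockstep avoids building normalized copies, which keeps the function usable in constant evaluation.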
> > Do we expect implementations to accept text_encoding("csIBBM904") and
> > interpret it as "IBM904"?
> >
> >
> > Yes, that has been discussed with Hubert; renaming aliases would defeat
> > the purpose of aliases.
>
> Fine.
>
> >
> > The postcondition for "text_encoding(id mib)" seems to imply that a
> > name lookup must be done here. I thought we didn't want that.
> >
> >
> > No, we added the is_ function templates to avoid the lookup
>
> For the environment, agreed.
> For the literals, it's consteval, which should cover that.
>
> Do we have implementation experience that we can avoid the string
> table in the executable if we call literal()?
Yes!
> Jens
>
Received on 2021-09-23 06:35:19