SG16

Subject: Re: New draft revision: D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals)
From: Tom Honermann (tom_at_[hidden])
Date: 2020-06-29 22:52:04


On 6/28/20 2:03 AM, Corentin Jabot wrote:
>
>
> On Sun, 28 Jun 2020 at 07:37, Corentin Jabot <corentinjabot_at_[hidden]
> <mailto:corentinjabot_at_[hidden]>> wrote:
>
>
>
> On Sun, Jun 28, 2020, 06:50 Tom Honermann via SG16
> <sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
>
> A new draft revision of P2029 (Proposed resolution for core
> issues 411, 1656, and 2333; numeric and universal character
> escapes in character and string literals) is now available at
> https://rawgit.com/sg16-unicode/sg16/master/papers/d2029r2.html.
> This addresses the CWG feedback provided during the March
> 23rd, 2020 core issues processing teleconference
> <http://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>.
>
> Wording review feedback prior to the next Core issues
> processing teleconference would be much appreciated!
>
> I really like the overall direction, a few comments:
> - Can we not make conditionally supported escape sequences part of
> the grammar?
>
This was requested by Core in the 2020-01-16 issues processing telecon
<https://wiki.edg.com/bin/view/Wg21prague/IssuesProcessingTeleconference2020-01-16>.
>
> What I would do:
> simple-escape-sequence:
>     any member of the basic source character set other than u, U, x,
> and the members of octal-digit
>
> And in 5.13, keep
> Escape sequences not listed in Table 9 are conditionally supported,
> with implementation-defined semantics
What problem would that solve?
>
> I would also keep
> An escape sequence specifies a single code unit.
The ability for a conditional escape sequence to specify a code unit
sequence was discussed during the 2020-03-23 issues processing telecon
<https://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>. 
Since such sequences are implementation-defined anyway, I don't know of
any reason to prohibit them expanding to multiple code units.  For
sequences that specify a character, whether a single code unit or
multiple code units are encoded should be determined by the character
encoding.
If we want to enforce such a restriction, I think it belongs in
[lex.charset]p3 <http://eel.is/c++draft/lex.charset#3> (I thought we
already had normative wording that requires members of the basic source
character set be encoded as a single code unit, but I don't see it now).
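
As a minimal sketch of the point that the encoding, not the escape
sequence, decides how many code units a character takes, the following
counts code units in UTF-8, UTF-16, and UTF-32 string literals.  The
helper name code_unit_count is hypothetical and exists only for this
illustration; the literals rely on the u8, u, and U prefixes having
fixed encodings.

```cpp
#include <cstddef>

// Hypothetical helper (not from the paper): number of code units in a
// string literal, not counting the terminating null character.
template <typename CharT, std::size_t N>
constexpr std::size_t code_unit_count(const CharT (&)[N]) { return N - 1; }

// The same kind of escape yields a different number of code units
// depending on the character and on the encoding of the literal.
static_assert(code_unit_count(u8"A") == 1, "U+0041: one UTF-8 code unit");
static_assert(code_unit_count(u8"\u00E9") == 2, "U+00E9: two UTF-8 code units");
static_assert(code_unit_count(u8"\U0001F600") == 4, "U+1F600: four UTF-8 code units");
static_assert(code_unit_count(u"\U0001F600") == 2, "U+1F600: a UTF-16 surrogate pair");
static_assert(code_unit_count(U"\U0001F600") == 1, "U+1F600: one UTF-32 code unit");
```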
>
>
>
> - Can we not add notes for stateful encodings? It doesn't add
> anything.
>
Stateful encodings were discussed in the 2020-03-23 issues processing
telecon
<https://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>.

>
> - Wide multi character literals were not a thing, let's not make
> them one now. same for  conditional character literals and
> conditional wide character literals.
>
> Instead, please add text in (Z) to describe them?
> ie:
>
> - ordinary and wide character literals consisting of a single
> basic-c-char, simple-escape-sequence, or universal-character-name
> that specifies a character that either lacks representation in the
> associated character encoding or that cannot be encoded as a
> single code unit
> are conditionally supported and have an implementation-defined value
> - A wide character literal consisting of multiple c-chars is
> conditionally-supported and has an implementation-defined value.
>
Giving these odd literals a name was suggested by Core.  I agree with
their suggested direction; giving them a name makes it easier to discuss
and define them.
>
>
>
> Please change
> The sequence of characters denoted by each contiguous sequence of
> basic-s-chars, r-chars, simple-escape-sequences ([lex.ccon]), and
> universal-character-names ([lex.charset]) is encoded to a code
> unit sequence
> To
> Each basic-s-chars, r-chars, simple-escape-sequences ([lex.ccon]),
> and universal-character-names ([lex.charset]) is encoded to a code
> unit sequence
>
The intent is to make it clear that these sequences are encoded as a
group.  This is necessary for stateful encodings with SI/SO characters
since such characters don't necessarily contribute a code unit sequence
on their own.  This was also requested during the 2020-03-23 issues
processing telecon
<https://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>.
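
To illustrate why grouping matters, here is a rough sketch using a toy
stateful encoding that I made up for this purpose; it is not a real
encoding and not wording from the paper.  In this toy, uppercase
letters stand in for characters that require the shifted state, and the
bytes 0x0E (SO) and 0x0F (SI) switch states.  The function names
encode_group and encode_each are hypothetical.

```cpp
#include <cctype>
#include <string>

// Toy shift bytes for a made-up stateful encoding (assumption for
// illustration only).
constexpr char SO = 0x0E;  // shift out: enter the shifted state
constexpr char SI = 0x0F;  // shift in: return to the initial state

// Encode a whole sequence of characters as one group.  A shift byte is
// emitted only when the required state changes, and the encoder
// returns to the initial state at the end.
std::string encode_group(const std::string& chars) {
    std::string out;
    bool shifted = false;
    for (char c : chars) {
        const bool needs_shift = std::isupper(static_cast<unsigned char>(c)) != 0;
        if (needs_shift != shifted) {
            out.push_back(needs_shift ? SO : SI);
            shifted = needs_shift;
        }
        out.push_back(c);
    }
    if (shifted) out.push_back(SI);
    return out;
}

// Encode each character independently and concatenate the results.
// Every shifted character now carries its own SO/SI pair.
std::string encode_each(const std::string& chars) {
    std::string out;
    for (char c : chars) out += encode_group(std::string(1, c));
    return out;
}
```

For the input "aBCd", encode_group produces a, SO, B, C, SI, d (six
code units), while encode_each produces a, SO, B, SI, SO, C, SI, d
(eight code units).  Both describe the same characters, but the code
unit sequences differ, so the wording has to say which one is meant.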
>
>
>
>
> - please replace applicable character encoding by character encoding
>
That doesn't seem correct to me; the wording needs to indicate which
character encoding.  Note that there are three occurrences of
"applicable associated character encoding"; I'm not sure which use you
were referring to.
>
> - not sure replacing `\0` by null character is an improvement
>
It avoids a correction to state something like, "a '\0', L'\0', u8'\0',
u'\0', or U'\0' is appended ...". [lex.charset]p3
<http://eel.is/c++draft/lex.charset#3> defines /null character/ (though
the definition there isn't perfect either, I think it is an improvement).
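
For reference, a small sketch of the behavior being described: every
kind of string literal gets one extra element whose value is zero in
the literal's own character type.  The helper name ends_with_null is
hypothetical and only serves this illustration.

```cpp
#include <cstddef>

// Hypothetical helper: true if the last element of a string literal's
// array is the value-initialized (zero) code unit of its character type.
template <typename CharT, std::size_t N>
constexpr bool ends_with_null(const CharT (&s)[N]) { return s[N - 1] == CharT(); }

// Three characters plus one appended null character in each case.
static_assert(sizeof("abc") == 4 * sizeof(char), "ordinary string literal");
static_assert(sizeof(L"abc") == 4 * sizeof(wchar_t), "wide string literal");
static_assert(sizeof(u"abc") == 4 * sizeof(char16_t), "UTF-16 string literal");
static_assert(sizeof(U"abc") == 4 * sizeof(char32_t), "UTF-32 string literal");

static_assert(ends_with_null("abc"), "ordinary");
static_assert(ends_with_null(L"abc"), "wide");
static_assert(ends_with_null(u8"abc"), "UTF-8");
static_assert(ends_with_null(u"abc"), "UTF-16");
static_assert(ends_with_null(U"abc"), "UTF-32");
```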

Tom.

>
>
> Corentin
>
> Tom.
>
> --
> SG16 mailing list
> SG16_at_[hidden] <mailto:SG16_at_[hidden]>
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>


