Re: [SG16] New draft revision: D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals)

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 30 Jun 2020 00:49:13 -0400
On 6/30/20 12:15 AM, Corentin Jabot wrote:
> On Tue, Jun 30, 2020, 05:52 Tom Honermann <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
> On 6/28/20 2:03 AM, Corentin Jabot wrote:
>> On Sun, 28 Jun 2020 at 07:37, Corentin Jabot
>> <corentinjabot_at_[hidden] <mailto:corentinjabot_at_[hidden]>> wrote:
>> On Sun, Jun 28, 2020, 06:50 Tom Honermann via SG16
>> <sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
>> A new draft revision of P2029 (Proposed resolution for
>> core issues 411, 1656, and 2333; numeric and universal
>> character escapes in character and string literals) is
>> now available at
>> https://rawgit.com/sg16-unicode/sg16/master/papers/d2029r2.html.
>> This addresses the CWG feedback provided during the March
>> 23rd, 2020 core issues processing teleconference
>> <http://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>.
>> Wording review feedback prior to the next Core issues
>> processing teleconference would be much appreciated!
>> I really like the overall direction, a few comments:
>> - Can we not make conditionally supported escape sequences
>> part of the grammar?
> This was requested by Core in the 2020-01-16 issues processing
> telecon
> <https://wiki.edg.com/bin/view/Wg21prague/IssuesProcessingTeleconference2020-01-16>.
>> What I would do:
>> simple-escape-sequence:
>> any member of the basic source character set other than u, U,
>> x, and the members of octal-digit
>> And in 5.13, keep
>> Escape sequences not listed in Table 9 are conditionally
>> supported, with implementation-defined semantics
> What problem would that solve?
> Not having a separate grammar for non-standard features; a simpler grammar.
I prefer the current approach in the paper, but I have no objection to
doing what you suggest if the CWG expresses such a preference.
>> - Can we not add notes for stateful encodings? It doesn't add
>> anything.
> Stateful encodings were discussed in the 2020-03-23 issues
> processing telecon
> <https://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>.
> Sure, it is still a level of detail that doesn't add anything. I would
> like to avoid people in 30 years wondering why these sentences are
> here.
Stateful encodings are still a thing. They may still be a thing in 30 years.
>> - Wide multi character literals were not a thing, let's not
>> make them one now. Same for conditional character literals
>> and conditional wide character literals.
>> Instead, please add text in (Z) to describe them?
>> i.e.:
>> - An ordinary or wide character literal consisting of a single
>> basic-c-char, simple-escape-sequence, or
>> universal-character-name that specifies a character that
>> either lacks representation in the associated character
>> encoding or that cannot be encoded as a single code unit
>> is conditionally supported and has an
>> implementation-defined value
>> - A wide character literal consisting of multiple c-chars is
>> conditionally-supported and has an implementation-defined value.
> Giving these odd literals a name was suggested by Core. I agree
> with their suggested direction; giving them a name makes it easier
> to discuss and define them.
> No, especially wide multi character literals, which are simply not a
> thing; let's not make them one. The reason multi character literals
> exist and have a name is that their type is different from that of
> character literals.
They are a thing in C (see WG14 N2176 (the final draft WP before C18),
"Character constants", p11). I believe their omission in C++ is just an
oversight. Compilers support them. I think they are a thing and giving
them a name is useful.
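
To make the type difference mentioned above concrete, here is a minimal sketch. The variable names are made up for illustration. It checks only the ordinary (narrow) case, where a literal with more than one c-char is conditionally supported and, where accepted, has type int rather than char. Compilers commonly warn about such literals (GCC has -Wmultichar, for example).

```cpp
#include <type_traits>

// One c-char: an ordinary character literal, which has type char.
constexpr bool single_is_char =
    std::is_same<decltype('a'), char>::value;

// Two c-chars: an ordinary multicharacter literal. It is conditionally
// supported; where it is accepted it has type int and an
// implementation-defined value, so only the type is checked here.
constexpr bool multi_is_int =
    std::is_same<decltype('ab'), int>::value;

// The wide counterpart (L'ab') is deliberately not compiled in this
// sketch: whether it is accepted, and with what value, varies between
// implementations, which is the point under discussion.
```

Only the type is asserted; the value of 'ab' is left alone because it is implementation-defined.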
> Should I send a mail to core? Because I really do not like that
> direction. (Especially as what you call a wide multi character literal
> doesn't behave at all like multi character literals.) We should also
> look at making them ill-formed rather than giving them a name.

Arguably, you have already sent that mail to Core :)

I don't know what behavioral difference you are concerned about. The
primary reason for differentiating them is to allow the multicharacter
case to be ill-formed (conditionally-supported) and/or to have an
encoding that differs from single c-char literals.

I think the standard should reflect existing practice. These odd
literals are supported in common compilers. If you would like to make
them ill-formed, you are certainly free to write a paper, but
implementations are already free to make them ill-formed and I suspect
the ones that don't would retain support for them as an extension anyway.

>> Please change
>> The sequence of characters denoted by each contiguous
>> sequence of basic-s-chars, r-chars, simple-escape-sequences
>> ([lex.ccon]), and universal-character-names ([lex.charset])
>> is encoded to a code unit sequence
>> To
>> Each basic-s-char, r-char, simple-escape-sequence
>> ([lex.ccon]), and universal-character-name ([lex.charset])
>> is encoded to a code unit sequence
> The intent is to make it clear that these sequences are encoded as
> a group. This is necessary for stateful encodings with SI/SO
> characters since such characters don't necessarily contribute a
> code unit sequence on their own. This was also requested during
> the 2020-03-23 issues processing telecon
> <https://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>.
> The effect is that I can encode things like e,U+0301 as a single code
> unit, which at the very least should not be allowed in a wording change.
Please read the wording again. I don't think it states that. If you
still think it does, please elaborate in detail.
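
As a point of reference for the e,U+0301 concern, the following sketch looks at UTF-32 and UTF-8 literals, whose encodings are fixed, and counts code units. The names are hypothetical. It says nothing directly about ordinary or wide literals, whose encoding is implementation-defined; it only shows that in these encodings the two characters stay two characters and are not composed into U+00E9.

```cpp
#include <cstddef>

// U+0065 followed by U+0301 (combining acute accent), written with a UCN.
constexpr char32_t decomposed[] = U"e\u0301";
// The precomposed character U+00E9, for comparison.
constexpr char32_t precomposed[] = U"\u00E9";

// Code unit counts, excluding the terminating null.
constexpr std::size_t decomposed_units =
    sizeof(decomposed) / sizeof(char32_t) - 1;
constexpr std::size_t precomposed_units =
    sizeof(precomposed) / sizeof(char32_t) - 1;

// In UTF-8, "e" is one code unit and U+0301 is two (0xCC 0x81).
constexpr std::size_t decomposed_utf8_units = sizeof(u8"e\u0301") - 1;
```

In UTF-32 the decomposed spelling yields two code units and the precomposed one yields one; the UTF-8 spelling of the decomposed form yields three.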
> It's also a terrible reason as c-chars and UCNs are Unicode characters
> at this point and cannot correspond to a stateful character as the
> source of the conversion. The thing they are converted to being an
> implementation-defined sequence of code units, the possibility of a
> state shift is implied.

What are you referring to as a "terrible reason"?

SI/SO characters exist in Unicode and can therefore be represented as
UCNs. In translation phase 5, an implementation can treat them as part
of a shift sequence when converting to the execution encoding.
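
A small sketch of the first half of that statement, with made-up names: SHIFT OUT is U+000E and SHIFT IN is U+000F, and both can be spelled as UCNs inside a literal. The example uses UTF-32 literals, where they are plain code units. How an implementation with a stateful execution encoding would treat them is implementation-defined and is not shown.

```cpp
#include <cstddef>

// SHIFT OUT (U+000E) and SHIFT IN (U+000F) written as UCNs. UCNs naming
// control characters are permitted inside character and string literals.
constexpr char32_t shift_out = U'\u000E';
constexpr char32_t shift_in  = U'\u000F';

// A string with both embedded. In UTF-32 each one is an ordinary code unit;
// a stateful execution encoding could instead treat them as part of a
// shift sequence when the literal is converted.
constexpr char32_t text[] = U"a\u000Eb\u000Fc";
constexpr std::size_t text_units = sizeof(text) / sizeof(char32_t) - 1;
```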

>> - please replace applicable character encoding by character
>> encoding
> That doesn't seem correct to me; the wording needs to indicate
> which character encoding. Note that there are three occurrences
> of "applicable associated character encoding"; I'm not sure which
> use you were referring to.
> Missed a word. Sorry. Meant associated character encoding. "Applicable
> associated" doesn't add anything. Maybe "the literal's associated
> encoding".

That says the same thing to me. If CWG expresses a preference, I'll
change it.


Received on 2020-06-29 23:52:28