I prefer the current approach in the paper, but I have no objection to doing what you suggest if the CWG expresses such a preference.
On Tue, Jun 30, 2020, 05:52 Tom Honermann <tom@honermann.net> wrote:
On 6/28/20 2:03 AM, Corentin Jabot wrote:
This was requested by Core in the 2020-01-16 issues processing telecon.
On Sun, 28 Jun 2020 at 07:37, Corentin Jabot <corentinjabot@gmail.com> wrote:
On Sun, Jun 28, 2020, 06:50 Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:
A new draft revision of P2029 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals) is now available at https://rawgit.com/sg16-unicode/sg16/master/papers/d2029r2.html. This addresses the CWG feedback provided during the March 23rd, 2020 core issues processing teleconference.
Wording review feedback prior to the next Core issues processing teleconference would be much appreciated!
I really like the overall direction, a few comments:
- Can we not make conditionally supported escape sequences part of the grammar?
What problem would that solve?
What I would do:
simple-escape-sequence:
    any member of the basic source character set other than u, U, x, and the members of octal-digit
And in 5.13, keep:
Escape sequences not listed in Table 9 are conditionally supported, with implementation-defined semantics
The benefit: no separate grammar for non-standard features, and a simpler grammar overall.
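As a concrete case of an escape sequence outside Table 9, here is a minimal sketch. It assumes GCC or Clang, which to my knowledge accept \e as an extension denoting ESC (0x1B) in an ASCII-compatible execution encoding. The names kEsc and kUsesExtension are hypothetical and do not come from the paper.

```cpp
// Sketch only: '\e' is not one of the standard simple-escape-sequences, so it
// is conditionally-supported with implementation-defined semantics.
#if defined(__GNUC__)  // also defined by Clang
// Assumption: GCC and Clang treat '\e' as ESC (0x1B); -pedantic may warn here.
constexpr char kEsc = '\e';
constexpr bool kUsesExtension = true;
#else
// Portable spelling of the same character via a hexadecimal escape.
constexpr char kEsc = '\x1B';
constexpr bool kUsesExtension = false;
#endif

static_assert(kEsc == 0x1B, "expected ESC in an ASCII-compatible encoding");
```

Whether such a sequence gets its own grammar term or is only covered by the sentence in 5.13, an implementation remains free to reject it.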
Stateful encodings are still a thing. They may still be a thing in 30 years.
Stateful encodings were discussed in the 2020-03-23 issues processing telecon.
- Can we not add notes for stateful encodings? It doesn't add anything.
Sure, it is still a level of detail that doesn't add anything. I would like to avoid people in 30 years wondering why these sentences are here.
They are a thing in C (see WG14 N2176 (the final draft WP before C18) 6.4.4.4, "Character constants", p11). I believe their omission in C++ is just an oversight. Compilers support them. I think they are a thing and giving them a name is useful.
Giving these odd literals a name was suggested by Core. I agree with their suggested direction; giving them a name makes it easier to discuss and define them.
- Wide multi character literals were not a thing, let's not make them one now. Same for conditional character literals and conditional wide character literals.
Instead, please add text in (Z) to describe them?
ie:
- Ordinary and wide character literals consisting of a single basic-c-char, simple-escape-sequence, or universal-character-name that specifies a character that either lacks representation in the associated character encoding or that cannot be encoded as a single code unit are conditionally supported and have an implementation-defined value.
- A wide character literal consisting of multiple c-chars is conditionally-supported and has an implementation-defined value.
No, especially wide multi characters that are simply not a thing, let's not make them one. The reason multi character literals exist and have a name is that their type is different from that of character literals.
Should I send a mail to Core? Because I really do not like that direction. (Especially as what you call wide multi character literal doesn't behave at all like multi character literals). We should also look at making them ill formed rather than giving them a name.
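To illustrate the type difference mentioned here, a minimal sketch follows. It assumes a compiler that accepts ordinary multicharacter literals (they are conditionally-supported, and GCC and Clang typically warn about them). The helper multichar_is_at_least_as_wide is hypothetical.

```cpp
#include <type_traits>

// An ordinary character literal with one c-char has type char.
static_assert(std::is_same<decltype('a'), char>::value, "'a' is a char");

// An ordinary multicharacter literal is conditionally-supported, has type int,
// and has an implementation-defined value, so the value is not checked here.
static_assert(std::is_same<decltype('ab'), int>::value, "'ab' is an int");

// A wide character literal with one c-char has type wchar_t.
static_assert(std::is_same<decltype(L'a'), wchar_t>::value, "L'a' is a wchar_t");

// Hypothetical helper exposing the size relationship between the two types.
constexpr bool multichar_is_at_least_as_wide() {
    return sizeof('ab') >= sizeof('a');
}
```

A wide character literal with several c-chars keeps type wchar_t, so unlike the ordinary case there is no change of type to distinguish it.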
Arguably, you have already sent that mail to Core :)
I don't know what behavioral difference you are concerned about.
The primary reason for differentiating them is to allow the
multicharacter case to be ill-formed (conditionally-supported)
and/or to have an encoding that differs from single c-char
literals.
I think the standard should reflect existing practice. These odd
literals are supported in common compilers. If you would like to
make them ill-formed, you are certainly free to write a paper, but
implementations are already free to make them ill-formed and I
suspect the ones that don't would retain support for them as an
extension anyway.
Please read the wording again. I don't think it states that. If you still think it does, please elaborate in detail.
The intent is to make it clear that these sequences are encoded as a group. This is necessary for stateful encodings with SI/SO characters since such characters don't necessarily contribute a code unit sequence on their own. This was also requested during the 2020-03-23 issues processing telecon.
Please change
The sequence of characters denoted by each contiguous sequence of basic-s-chars, r-chars, simple-escape-sequences ([lex.ccon]), and universal-character-names ([lex.charset]) is encoded to a code unit sequence
To
Each basic-s-chars, r-chars, simple-escape-sequences ([lex.ccon]), and universal-character-names ([lex.charset]) is encoded to a code unit sequence
The effect is that I can encode things like e,U+0301 as a single code unit, which at the very least should not be allowed in a wording change.
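For reference, a small sketch of what per-character encoding gives for this example. It uses u8 and u literals only because their encodings are fixed (UTF-8 and UTF-16); ordinary and wide literals, which the wording is about, depend on the implementation. The helper code_units is hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// 'e' followed by U+0301 (combining acute accent): encoded character by
// character, UTF-8 yields 1 + 2 code units (0x65, 0xCC 0x81).
static_assert(code_units(u8"e\u0301") == 3, "e + U+0301 is 3 UTF-8 code units");

// The precomposed U+00E9 is a different character: 2 UTF-8 code units.
static_assert(code_units(u8"\u00E9") == 2, "U+00E9 is 2 UTF-8 code units");

// In UTF-16, 'e' and U+0301 are one code unit each.
static_assert(code_units(u"e\u0301") == 2, "e + U+0301 is 2 UTF-16 code units");
```

The concern above is that wording which encodes a contiguous sequence as a group could be read as permitting an implementation to map the first sequence to a single code unit in some execution encoding.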
It's also a terrible reason, as c-chars and UCNs are Unicode characters at this point and cannot correspond to a stateful character as the source of the conversion. Since the thing they are converted to is an implementation-defined sequence of code units, the possibility of a state shift is implied.
What are you referring to as a "terrible reason"?
SI/SO characters exist in Unicode and can therefore be
represented as UCNs. In translation phase 5, an implementation
can treat them as part of a shift sequence when converting to the
execution encoding.
That doesn't seem correct to me; the wording needs to indicate which character encoding. Note that there are three occurrences of "applicable associated character encoding"; I'm not sure which use you were referring to.
- please replace applicable character encoding by character encoding
Missed a word. Sorry. Meant associated character encoding. "Applicable associated" doesn't add anything. Maybe "the literal associated encoding".
That says the same thing to me. If CWG expresses a preference, I'll change it.
Tom.