SG16

Subject: Re: New draft revision: D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals)
From: Tom Honermann (tom_at_[hidden])
Date: 2020-06-29 23:49:13


On 6/30/20 12:15 AM, Corentin Jabot wrote:
>
>
> On Tue, Jun 30, 2020, 05:52 Tom Honermann <tom_at_[hidden]> wrote:
>
> On 6/28/20 2:03 AM, Corentin Jabot wrote:
>>
>>
>> On Sun, 28 Jun 2020 at 07:37, Corentin Jabot
>> <corentinjabot_at_[hidden]> wrote:
>>
>>
>>
>> On Sun, Jun 28, 2020, 06:50 Tom Honermann via SG16
>> <sg16_at_[hidden]> wrote:
>>
>> A new draft revision of P2029 (Proposed resolution for
>> core issues 411, 1656, and 2333; numeric and universal
>> character escapes in character and string literals) is
>> now available at
>> https://rawgit.com/sg16-unicode/sg16/master/papers/d2029r2.html.
>> This addresses the CWG feedback provided during the March
>> 23rd, 2020 core issues processing teleconference
>> <http://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>.
>>
>> Wording review feedback prior to the next Core issues
>> processing teleconference would be much appreciated!
>>
>> I really like the overall direction, a few comments:
>> - Can we not make conditionally supported escape sequences
>> part of the grammar?
>>
> This was requested by Core in the 2020-01-16 issues processing
> telecon
> <https://wiki.edg.com/bin/view/Wg21prague/IssuesProcessingTeleconference2020-01-16>.
>>
>> What I would do:
>> simple-escape-sequence:
>>     any member of the basic source character set other than u, U,
>> x, and the members of octal-digit
>>
>> And in 5.13, keep
>> Escape sequences not listed in Table 9 are conditionally
>> supported, with implementation-defined semantics
> What problem would that solve?
>
>
> Not having a separate grammar for non-standard features; a simpler grammar.
I prefer the current approach in the paper, but I have no objection to
doing what you suggest if the CWG expresses such a preference.
>
>
>> - Can we not add notes for stateful encodings? It doesn't add
>> anything.
>>
> Stateful encodings were discussed in the 2020-03-23 issues
> processing telecon
> <https://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>.
>
>
>
> Sure, it is still a level of detail that doesn't add anything. I would
> like to avoid people in 30 years wondering why these sentences are
> here.
Stateful encodings are still a thing.  They may still be a thing in 30
years.
>
>> - Wide multicharacter literals were not a thing, let's not
>> make them one now. Same for conditional character literals
>> and conditional wide character literals.
>>
>> Instead, please add text in (Z) to describe them?
>> ie:
>>
>> - Ordinary and wide character literals consisting of a single
>> basic-c-char, simple-escape-sequence, or
>> universal-character-name that specifies a character that
>> either lacks representation in the associated character
>> encoding or that cannot be encoded as a single code unit
>> are conditionally supported and have an
>> implementation-defined value
>> - A wide character literal consisting of multiple c-chars is
>> conditionally-supported and has an implementation-defined value.
>>
> Giving these odd literals a name was suggested by Core. I agree
> with their suggested direction; giving them a name makes it easier
> to discuss and define them.
>
>
>
> No, especially wide multicharacter literals, which are simply not a
> thing; let's not make them one. The reason multicharacter literals exist
> and have a name is that their type is different from that of character
> literals.
They are a thing in C (see WG14 N2176 (the final draft WP before C18)
6.4.4.4, "Character constants", p11).  I believe their omission in C++
is just an oversight.  Compilers support them.  I think they are a thing
and giving them a name is useful.
> Should I send a mail to Core? Because I really do not like that
> direction. (Especially as what you call a wide multicharacter literal
> doesn't behave at all like a multicharacter literal.) We should also
> look at making them ill-formed rather than giving them a name.

Arguably, you have already sent that mail to Core :)

I don't know what behavioral difference you are concerned about. The
primary reason for differentiating them is to allow the multicharacter
case to be ill-formed (conditionally-supported) and/or to have an
encoding that differs from single c-char literals.

I think the standard should reflect existing practice.  These odd
literals are supported in common compilers.  If you would like to make
them ill-formed, you are certainly free to write a paper, but
implementations are already free to make them ill-formed and I suspect
the ones that don't would retain support for them as an extension anyway.
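
For readers following the type argument above, here is a minimal sketch of how these literals are typed. It is an illustration only, not text from the paper. The name multichar_is_int is hypothetical, and the wide multicharacter form is left commented out because an implementation is free to reject it. Compilers commonly warn about 'ab'.

```cpp
#include <type_traits>

// An ordinary character literal with a single c-char has type char.
static_assert(std::is_same<decltype('a'), char>::value,
              "single c-char: type char");

// An ordinary multicharacter literal is conditionally supported, has type
// int, and has an implementation-defined value.
static_assert(std::is_same<decltype('ab'), int>::value,
              "multiple c-chars: type int");

// A wide character literal with a single c-char has type wchar_t.
static_assert(std::is_same<decltype(L'a'), wchar_t>::value,
              "wide, single c-char: type wchar_t");

// The "wide multicharacter literal" under discussion. Commented out because
// an implementation may reject it; where accepted, the value is
// implementation-defined.
// constexpr wchar_t wide_multi = L'ab';

// Hypothetical helper name, used only to expose the result of the check.
constexpr bool multichar_is_int = std::is_same<decltype('ab'), int>::value;
```

The sketch only shows the type difference between 'a' and 'ab'; it says nothing about which values a given compiler assigns.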

>>
>>
>> Please change
>> The sequence of characters denoted by each contiguous
>> sequence of basic-s-chars, r-chars, simple-escape-sequences
>> ([lex.ccon]), and universal-character-names ([lex.charset])
>> is encoded to a code unit sequence
>> To
>> Each basic-s-char, r-char, simple-escape-sequence
>> ([lex.ccon]), and universal-character-name ([lex.charset])
>> is encoded to a code unit sequence
>>
> The intent is to make it clear that these sequences are encoded as
> a group.  This is necessary for stateful encodings with SI/SO
> characters since such characters don't necessarily contribute a
> code unit sequence on their own.  This was also requested during
> the 2020-03-23 issues processing telecon
> <https://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>.
>
>
> The effect is that I can encode things like e,U+0301 as a single code
> unit, which at the very least should not be allowed in a wording change.
Please read the wording again.  I don't think it states that.  If you
still think it does, please elaborate in detail.
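
As a point of reference for the e, U+0301 example, the following minimal sketch shows what a UTF-16 string literal contains today. It assumes only that char16_t string literals are UTF-16 encoded; the variable names are hypothetical, and the sketch makes no claim about other (for example, stateful) execution encodings.

```cpp
#include <cstddef>

// 'e' followed by U+0301 COMBINING ACUTE ACCENT, spelled with a UCN.
constexpr char16_t decomposed[] = u"e\u0301";

// The precomposed character U+00E9, for comparison.
constexpr char16_t precomposed[] = u"\u00E9";

// Code unit counts, excluding the terminating null code unit.
constexpr std::size_t decomposed_units =
    sizeof(decomposed) / sizeof(decomposed[0]) - 1;
constexpr std::size_t precomposed_units =
    sizeof(precomposed) / sizeof(precomposed[0]) - 1;

// In UTF-16 the two code points stay two code units; they are not merged.
static_assert(decomposed_units == 2, "e + U+0301 is two UTF-16 code units");
static_assert(decomposed[0] == u'e' && decomposed[1] == 0x0301,
              "the code points are encoded in order, unchanged");
static_assert(precomposed_units == 1, "U+00E9 is one UTF-16 code unit");
```
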
> It's also a terrible reason, as c-chars and UCNs are Unicode characters
> at this point and cannot correspond to a stateful character as the
> source of the conversion. Since the thing they are converted to is an
> implementation-defined sequence of code units, the possibility of a
> state shift is implied.

What are you referring to as a "terrible reason"?

SI/SO characters exist in Unicode and can therefore be represented as
UCNs.  In translation phase 5, an implementation can treat them as part
of a shift sequence when converting to the execution encoding.
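
A minimal sketch of that first point, assuming a UTF-16 literal so that the result is portable: U+000E (SHIFT OUT) and U+000F (SHIFT IN) can be written as UCNs inside a string literal. UTF-16 is stateless, so here each one simply contributes its own code unit; how an implementation with a stateful execution encoding would treat them is not shown. The names shift_chars and shift_units are hypothetical.

```cpp
#include <cstddef>

// U+000E SHIFT OUT and U+000F SHIFT IN, spelled as universal-character-names.
// UCNs naming control characters are permitted inside string literals.
constexpr char16_t shift_chars[] = u"\u000E\u000F";

// Code unit count, excluding the terminating null code unit.
constexpr std::size_t shift_units =
    sizeof(shift_chars) / sizeof(shift_chars[0]) - 1;

// In UTF-16 each control character is encoded as one ordinary code unit.
static_assert(shift_units == 2, "SO and SI are one UTF-16 code unit each");
static_assert(shift_chars[0] == 0x000E, "first code unit is U+000E (SO)");
static_assert(shift_chars[1] == 0x000F, "second code unit is U+000F (SI)");
```
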

>>
>>
>>
>> - Please replace "applicable character encoding" with "character
>> encoding".
>>
> That doesn't seem correct to me; the wording needs to indicate
> which character encoding.  Note that there are three occurrences
> of "applicable associated character encoding"; I'm not sure which
> use you were referring to.
>
>
> Missed a word. Sorry. I meant "associated character encoding". "Applicable
> associated" doesn't add anything. Maybe "the literal's associated
> encoding".

That says the same thing to me.  If CWG expresses a preference, I'll
change it.

Tom.



SG16 list run by sg16-owner@lists.isocpp.org