SG16

Subject: Re: New draft revision: D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals)
From: Corentin Jabot (corentinjabot_at_[hidden])
Date: 2020-06-30 00:31:34


On Tue, 30 Jun 2020 at 06:49, Tom Honermann <tom_at_[hidden]> wrote:

> On 6/30/20 12:15 AM, Corentin Jabot wrote:
>
>
>
> On Tue, Jun 30, 2020, 05:52 Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 6/28/20 2:03 AM, Corentin Jabot wrote:
>>
>>
>>
>> On Sun, 28 Jun 2020 at 07:37, Corentin Jabot <corentinjabot_at_[hidden]>
>> wrote:
>>
>>>
>>>
>>> On Sun, Jun 28, 2020, 06:50 Tom Honermann via SG16 <
>>> sg16_at_[hidden]> wrote:
>>>
>>>> A new draft revision of P2029 (Proposed resolution for core issues 411,
>>>> 1656, and 2333; numeric and universal character escapes in character and
>>>> string literals) is now available at
>>>> https://rawgit.com/sg16-unicode/sg16/master/papers/d2029r2.html. This
>>>> addresses the CWG feedback provided during the March 23rd, 2020 core
>>>> issues processing teleconference
>>>> <http://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>
>>>> .
>>>>
>>>> Wording review feedback prior to the next Core issues processing
>>>> teleconference would be much appreciated!
>>>>
>>> I really like the overall direction, a few comments:
>>> - Can we not make conditionally supported escape sequences part of the
>>> grammar?
>>>
>> This was requested by Core in the 2020-01-16 issues processing telecon
>> <https://wiki.edg.com/bin/view/Wg21prague/IssuesProcessingTeleconference2020-01-16>
>> .
>>
>>
>> What I would do:
>> simple-escape-sequence:
>> any member of the basic source character set other than u, U, x, and
>> the members of octal-digit
>>
>> And in 5.13, keep
>> Escape sequences not listed in Table 9 are conditionally supported, with
>> implementation-defined semantics
>>
>> What problem would that solve?
>>
>
> Not having separated grammar for non standard features, simpler grammar.
>
> I prefer the current approach in the paper, but I have no objection to
> doing what you suggest if the CWG expresses such a preference.
>
>
>
>>
>>
>>> - Can we not add notes for stateful encodings? It doesn't add anything.
>>>
>> Stateful encodings were discussed in the 2020-03-23 issues processing
>> telecon
>> <https://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>
>> .
>>
>
> Sure, it is still a level of detail that doesn't add anything. I would
> like to avoid people in 30 years wondering why these sentences are here.
>
> Stateful encodings are still a thing. They may still be a thing in 30
> years.
>
I am not saying they aren't and wouldn't be, I am saying that the current
wording was enough for that to be implemented correctly while the new
wording does not.

> -- Wide multi character literals were not a thing, let's not make them one
>>> now. same for conditional character literals and conditional wide
>>> character literals.
>>>
>>> Instead, please add text in (Z) to describe them?
>>> ie:
>>>
>>> - Ordinary and wide character literals consisting of a single
>>> basic-c-char, simple-escape-sequence, or universal-character-name that
>>> specifies a character that either lacks representation in the associated
>>> character encoding or that cannot be encoded as a single code unit
>>> are conditionally-supported and have an implementation-defined value.
>>> - A wide character literal consisting of multiple c-chars is
>>> conditionally-supported and has an implementation-defined value.
>>>
>> Giving these odd literals a name was suggested by Core. I agree with
>> their suggested direction; giving them a name makes it easier to discuss
>> and define them.
>>
>
>
> No, especially wide multi character literals, which are simply not a thing;
> let's not make them one. The reason multi character literals exist and have a
> name is because their type is different from character literals.
>
> They are a thing in C (see WG14 N2176 (the final draft WP before C18)
> 6.4.4.4, "Character constants", p11). I believe their omission in C++ is
> just an oversight. Compilers support them. I think they are a thing and
> giving them a name is useful.
>
> They don't have a name in C either

> Should I send a mail to Core? Because I really do not like that direction
> (especially as what you call a wide multi character literal doesn't behave at
> all like a multi character literal). We should also look at making them
> ill-formed rather than giving them a name.
>
> Arguably, you have already sent that mail to Core :)
>
Haha indeed, nice :)

> I don't know what behavioral difference you are concerned about. The
> primary reason for differentiating them is to allow the multicharacter case
> to be ill-formed (conditionally-supported) and/or to have an encoding that
> differs from single c-char literals.
>
> I think the standard should reflect existing practice. These odd literals
> are supported in common compilers. If you would like to make them
> ill-formed, you are certainly free to write a paper, but implementations
> are already free to make them ill-formed and I suspect the ones that don't
> would retain support for them as an extension anyway.
>
I am very concerned about giving names to anti-features that didn't have a
name for the past 30 years, especially those that are not used and were
previously not a thing in C++ (I guess we disagree on our reading of the C
standard). I am not concerned about behavior changes.
Describing them in a bullet point rather than in this table keeps the table
readable and meaningful, and leaves us with the following features:

* ordinary/wide/utf character literal
* multi character literal

Which is then mostly symmetric with the table for strings.
The bullet point then describes these odd behaviors, which again is not a
behavior change. We are just talking about naming and presentation.

(Multi character literal does need a name, both because it has always had one
and because it has a different type; it is also somewhat used.)
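
To illustrate the point about types, here is a minimal sketch. It assumes an implementation that supports ordinary multicharacter literals (they are conditionally-supported, and compilers commonly emit a warning for them). The helper name `multichar_literal_is_int` is hypothetical and only exists for this illustration. Wide multicharacter literals are deliberately left out, since whether an implementation accepts them is exactly what is under discussion.

```cpp
#include <type_traits>

// An ordinary character literal with a single c-char has type char.
static_assert(std::is_same<decltype('a'), char>::value, "'a' is a char");

// A wide character literal with a single c-char has type wchar_t.
static_assert(std::is_same<decltype(L'a'), wchar_t>::value, "L'a' is a wchar_t");

// An ordinary multicharacter literal has type int, not char.
// Its value is implementation-defined, so only the type is checked here.
static_assert(std::is_same<decltype('ab'), int>::value, "'ab' is an int");

// Hypothetical helper so the same property can be checked at run time.
constexpr bool multichar_literal_is_int() {
    return std::is_same<decltype('ab'), int>::value;
}
```

The type difference is what distinguishes the ordinary multicharacter case from the single-character case.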

>
>>>
>>> Please change
>>> The sequence of characters denoted by each contiguous sequence of
>>> basic-s-chars, r-chars, simple-escape-sequences ([lex.ccon]), and
>>> universal-character-names ([lex.charset]) is encoded to a code unit sequence
>>> To
>>> Each basic-s-chars, r-chars, simple-escape-sequences ([lex.ccon]), and
>>> universal-character-names ([lex.charset]) is encoded to a code unit sequence
>>>
>> The intent is to make it clear that these sequences are encoded as a
>> group. This is necessary for stateful encodings with SI/SO characters
>> since such characters don't necessarily contribute a code unit sequence on
>> their own. This was also requested during the 2020-03-23 issues
>> processing telecon
>> <https://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>
>> .
>>
>
> The effect is that I can encode things like e,U+0301 as a single code
> unit, which at the very least should not be allowed in a wording change.
>
> Please read the wording again. I don't think it states that. If you
> still think it does, please elaborate in detail.
>

 You use the term character (which in this context is a synonym of abstract
character):
 The sequence of *characters* denoted by *each contiguous sequence* of
basic-s-chars, r-chars, simple-escape-sequences ([lex.ccon]), and
universal-character-names ([lex.charset]) is encoded to a code unit
sequence using the string-literal's associated character encoding. If a
*character* lacks [...]

Maybe: Each codepoint denoted by a single basic-s-char, r-char,
simple-escape-sequence ([lex.ccon]), or universal-character-name is
encoded to a code unit sequence using the string-literal's associated
character encoding. If that codepoint lacks representation in the
associated character encoding,

Note that codepoint isn't particularly meaningful in this context; it could
be "element", for example. The point is that the sequence is not converted as a
whole.
Changing that is a design change (I don't have a terribly strong opinion either
way, but it needs to be discussed outside of Core, notably because it would
allow implementations to handle combining characters differently).
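
As a concrete illustration of the e, U+0301 case, here is a minimal sketch using a UTF-8 string literal, where the encoding is fixed and each element is encoded on its own. The helper names `code_unit_count` and `code_unit_at` are hypothetical and exist only for this example. It says nothing about what an implementation may do for the ordinary or wide literal encodings, which is the open question.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_unit_count(const CharT (&)[N]) {
    return N - 1;
}

// Hypothetical helper: value of the i-th code unit as an unsigned number.
// Works whether u8 literals use char (C++17) or char8_t (C++20).
template <typename CharT, std::size_t N>
constexpr unsigned code_unit_at(const CharT (&s)[N], std::size_t i) {
    return static_cast<unsigned char>(s[i]);
}

// 'e' followed by U+0301 COMBINING ACUTE ACCENT: three UTF-8 code units,
// one for 'e' and two for U+0301. The pair is not composed into U+00E9.
static_assert(code_unit_count(u8"e\u0301") == 3, "e + U+0301 is 3 code units");
static_assert(code_unit_at(u8"e\u0301", 0) == 0x65, "'e'");
static_assert(code_unit_at(u8"e\u0301", 1) == 0xCC, "first code unit of U+0301");
static_assert(code_unit_at(u8"e\u0301", 2) == 0x81, "second code unit of U+0301");

// The precomposed U+00E9 is a different code unit sequence: 0xC3 0xA9.
static_assert(code_unit_count(u8"\u00E9") == 2, "U+00E9 is 2 code units");
static_assert(code_unit_at(u8"\u00E9", 0) == 0xC3, "first code unit of U+00E9");
static_assert(code_unit_at(u8"\u00E9", 1) == 0xA9, "second code unit of U+00E9");
```

If a contiguous sequence were encoded as a whole, nothing in the wording would obviously stop an implementation from producing the second form for the first input in some other encoding.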

> It's also a terrible reason, as c-chars and UCNs are Unicode characters at
> this point and cannot correspond to a stateful character as the source of
> the conversion. The thing they are converted to being an
> implementation-defined sequence of code units, the possibility of a state
> shift is implied.
>
> What are you referring to as a "terrible reason"?
>
That:
> The intent is to make it clear that these sequences are encoded as a
> group. This is necessary for stateful encodings with SI/SO characters
> since such characters don't necessarily contribute a code unit sequence on
> their own

Either:
 - These characters appear as UCNs, and they should in fact contribute to a
code unit sequence
 - They are used as part of a stateful source encoding and would not have
been preserved past phase 1.

> SI/SO characters exist in Unicode and can therefore be represented as
> UCNs. In translation phase 5, an implementation can treat them as part of
> a shift sequence when converting to the execution encoding.
>
Again that is a design change

>
>>>
>>>
>>> - please replace applicable character encoding by character encoding
>>>
>> That doesn't seem correct to me; the wording needs to indicate which
>> character encoding. Note that there are three occurrences of "applicable
>> associated character encoding"; I'm not sure which use you were referring
>> to.
>>
>
> Missed a word. Sorry. Meant associated character encoding. "Applicable
> associated" doesn't add anything. Maybe "the literal's associated
> encoding"
>
> That says the same thing to me. If CWG expresses a preference, I'll
> change it.
>
Yes, it does; I am just trying to be consistent in terminology. "Associated
literal encoding" is consistent and is what SG16 has been using (including you,
maybe you came up with that term :p)

> Tom.
>



SG16 list run by sg16-owner@lists.isocpp.org