SG16

Subject: Re: New draft revision: D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals)
From: Corentin Jabot (corentinjabot_at_[hidden])
Date: 2020-06-30 09:49:29


On Tue, Jun 30, 2020, 16:32 Tom Honermann <tom_at_[hidden]> wrote:

> On 6/30/20 1:31 AM, Corentin Jabot wrote:
>
>
>
> On Tue, 30 Jun 2020 at 06:49, Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 6/30/20 12:15 AM, Corentin Jabot wrote:
>>
>>
>>
>> On Tue, Jun 30, 2020, 05:52 Tom Honermann <tom_at_[hidden]> wrote:
>>
>>> On 6/28/20 2:03 AM, Corentin Jabot wrote:
>>>
>>>
>>>
>>> On Sun, 28 Jun 2020 at 07:37, Corentin Jabot <corentinjabot_at_[hidden]>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Sun, Jun 28, 2020, 06:50 Tom Honermann via SG16 <
>>>> sg16_at_[hidden]> wrote:
>>>>
>>>>> A new draft revision of P2029 (Proposed resolution for core issues
>>>>> 411, 1656, and 2333; numeric and universal character escapes in character
>>>>> and string literals) is now available at
>>>>> https://rawgit.com/sg16-unicode/sg16/master/papers/d2029r2.html.
>>>>> This addresses the CWG feedback provided during the March 23rd, 2020
>>>>> core issues processing teleconference
>>>>> <http://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>
>>>>> .
>>>>>
>>>>> Wording review feedback prior to the next Core issues processing
>>>>> teleconference would be much appreciated!
>>>>>
>>>> I really like the overall direction, a few comments:
>>>> - Can we not make conditionally supported escape sequences part of the
>>>> grammar?
>>>>
>>> This was requested by Core in the 2020-01-16 issues processing telecon
>>> <https://wiki.edg.com/bin/view/Wg21prague/IssuesProcessingTeleconference2020-01-16>
>>> .
>>>
>>>
>>> What I would do:
>>> simple-escape-sequence:
>>> any member of the basic source character set other than u, U, x, and
>>> the members of octal-digit
>>>
>>> And in 5.13, keep
>>> Escape sequences not listed in Table 9 are conditionally supported, with
>>> implementation-defined semantics
>>>
>>> What problem would that solve?
>>>
>>
>> Not having a separate grammar for non-standard features; a simpler grammar.
>>
>> I prefer the current approach in the paper, but I have no objection to
>> doing what you suggest if the CWG expresses such a preference.
>>
>>
>>
>>>
>>>
>>>> - Can we not add notes for stateful encodings? It doesn't add anything.
>>>>
>>> Stateful encodings were discussed in the 2020-03-23 issues processing
>>> telecon
>>> <https://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>
>>> .
>>>
>>
>> Sure, it is still a level of detail that doesn't add anything. I would
>> like to avoid people in 30 years wondering why these sentences are here.
>>
>> Stateful encodings are still a thing. They may still be a thing in 30
>> years.
>>
> I am not saying they aren't and wouldn't be, I am saying that the current
> wording was enough for that to be implemented correctly while the new
> wording does not.
>
> I'm not following. What do you believe the new wording changes?
> Discussion of stateful encodings is limited to non-normative notes.
>
>
>
>>>> - Wide multi character literals were not a thing, let's not make them
>>>> one now. Same for conditional character literals and conditional wide
>>>> character literals.
>>>>
>>>> Instead, please add text in (Z) to describe them?
>>>> ie:
>>>>
>>>> - Ordinary and wide character literals consisting of a single
>>>> basic-c-char, simple-escape-sequence, or universal-character-name that
>>>> specifies a character that either lacks representation in the associated
>>>> character encoding or that cannot be encoded as a single code unit
>>>> are conditionally supported and have an implementation-defined value.
>>>> - A wide character literal consisting of multiple c-chars is
>>>> conditionally-supported and has an implementation-defined value.
>>>>
>>> Giving these odd literals a name was suggested by Core. I agree with
>>> their suggested direction; giving them a name makes it easier to discuss
>>> and define them.
>>>
>>
>>
>> No, especially wide multi character literals, which are simply not a thing;
>> let's not make them one. The reason multi character literals exist and have
>> a name is that their type is different from character literals.
>>
>> They are a thing in C (see WG14 N2176 (the final draft WP before C18)
>> 6.4.4.4, "Character constants", p11). I believe their omission in C++ is
>> just an oversight. Compilers support them. I think they are a thing and
>> giving them a name is useful.
>>
>> They don't have a name in C either
>
> I don't see how giving them a name is in any way detrimental.
>
>
>
>> Should I send a mail to core? Because I really do not like that
>> direction. (Especially as what you call a wide multi character literal
>> doesn't behave at all like a multi character literal.) We should also look
>> at making them ill-formed rather than giving them a name.
>>
>> Arguably, you have already sent that mail to Core :)
>>
> Haha indeed, nice :)
>
>
>> I don't know what behavioral difference you are concerned about. The
>> primary reason for differentiating them is to allow the multicharacter case
>> to be ill-formed (conditionally-supported) and/or to have an encoding that
>> differs from single c-char literals.
>>
>> I think the standard should reflect existing practice. These odd
>> literals are supported in common compilers. If you would like to make them
>> ill-formed, you are certainly free to write a paper, but implementations
>> are already free to make them ill-formed and I suspect the ones that don't
>> would retain support for them as an extension anyway.
>>
> I am very concerned about giving names to anti-features that didn't have a
> name for the past 30 years, especially those that are not used and were
> previously not a thing in C++ (I guess we disagree on our reading of the C
> standard). I am not concerned about behavior changes.
> Describing them in a bullet point, rather than in this table, keeps the
> table readable and meaningful and leaves us with the following features:
>
> * ordinary/wide/utf character literal
> * multi character literal
>
> Which is then mostly symmetric with the table for strings.
> The bullet point then describes these odd behaviors, which again is not a
> behavior change. We are just talking about naming and presentation.
>
> (multi character literal does need a name, both because it always had one
> and because it has a different type; it is also somewhat used)
>
> Whether you consider these literal kinds anti-features is not relevant.
> Again, I don't see how giving them names is a bad thing; it doesn't
> legitimize them any more than describing them in a bullet list does. A
> name can acquire both positive and negative connotations. The current
> direction in the paper reflects CWG guidance. If you would like to
> participate in the next CWG review of this paper and argue your position,
> please feel free. I don't intend to argue it further in this email thread.
>
>
>>>>
>>>> Please change
>>>> The sequence of characters denoted by each contiguous sequence of
>>>> basic-s-chars, r-chars, simple-escape-sequences ([lex.ccon]), and
>>>> universal-character-names ([lex.charset]) is encoded to a code unit sequence
>>>> To
>>>> Each basic-s-chars, r-chars, simple-escape-sequences ([lex.ccon]), and
>>>> universal-character-names ([lex.charset]) is encoded to a code unit sequence
>>>>
>>> The intent is to make it clear that these sequences are encoded as a
>>> group. This is necessary for stateful encodings with SI/SO characters
>>> since such characters don't necessarily contribute a code unit sequence on
>>> their own. This was also requested during the 2020-03-23 issues
>>> processing telecon
>>> <https://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>
>>> .
>>>
>>
>> The effect is that I can encode things like e,U+0301 as a single code
>> unit, which at the very least should not be allowed in a wording change.
>>
>> Please read the wording again. I don't think it states that. If you
>> still think it does, please elaborate in detail.
>>
>
> You use the term character (which in this context is a synonym of abstract
> character):
> The sequence of *characters* denoted by *each contiguous sequence* of
> basic-s-chars, r-chars, simple-escape-sequences ([lex.ccon]), and
> universal-character-names ([lex.charset]) is encoded to a code unit
> sequence using the string-literal's associated character encoding. If a
> *character* lacks [...]
>
> Maybe: Each codepoint denoted by a single basic-s-chars, r-chars,
> simple-escape-sequences ([lex.ccon]), and universal-character-names is
> encoded to a code unit sequence using the string-literal's associated
> character encoding. If that codepoint lacks representation in the
> associated character encoding,
>
> Note that codepoint isn't particularly meaningful in this context; it could
> be "element", for example. The point is that the sequence is not converted
> as a whole.
> Changing that is design (I don't have a terribly strong opinion either
> way, but it needs to be discussed outside of core, notably because it would
> allow implementations to handle combining characters differently).
>
> Again, that wording reflects prior CWG guidance. Since the (wide)
> execution encoding is implementation-defined, I don't agree that this
> reflects a design change, but I defer to our benevolent CWG chair to make
> that determination as he sees fit.
>

Please understand that none of my comments are addressed at you, but
rather at the proposed wording, which I understand is following CWG
guidance. I am disagreeing with that guidance.

>
>
>
>
>> It's also a terrible reason, as c-chars and UCNs are Unicode characters at
>> this point and cannot correspond to a stateful character as the source of
>> the conversion. Since the thing they are converted to is an
>> implementation-defined sequence of code units, the possibility of a state
>> shift is implied.
>>
>> What are you referring to as a "terrible reason"?
>>
> That
> > The intent is to make it clear that these sequences are encoded as a
> group. This is necessary for stateful encodings with SI/SO characters
> since such characters don't necessarily contribute a code unit sequence on
> their own
>
> Either:
> - These characters appear as UCNs and they should in fact contribute to a
> code unit sequence
> - They are used as part of a stateful source encoding and would not have
> been conserved past phase 1.
>
> In discussion elsewhere, we've discussed the example of a decomposed 'é'
> and noted that at least some compilers will encode the 'e' and combining
> acute separately, perhaps substituting a character for the combining acute
> if the execution character set doesn't support combining characters as
> distinct characters. Unless someone has proven otherwise, I believe it
> would be conforming for an implementation to convert the decomposed code
> point sequence to a composed character that is representable in the
> execution encoding (e.g., ISO-8859-1; noting that such normalization is not
> desirable for Unicode encodings). I see no reason to be more restrictive
> here.
>

We have established elsewhere that implementations can't currently do that
in phase 5 and that they should not be allowed to (we didn't poll that,
but either way this paper changes the status quo).
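
For illustration only, here is a minimal Python sketch of the two behaviors
being debated for a decomposed 'é', using Python's "latin-1" codec as a
stand-in for an ISO-8859-1 execution encoding. The helper names
encode_per_element and encode_as_whole are hypothetical, the '?' substitute
is an assumed fallback, and composing via NFC is only one way an
implementation might treat the sequence as a whole; none of this is taken
from a compiler or from the proposed wording.

```python
import unicodedata

DECOMPOSED = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT


def encode_per_element(text, encoding, substitute="?"):
    """Encode each code point on its own, substituting unencodable ones."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            # Assumed fallback: replace the element that has no representation.
            out += substitute.encode(encoding)
    return bytes(out)


def encode_as_whole(text, encoding):
    """Compose the whole sequence first (NFC), then encode the result."""
    return unicodedata.normalize("NFC", text).encode(encoding)


print(encode_per_element(DECOMPOSED, "latin-1"))  # 'e' plus a substitute
print(encode_as_whole(DECOMPOSED, "latin-1"))     # a single code unit
```

The per-element version yields two code units and the whole-sequence version
yields one, which is the observable difference under discussion.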

>> SI/SO characters exist in Unicode and can therefore be represented as
>> UCNs. In translation phase 5, an implementation can treat them as part of
>> a shift sequence when converting to the execution encoding.
>>
> Again, that is a design change.
>
> Per prior comments, I disagree and trust that the CWG chair will make an
> appropriate determination about this.
>

Sure. I'd be happy to participate in the review, let me know!
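
As a rough illustration of the grouping argument quoted earlier (that shift
characters do not necessarily contribute a code unit sequence on their own),
the following Python sketch encodes two characters with a stateful encoding,
once as a group and once element by element. It assumes Python's
"iso2022_jp" codec as a stand-in: that encoding switches state with escape
sequences rather than SI/SO, but the effect on grouping is the same in
principle. The variable names are illustrative only.

```python
TEXT = "\u3042\u3044"  # two hiragana characters, written as escapes

# Encoded as one group: one shift into the double-byte state and one shift
# back to ASCII at the end.
grouped = TEXT.encode("iso2022_jp")

# Encoded one element at a time: every element carries its own pair of shifts.
piecewise = b"".join(ch.encode("iso2022_jp") for ch in TEXT)

print(len(grouped), len(piecewise))
print(grouped.decode("iso2022_jp") == piecewise.decode("iso2022_jp"))
```

Both byte strings decode to the same text, but the code unit sequences
differ, so whether elements are encoded individually or as a contiguous
group is observable for such encodings.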

>
>
>>
>>>>
>>>>
>>>> - please replace applicable character encoding by character encoding
>>>>
>>> That doesn't seem correct to me; the wording needs to indicate which
>>> character encoding. Note that there are three occurrences of "applicable
>>> associated character encoding"; I'm not sure which use you were referring
>>> to.
>>>
>>
>> Missed a word. Sorry. Meant associated character encoding. "Applicable
>> associated" doesn't add anything. Maybe "the literal's associated
>> encoding".
>>
>> That says the same thing to me. If CWG expresses a preference, I'll
>> change it.
>>
> Yes it does, just trying to be consistent in terminology. "Associated
> literal encoding" is consistent and what SG16 has been using (including you,
> maybe you came up with that term :p)
>
> There are multiple "associated character encodings" to choose from;
> "applicable" indicates which one to choose. Again, if the CWG requests a
> change, I'll change it.
>
> Tom.
>
>
>
>> Tom.
>>
>
>



SG16 list run by sg16-owner@lists.isocpp.org