sg16: Re: [SG16] [isocpp-core] New draft revision: D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals)

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Tue, 30 Jun 2020 18:39:48 +0200

On Tue, 30 Jun 2020 at 18:19, Tom Honermann <tom_at_[hidden]> wrote:

> On 6/30/20 10:49 AM, Corentin Jabot via Core wrote:
>
>
>
> On Tue, Jun 30, 2020, 16:32 Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 6/30/20 1:31 AM, Corentin Jabot wrote:
>>
>>
>>
>> On Tue, 30 Jun 2020 at 06:49, Tom Honermann <tom_at_[hidden]> wrote:
>>
>>> On 6/30/20 12:15 AM, Corentin Jabot wrote:
>>>
>>>
>>>
>>> On Tue, Jun 30, 2020, 05:52 Tom Honermann <tom_at_[hidden]> wrote:
>>>
>>>> On 6/28/20 2:03 AM, Corentin Jabot wrote:
>>>>
>>>>
>>>>
>>>> On Sun, 28 Jun 2020 at 07:37, Corentin Jabot <corentinjabot_at_[hidden]>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Sun, Jun 28, 2020, 06:50 Tom Honermann via SG16 <
>>>>> sg16_at_[hidden]> wrote:
>>>>>
>>>>>> A new draft revision of P2029 (Proposed resolution for core issues
>>>>>> 411, 1656, and 2333; numeric and universal character escapes in character
>>>>>> and string literals) is now available at
>>>>>> https://rawgit.com/sg16-unicode/sg16/master/papers/d2029r2.html.
>>>>>> This addresses the CWG feedback provided during the March 23rd, 2020
>>>>>> core issues processing teleconference
>>>>>> <http://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>
>>>>>> .
>>>>>>
>>>>>> Wording review feedback prior to the next Core issues processing
>>>>>> teleconference would be much appreciated!
>>>>>>
>>>>> I really like the overall direction, a few comments:
>>>>> - Can we not make conditionally supported escape sequences part of the
>>>>> grammar?
>>>>>
>>>> This was requested by Core in the 2020-01-16 issues processing telecon
>>>> <https://wiki.edg.com/bin/view/Wg21prague/IssuesProcessingTeleconference2020-01-16>
>>>> .
>>>>
>>>>
>>>> What I would do:
>>>> simple-escape-sequence:
>>>> any member of the basic source character set other than u, U, x,
>>>> and the members of octal-digit
>>>>
>>>> And in 5.13, keep
>>>> Escape sequences not listed in Table 9 are conditionally supported,
>>>> with implementation-defined semantics
>>>>
>>>> What problem would that solve?
>>>>
>>>
>>> Not having separated grammar for non standard features, simpler grammar.
>>>
>>> I prefer the current approach in the paper, but I have no objection to
>>> doing what you suggest if the CWG expresses such a preference.
>>>
>>>
>>>
>>>>
>>>>
>>>>> - Can we not add notes for stateful encodings? It doesn't add anything.
>>>>>
>>>> Stateful encodings were discussed in the 2020-03-23 issues processing
>>>> telecon
>>>> <https://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>
>>>> .
>>>>
>>>
>>> Sure, it is still a level of detail that doesn't add anything. I would
>>> like to avoid people in 30 years wondering why these sentences are here.
>>>
>>> Stateful encodings are still a thing. They may still be a thing in 30
>>> years.
>>>
>> I am not saying they aren't and wouldn't be, I am saying that the current
>> wording was enough for that to be implemented correctly while the new
>> wording does not.
>>
>> I'm not following. What do you believe the new wording changes?
>> Discussion of stateful encodings is limited to non-normative notes.
>>
>>
>>
>>>>> - Wide multi character literals were not a thing, let's not make them
>>>>> one now. Same for conditional character literals and conditional wide
>>>>> character literals.
>>>>>
>>>>> Instead, please add text in (Z) to describe them?
>>>>> ie:
>>>>>
>>>>> - Ordinary and wide character literals consisting of a single
>>>>> basic-c-char, simple-escape-sequence, or universal-character-name that
>>>>> specifies a character that either lacks representation in the associated
>>>>> character encoding or that cannot be encoded as a single code unit
>>>>> are conditionally supported and have an implementation-defined value
>>>>> - A wide character literal consisting of multiple c-chars is
>>>>> conditionally-supported and has an implementation-defined value.
>>>>>
>>>> Giving these odd literals a name was suggested by Core. I agree with
>>>> their suggested direction; giving them a name makes it easier to discuss
>>>> and define them.
>>>>
>>>
>>>
>>> No, especially wide multi character literals, which are simply not a
>>> thing; let's not make them one. The reason multi character literals exist
>>> and have a name is that their type is different from that of character
>>> literals.
>>>
>>> They are a thing in C (see WG14 N2176 (the final draft WP before C18)
>>> 6.4.4.4, "Character constants", p11). I believe their omission in C++ is
>>> just an oversight. Compilers support them. I think they are a thing and
>>> giving them a name is useful.
>>>
>>> They don't have a name in C either
>>
>> I don't see how giving them a name is in any way detrimental.
>>
>>
>>
>>> Should I send a mail to Core? Because I really do not like that
>>> direction. (Especially as what you call a wide multi character literal
>>> doesn't behave at all like multi character literals.) We should also look
>>> at making them ill-formed rather than giving them a name.
>>>
>>> Arguably, you have already sent that mail to Core :)
>>>
>> Haha indeed, nice :)
>>
>>
>>> I don't know what behavioral difference you are concerned about. The
>>> primary reason for differentiating them is to allow the multicharacter case
>>> to be ill-formed (conditionally-supported) and/or to have an encoding that
>>> differs from single c-char literals.
>>>
>>> I think the standard should reflect existing practice. These odd
>>> literals are supported in common compilers. If you would like to make them
>>> ill-formed, you are certainly free to write a paper, but implementations
>>> are already free to make them ill-formed and I suspect the ones that don't
>>> would retain support for them as an extension anyway.
>>>
>> I am very concerned about giving names to anti-features that didn't have
>> a name for the past 30 years, especially those that are not used and were
>> previously not a thing in C++ (I guess we disagree on our reading of the C
>> standard). I am not concerned about behavior changes.
>> Describing them in a bullet point, rather than in this table, keeps the
>> table readable and meaningful and leaves us with the following features:
>>
>> * ordinary/wide/utf character literal
>> * multi character literal
>>
>> Which is then mostly symmetric with the table for strings.
>> The bullet point then describes these odd behaviors, which again is not
>> a behavior change. We are just talking about naming and presentation.
>>
>> (A multi character literal does need a name, both because it always had
>> one and because it has a different type. It is also somewhat used.)
>>
>> Whether you consider these literal kinds anti-features is not relevant.
>> Again, I don't see how giving them names is a bad thing; it doesn't
>> legitimize them any more than describing them in a bullet list does. A
>> name can acquire both positive and negative connotations. The current
>> direction in the paper reflects CWG guidance. If you would like to
>> participate in the next CWG review of this paper and argue your position,
>> please feel free. I don't intend to argue it further in this email thread.
>>
>>
>>>>>
>>>>> Please change
>>>>> The sequence of characters denoted by each contiguous sequence of
>>>>> basic-s-chars, r-chars, simple-escape-sequences ([lex.ccon]), and
>>>>> universal-character-names ([lex.charset]) is encoded to a code unit sequence
>>>>> To
>>>>> Each basic-s-chars, r-chars, simple-escape-sequences ([lex.ccon]), and
>>>>> universal-character-names ([lex.charset]) is encoded to a code unit sequence
>>>>>
>>>> The intent is to make it clear that these sequences are encoded as a
>>>> group. This is necessary for stateful encodings with SI/SO characters
>>>> since such characters don't necessarily contribute a code unit sequence on
>>>> their own. This was also requested during the 2020-03-23 issues
>>>> processing telecon
>>>> <https://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>
>>>> .
>>>>
>>>
>>> The effect is that I can encode things like e,U+0301 as a single code
>>> unit, which at the very least should not be allowed in a wording change.
>>>
>>> Please read the wording again. I don't think it states that. If you
>>> still think it does, please elaborate in detail.
>>>
>>
>> You use the term "character" (which in this context is a synonym for
>> "abstract character"):
>> The sequence of *characters* denoted by *each contiguous sequence *of
>> basic-s-chars, r-chars, simple-escape-sequences ([lex.ccon]), and
>> universal-character-names ([lex.charset]) is encoded to a code unit
>> sequence using the string-literal's associated character encoding. If a
>> *character* lacks [...]
>>
>> Maybe: Each code point denoted by a single basic-s-char, r-char,
>> simple-escape-sequence ([lex.ccon]), or universal-character-name is
>> encoded to a code unit sequence using the string-literal's associated
>> character encoding. If that code point lacks representation in the
>> associated character encoding,
>>
>> Note that "code point" isn't particularly meaningful in this context; it
>> could be "element", for example. The point is that the sequence is not
>> converted as a whole.
>> Changing that is a design change (I don't have a terribly strong opinion
>> either way, but it needs to be discussed outside of Core, notably because
>> it would allow implementations to handle combining characters differently).
>>
>> Again, that wording reflects prior CWG guidance. Since the (wide)
>> execution encoding is implementation-defined, I don't agree that this
>> reflects a design change, but I defer to our benevolent CWG chair to make
>> that determination as he sees fit.
>>
>
> Please understand that none of my comments are addressed at you, but
> rather at the proposed wording, which I understand is following CWG
> guidance. I am disagreeing with that guidance.
>
>>
>>
>>
>>
>>> It's also a terrible reason, as c-chars and UCNs are Unicode characters at
>>> this point and cannot correspond to a stateful character as the source of
>>> the conversion. Since what they are converted to is an
>>> implementation-defined sequence of code units, the possibility of a state
>>> shift is implied.
>>>
>>> What are you referring to as a "terrible reason"?
>>>
>> That
>> > The intent is to make it clear that these sequences are encoded as a
>> group. This is necessary for stateful encodings with SI/SO characters
>> since such characters don't necessarily contribute a code unit sequence on
>> their own
>>
>> Either:
>> - These characters appear as UCNs and they should in fact contribute to
>> a code unit sequence
>> - They are used as part of a stateful source encoding and would not have
>> been preserved past phase 1.
>>
>> In discussion elsewhere, we've discussed the example of a decomposed 'é'
>> and noted that at least some compilers will encode the 'e' and combining
>> acute separately, perhaps substituting a character for the combining acute
>> if the execution character set doesn't support combining characters as
>> distinct characters. Unless someone has proven otherwise, I believe it
>> would be conforming for an implementation to convert the decomposed code
>> point sequence to a composed character that is representable in the
>> execution encoding (e.g., ISO-8859-1; noting that such normalization is not
>> desirable for Unicode encodings). I see no reason to be more restrictive
>> here.
>>
>
>
> We have established elsewhere that implementations can't currently do that
> in phase 5 and that they should not be allowed to (we didn't poll that,
> but either way this paper changes the status quo).
>
> I don't agree that has been established. If you feel otherwise, please
> provide a reference. I don't think this changes the status quo since both
> translation phases 1 and 5 are implementation-defined (and I believe it has
> been established that implementations are granted considerable freedoms
> here; e.g., trigraphs and implementations that map "private" to "public").
>

I am only talking about phase 5, not phase 1.
Yes, an implementation can compose or decompose at will *in phase 1*.
The important bits for phase 5 are mostly here:
http://eel.is/c++draft/lex.string#13
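
For illustration only, here is a minimal sketch of what "not converted as a
whole" looks like for the Unicode literal kinds, whose encodings are fixed by
the standard. It is not taken from the paper; `code_units` is a hypothetical
helper introduced just for this example. Ordinary and wide literals are not
checked here because their encodings are implementation-defined, which is
exactly the case under discussion.

```cpp
#include <cstddef>

// Hypothetical helper (not from P2029): number of code units in a string
// literal, excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// Decomposed form: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
// Each element is encoded on its own; nothing is composed into U+00E9.
static_assert(code_units(u8"e\u0301") == 3, "UTF-8: 1 + 2 code units");
static_assert(code_units(u"e\u0301") == 2, "UTF-16: 1 + 1 code units");
static_assert(code_units(U"e\u0301") == 2, "UTF-32: 1 + 1 code units");

// Precomposed form: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
static_assert(code_units(u8"\u00E9") == 2, "UTF-8: 2 code units");
static_assert(code_units(U"\u00E9") == 1, "UTF-32: 1 code unit");
```

All checks are evaluated at compile time, so a successful compile of this
translation unit is the whole test. An implementation that composed the pair
into a single element in phase 5 would produce different counts.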

> Tom.
>
> SI/SO characters exist in Unicode and can therefore be represented as
>>> UCNs. In translation phase 5, an implementation can treat them as part of
>>> a shift sequence when converting to the execution encoding.
>>>
>> Again that is a design change
>>
>> Per prior comments, I disagree and trust that the CWG chair will make an
>> appropriate determination about this.
>>
>
> Sure. I'd be happy to participate in the review, let me know!
>
>>
>>
>>>
>>>>>
>>>>>
>>>>> - please replace applicable character encoding by character encoding
>>>>>
>>>> That doesn't seem correct to me; the wording needs to indicate which
>>>> character encoding. Note that there are three occurrences of "applicable
>>>> associated character encoding"; I'm not sure which use you were referring
>>>> to.
>>>>
>>>
>>> Missed a word. Sorry. Meant associated character encoding. "Applicable
>>> associated" doesn't add anything. Maybe "the literal's associated
>>> encoding".
>>>
>>> That says the same thing to me. If CWG expresses a preference, I'll
>>> change it.
>>>
>> Yes it does, I'm just trying to be consistent in terminology. "Associated
>> literal encoding" is consistent and what SG16 has been using (including
>> you, maybe you came up with that term :p)
>>
>> There are multiple "associated character encodings" to choose from;
>> "applicable" indicates which one to choose. Again, if the CWG requests a
>> change, I'll change it.
>>
>> Tom.
>>
>>
>>
>>> Tom.
>>>
>>
>>
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2020/06/9469.php
>
>
>

Received on 2020-06-30 11:43:14