sg16: Re: [SG16] New draft revision: D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals)

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 30 Jun 2020 10:32:42 -0400

On 6/30/20 1:31 AM, Corentin Jabot wrote:
>
>
> On Tue, 30 Jun 2020 at 06:49, Tom Honermann <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
>
> On 6/30/20 12:15 AM, Corentin Jabot wrote:
>>
>>
>> On Tue, Jun 30, 2020, 05:52 Tom Honermann <tom_at_[hidden]
>> <mailto:tom_at_[hidden]>> wrote:
>>
>> On 6/28/20 2:03 AM, Corentin Jabot wrote:
>>>
>>>
>>> On Sun, 28 Jun 2020 at 07:37, Corentin Jabot
>>> <corentinjabot_at_[hidden] <mailto:corentinjabot_at_[hidden]>>
>>> wrote:
>>>
>>>
>>>
>>> On Sun, Jun 28, 2020, 06:50 Tom Honermann via SG16
>>> <sg16_at_[hidden] <mailto:sg16_at_[hidden]>>
>>> wrote:
>>>
>>> A new draft revision of P2029 (Proposed resolution
>>> for core issues 411, 1656, and 2333; numeric and
>>> universal character escapes in character and string
>>> literals) is now available at
>>> https://rawgit.com/sg16-unicode/sg16/master/papers/d2029r2.html.
>>> This addresses the CWG feedback provided during the
>>> March 23rd, 2020 core issues processing
>>> teleconference
>>> <http://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>.
>>>
>>> Wording review feedback prior to the next Core
>>> issues processing teleconference would be much
>>> appreciated!
>>>
>>> I really like the overall direction, a few comments:
>>> - Can we not make conditionally supported escape
>>> sequences part of the grammar?
>>>
>> This was requested by Core in the 2020-01-16 issues
>> processing telecon
>> <https://wiki.edg.com/bin/view/Wg21prague/IssuesProcessingTeleconference2020-01-16>.
>>>
>>> What I would do:
>>> simple-escape-sequence:
>>> any member of the basic source character set other than
>>> u, U, x, and the members of octal-digit
>>>
>>> And in 5.13, keep
>>> Escape sequences not listed in Table 9 are conditionally
>>> supported, with implementation-defined semantics
>> What problem would that solve?
>>
>>
>> Not having separated grammar for non standard features, simpler
>> grammar.
> I prefer the current approach in the paper, but I have no
> objection to doing what you suggest if the CWG expresses such a
> preference.
>>
>>
>>> - Can we not add notes for stateful encodings? It
>>> doesn't add anything.
>>>
>> Stateful encodings were discussed in the 2020-03-23 issues
>> processing telecon
>> <https://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>.
>>
>>
>>
>> Sure, it is still a level of detail that doesn't add anything. I
>> would like to avoid people in 30 years wondering why these
>> sentences are here.
> Stateful encodings are still a thing. They may still be a thing
> in 30 years.
>
> I am not saying they aren't and wouldn't be, I am saying that the
> current wording was enough for that to be implemented correctly while
> the new wording does not.
I'm not following. What do you believe the new wording changes?
Discussion of stateful encodings is limited to non-normative notes.
>
>>> -- Wide multi character literals were not a thing, let's
>>> not make them one now. same for conditional character
>>> literals and conditional wide character literals.
>>>
>>> Instead, please add text in (Z) to describe them?
>>> ie:
>>>
>>> - ordinary and wide character literals consisting of a
>>> single basic-c-char, simple-escape-sequence, or
>>> universal-character-name that specifies a character that
>>> either lacks representation in the associated character
>>> encoding or that cannot be encoded as a single code unit
>>> are conditionally supported and have an
>>> implementation-defined value
>>> - A wide character literal consisting of multiple
>>> c-chars is conditionally-supported and has an
>>> implementation-defined value.
>>>
>> Giving these odd literals a name was suggested by Core. I
>> agree with their suggested direction; giving them a name
>> makes it easier to discuss and define them.
>>
>>
>>
>> No, especially wide multi characters that are simply not a thing,
>> let's not make them one. The reason multi character literals
>> exist and have a name is because their type is different from
>> character literals.
> They are a thing in C (see WG14 N2176 (the final draft WP before
> C18) 6.4.4.4, "Character constants", p11). I believe their
> omission in C++ is just an oversight. Compilers support them. I
> think they are a thing and giving them a name is useful.
>
> They don't have a name in C either
I don't see how giving them a name is in any way detrimental.
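
For readers following along, here is a minimal sketch of the type
difference mentioned above. It assumes a compiler that accepts ordinary
multicharacter literals (they are conditionally-supported, and many
compilers warn about them). The names single_is_char and multi_is_int
are illustrative only. The wide form is left out because compiler
support for it varies.

```cpp
#include <type_traits>

// An ordinary character literal with a single c-char has type char.
constexpr bool single_is_char =
    std::is_same<decltype('a'), char>::value;

// An ordinary multicharacter literal is conditionally-supported and has
// type int; its value is implementation-defined, so only the type is
// checked here.
constexpr bool multi_is_int =
    std::is_same<decltype('ab'), int>::value;

static_assert(single_is_char, "'a' is expected to have type char");
static_assert(multi_is_int, "'ab' is expected to have type int");
```

Only the types are checked because the value of 'ab' differs between
implementations.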
>
>> Should I send a mail to core? Because I really do not like that
>> direction. (Especially as what you call wide multi character
>> literal doesn't behave at all as multi character literals). We
>> should also look at making them ill formed rather than giving
>> them a name
>
> Arguably, you have already sent that mail to Core :)
>
> Haha indeed, nice :)
>
> I don't know what behavioral difference you are concerned about.
> The primary reason for differentiating them is to allow the
> multicharacter case to be ill-formed (conditionally-supported)
> and/or to have an encoding that differs from single c-char literals.
>
> I think the standard should reflect existing practice. These odd
> literals are supported in common compilers. If you would like to
> make them ill-formed, you are certainly free to write a paper, but
> implementations are already free to make them ill-formed and I
> suspect the ones that don't would retain support for them as an
> extension anyway.
>
> I am very concerned about giving names to anti-features that didn't
> have a name for the past 30 years, especially those that are not used
> and were previously not a thing in C++ (I guess we disagree on our
> reading of the C standard). I am not concerned about behavior changes.
> Describing them in a bullet point, rather than in this table, keeps the
> table readable and meaningful and leaves us with the following features:
>
> * ordinary/wide/utf character literal
> * multi character literal
>
> Which is then mostly symmetric with the table for strings.
> The bullet point then describes these odd behaviors, which again is
> not a behavior change. We are just talking about naming and presentation.
>
> (multi character literal does need a name, both because it always had
> one, and because it has a different type; it is also somewhat used)
Whether you consider these literal kinds anti-features is not relevant.
Again, I don't see how giving them names is a bad thing; it doesn't
legitimize them any more than describing them in a bullet list does. A
name can acquire both positive and negative connotations. The current
direction in the paper reflects CWG guidance. If you would like to
participate in the next CWG review of this paper and argue your
position, please feel free. I don't intend to argue it further in this
email thread.
>
>>>
>>>
>>> Please change
>>> The sequence of characters denoted by each contiguous
>>> sequence of basic-s-chars, r-chars,
>>> simple-escape-sequences ([lex.ccon]), and
>>> universal-character-names ([lex.charset]) is encoded to
>>> a code unit sequence
>>> To
>>> Each basic-s-chars, r-chars, simple-escape-sequences
>>> ([lex.ccon]), and universal-character-names
>>> ([lex.charset]) is encoded to a code unit sequence
>>>
>> The intent is to make it clear that these sequences are
>> encoded as a group. This is necessary for stateful encodings
>> with SI/SO characters since such characters don't necessarily
>> contribute a code unit sequence on their own. This was also
>> requested during the 2020-03-23 issues processing telecon
>> <https://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>.
>>
>>
>> The effect is that I can encode things like e,U+0301 as a single
>> code unit, which at the very least should not be allowed in a
>> wording change.
> Please read the wording again. I don't think it states that. If
> you still think it does, please elaborate in detail.
>
>
> You use the term character (which in this context is a synonym of
> abstract character)
> The sequence of *characters* denoted by *each contiguous sequence *of
> basic-s-chars, r-chars, simple-escape-sequences ([lex.ccon]), and
> universal-character-names ([lex.charset]) is encoded to a code unit
> sequence using the string-literal's associated character encoding. If
> a *character* lacks [...]
>
> Maybe : Each codepoint denoted by a single basic-s-chars, r-chars,
> simple-escape-sequences ([lex.ccon]), and universal-character-names is
> encoded to a code unit sequence using the string-literal's associated
> character encoding. If that codepoint lacks representation in the
> associated character encoding,
>
> Note that codepoint isn't particularly meaningful in this context; it
> could be "element", for example. The point is the sequence is not
> converted as a whole.
> Changing that is design (I don't have a terribly strong opinion
> either way, but it needs to be discussed outside of core, notably
> because it would allow implementations to handle combining characters
> differently).
Again, that wording reflects prior CWG guidance. Since the (wide)
execution encoding is implementation-defined, I don't agree that this
reflects a design change, but I defer to our benevolent CWG chair to
make that determination as he sees fit.
>
>> It's also a terrible reason as c-char and UCNs are Unicode
>> characters at this point and cannot correspond to a stateful
>> character as the source of the conversion. The thing they are
>> converted to being an implementation-defined sequence of code
>> units, the possibility of a state shift is implied.
>
> What are you referring to as a "terrible reason"?
>
> That
> > The intent is to make it clear that these sequences are encoded as a
> group. This is necessary for stateful encodings with SI/SO characters
> since such characters don't necessarily contribute a code unit
> sequence on their own
>
> Either:
> - These characters appear as ucn and they should in fact contribute
> to a code unit sequence
> - They are used as part of a stateful source encoding and would not
> have been preserved past phase 1.
In discussion elsewhere, we've discussed the example of a decomposed 'é'
and noted that at least some compilers will encode the 'e' and combining
acute separately, perhaps substituting a character for the combining
acute if the execution character set doesn't support combining
characters as distinct characters. Unless someone has proven otherwise,
I believe it would be conforming for an implementation to convert the
decomposed code point sequence to a composed character that is
representable in the execution encoding (e.g., ISO-8859-1; noting that
such normalization is not desirable for Unicode encodings). I see no
reason to be more restrictive here.
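
As a rough illustration of the decomposed 'é' case, the sketch below
uses u8 literals so that the result does not depend on the
implementation-defined execution encoding. It assumes the compiler
encodes each code point separately and does not normalize, which is my
understanding of common implementations. The helper names
code_unit_count and code_unit_at are hypothetical.

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminator.
template <typename CharT, std::size_t N>
constexpr std::size_t code_unit_count(const CharT (&)[N]) {
    return N - 1;
}

// Code unit at index i of a narrow or UTF-8 literal, as an unsigned value.
template <typename CharT, std::size_t N>
constexpr unsigned code_unit_at(const CharT (&s)[N], std::size_t i) {
    return static_cast<unsigned char>(s[i]);
}

// Decomposed: U+0065 'e' followed by U+0301 COMBINING ACUTE ACCENT.
// UTF-8 encodes these as 0x65 and 0xCC 0x81, three code units in total.
static_assert(code_unit_count(u8"e\u0301") == 3, "decomposed form");
static_assert(code_unit_at(u8"e\u0301", 0) == 0x65, "'e'");
static_assert(code_unit_at(u8"e\u0301", 1) == 0xCC, "U+0301 lead byte");
static_assert(code_unit_at(u8"e\u0301", 2) == 0x81, "U+0301 trail byte");

// Precomposed: U+00E9 is encoded in UTF-8 as 0xC3 0xA9, two code units.
static_assert(code_unit_count(u8"\u00E9") == 2, "precomposed form");
```

The same check against an ordinary literal such as "e\u0301" is not
portable, since its encoding is implementation-defined.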
>
> SI/SO characters exist in Unicode and can therefore be represented
> as UCNs. In translation phase 5, an implementation can treat them
> as part of a shift sequence when converting to the execution encoding.
>
> Again that is a design change
Per prior comments, I disagree and trust that the CWG chair will make an
appropriate determination about this.
>
>>>
>>>
>>>
>>> - please replace applicable character encoding
>>> by character encoding
>>>
>> That doesn't seem correct to me; the wording needs to
>> indicate which character encoding. Note that there are three
>> occurrences of "applicable associated character encoding";
>> I'm not sure which use you were referring to.
>>
>>
>> Missed a word. Sorry. Meant associated character encoding.
>> "Applicable associated" doesn't add anything. Maybe the "the
>> literal associated encoding"
>
> That says the same thing to me. If CWG expresses a preference,
> I'll change it.
>
> Yes it does, just trying to be consistent in terminology. Associated
> literal encoding is consistent and what SG16 has been using (including
> you, maybe you came up with that term :p)

There are multiple "associated character encodings" to choose from;
"applicable" indicates which one to choose. Again, if the CWG requests
a change, I'll change it.
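
To make the "multiple encodings" point concrete, here is a small sketch
showing one code point producing a different number of code units
depending on the encoding associated with the literal's prefix. Only the
UTF-8, UTF-16, and UTF-32 literal kinds are used, since the ordinary and
wide encodings are implementation-defined. The helper name units is
hypothetical.

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminator.
template <typename CharT, std::size_t N>
constexpr std::size_t units(const CharT (&)[N]) {
    return N - 1;
}

// U+1F600 under each Unicode literal encoding:
static_assert(units(u8"\U0001F600") == 4, "UTF-8: four code units");
static_assert(units(u"\U0001F600") == 2, "UTF-16: a surrogate pair");
static_assert(units(U"\U0001F600") == 1, "UTF-32: one code unit");
```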

Tom.


Received on 2020-06-30 09:36:03