sg16: Re: [SG16] New draft revision: D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals)

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Tue, 30 Jun 2020 06:26:12 +0200

On Tue, 30 Jun 2020 at 06:15, Corentin Jabot <corentinjabot_at_[hidden]>
wrote:

>
>
> On Tue, Jun 30, 2020, 05:52 Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 6/28/20 2:03 AM, Corentin Jabot wrote:
>>
>>
>>
>> On Sun, 28 Jun 2020 at 07:37, Corentin Jabot <corentinjabot_at_[hidden]>
>> wrote:
>>
>>>
>>>
>>> On Sun, Jun 28, 2020, 06:50 Tom Honermann via SG16 <
>>> sg16_at_[hidden]> wrote:
>>>
>>>> A new draft revision of P2029 (Proposed resolution for core issues 411,
>>>> 1656, and 2333; numeric and universal character escapes in character and
>>>> string literals) is now available at
>>>> https://rawgit.com/sg16-unicode/sg16/master/papers/d2029r2.html. This
>>>> addresses the CWG feedback provided during the March 23rd, 2020 core
>>>> issues processing teleconference
>>>> <http://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>
>>>> .
>>>>
>>>> Wording review feedback prior to the next Core issues processing
>>>> teleconference would be much appreciated!
>>>>
>>> I really like the overall direction, a few comments:
>>> - Can we not make conditionally supported escape sequences part of the
>>> grammar?
>>>
>> This was requested by Core in the 2020-01-16 issues processing telecon
>> <https://wiki.edg.com/bin/view/Wg21prague/IssuesProcessingTeleconference2020-01-16>
>> .
>>
>>
>> What I would do:
>> simple-escape-sequence:
>> any member of the basic source character set other than u, U, x, and
>> the members of octal-digit
>>
>> And in 5.13, keep
>> Escape sequences not listed in Table 9 are conditionally supported, with
>> implementation-defined semantics
>>
>> What problem would that solve?
>>
>
> Not having separated grammar for non standard features, simpler grammar.
>
>
>>
>> I would also keep
>> An escape sequence specifies a single code unit.
>>
>> The ability for a conditional escape sequence to specify a code unit
>> sequence was discussed during the 2020-03-23 issues processing telecon
>> <https://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>.
>> Since such sequences are implementation-defined anyway, I don't know of any
>> reason to prohibit them expanding to multiple code units. For sequences
>> that specify a character, whether a single code unit is encoded or multiple
>> are should be determined by the character encoding. If we want to enforce
>> such a restriction, I think it belongs in [lex.charset]p3
>> <http://eel.is/c++draft/lex.charset#3> (I thought we already had
>> normative wording that requires members of the basic source character set
>> be encoded as a single code unit, but I don't see it now).
>>
>
> Makes sense.
>
>>
>>
>>
>>
>>
>>> - Can we not add notes for stateful encodings? It doesn't add anything.
>>>
>> Stateful encodings were discussed in the 2020-03-23 issues processing
>> telecon
>> <https://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>
>> .
>>
>
> Sure, it is still a level of detail that doesn't add anything. I would
> like to avoid people in 30 years wondering why these sentences are here.
>

Or us in a few weeks, when we realize that strings are concatenated after
they are converted, so that there could be a bunch of useless extra shift
states introduced as an artifact of the wording concatenating after
conversion, which implementations don't do, etc.
Also, reading the minutes and the tea leaves, it seems core forgot that the
source of the conversion never contains shift states, as those are removed
in phase 1.
We do want each "character" encoded separately, which is very different
from saying that we want each character to reset the shift state.
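
To illustrate the concatenation concern, here is a minimal sketch using a
made-up stateful encoding. Everything in it (the `encode` and `concat`
helpers, the use of SO/SI as shift bytes, the 7-bit masking) is a
hypothetical toy, not any real encoding or implementation. It only shows
that converting each literal separately and concatenating afterwards can
leave a redundant SI/SO pair at the seam.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical toy encoding: code points below 0x80 are emitted as-is in the
// initial state; all others are emitted as one unit (their low 7 bits) in a
// shifted state that is entered with SO and left with SI.
constexpr unsigned char SO = 0x0E;
constexpr unsigned char SI = 0x0F;

std::vector<unsigned char> encode(const std::u32string& text) {
    std::vector<unsigned char> out;
    bool shifted = false;
    for (char32_t c : text) {
        const bool needs_shift = c >= 0x80;
        if (needs_shift != shifted) {
            out.push_back(needs_shift ? SO : SI);  // switch state only when needed
            shifted = needs_shift;
        }
        out.push_back(static_cast<unsigned char>(c & 0x7F));
    }
    if (shifted) out.push_back(SI);  // always end in the initial shift state
    return out;
}

// Appends b to a copy of a, i.e. concatenation of already-converted strings.
std::vector<unsigned char> concat(std::vector<unsigned char> a,
                                  const std::vector<unsigned char>& b) {
    a.insert(a.end(), b.begin(), b.end());
    return a;
}

// True if converting first and concatenating later produces more code units
// than converting the already-joined text.
bool late_concatenation_is_longer() {
    const auto piecewise = concat(encode(U"a\u00E9"), encode(U"\u00E9b"));
    const auto joined = encode(U"a\u00E9\u00E9b");
    return piecewise.size() > joined.size();
}
```

In this toy, converting the two pieces separately gives 8 code units, while
converting the joined text gives 6; the difference is the SI/SO pair in the
middle.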

>>> - Wide multi character literals were not a thing, let's not make them one
>>> now. Same for conditional character literals and conditional wide
>>> character literals.
>>>
>>> Instead, please add text in (Z) to describe them?
>>> i.e.:
>>>
>>> - Ordinary and wide character literals consisting of a single
>>> basic-c-char, simple-escape-sequence, or universal-character-name that
>>> specifies a character that either lacks representation in the associated
>>> character encoding or that cannot be encoded as a single code unit
>>> are conditionally supported and have an implementation-defined value
>>> - A wide character literal consisting of multiple c-chars is
>>> conditionally-supported and has an implementation-defined value.
>>>
>> Giving these odd literals a name was suggested by Core. I agree with
>> their suggested direction; giving them a name makes it easier to discuss
>> and define them.
>>
>
>
> No, especially wide multi character literals, which are simply not a thing;
> let's not make them one. The reason multi character literals exist and have
> a name is that their type is different from that of character literals.
> Should I send a mail to core? Because I really do not like that direction.
> (Especially as what you call a wide multi character literal doesn't behave
> at all like a multi character literal.) We should also look at making them
> ill-formed rather than giving them a name.
>
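
A small sketch of the type difference mentioned above, assuming a compiler
that accepts multi-character literals (they are conditionally-supported, and
compilers commonly warn about them). The variable names are illustrative
only. A wide literal such as L'ab' is deliberately left out: as far as I can
tell its type is wchar_t, the same as L'a', which is the point being made.

```cpp
#include <type_traits>

// 'a' is an ordinary character literal: its type is char.
constexpr bool single_is_char = std::is_same<decltype('a'), char>::value;

// 'ab' is a multi-character literal: where supported, its type is int and
// its value is implementation-defined.
constexpr bool multi_is_int = std::is_same<decltype('ab'), int>::value;

static_assert(single_is_char, "ordinary character literal has type char");
static_assert(multi_is_int, "multi-character literal has type int");
```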
>>
>>>
>>> Please change
>>> The sequence of characters denoted by each contiguous sequence of
>>> basic-s-chars, r-chars, simple-escape-sequences ([lex.ccon]), and
>>> universal-character-names ([lex.charset]) is encoded to a code unit sequence
>>> To
>>> Each basic-s-char, r-char, simple-escape-sequence ([lex.ccon]), and
>>> universal-character-name ([lex.charset]) is encoded to a code unit sequence
>>>
>> The intent is to make it clear that these sequences are encoded as a
>> group. This is necessary for stateful encodings with SI/SO characters
>> since such characters don't necessarily contribute a code unit sequence on
>> their own. This was also requested during the 2020-03-23 issues
>> processing telecon
>> <https://wiki.edg.com/bin/view/Wg21summer2020/IssuesProcessingTeleconference2020-03-23>
>> .
>>
>
> The effect is that I can encode things like e,U+0301 as a single code
> unit, which at the very least should not be allowed in a wording change.
> It's also a terrible reason, as c-chars and UCNs are Unicode characters at
> this point and cannot correspond to a stateful character as the source of
> the conversion. Since the thing they are converted to is an
> implementation-defined sequence of code units, the possibility of a state
> shift is implied.
>
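
For reference, a sketch of what "each character encoded separately" looks
like for UTF-8 literals, where the encoding is fixed. The constant names are
made up for the example. Ordinary and wide literals use an
implementation-defined encoding, so this says nothing definitive about them.

```cpp
#include <cstddef>

// "e" followed by U+0301 (combining acute accent): 1 + 2 UTF-8 code units.
// The pair is not folded into the precomposed U+00E9.
constexpr std::size_t decomposed_units = sizeof(u8"e\u0301") - 1;

// Precomposed U+00E9 on its own: 2 UTF-8 code units.
constexpr std::size_t precomposed_units = sizeof(u8"\u00E9") - 1;

static_assert(decomposed_units == 3, "e and U+0301 are encoded separately");
static_assert(precomposed_units == 2, "U+00E9 is a two-unit sequence");
```

If a contiguous sequence were encoded as a group instead, nothing in that
wording would obviously stop an implementation from producing the shorter
form for an ordinary literal.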
>>
>>>
>>>
>>> - please replace applicable character encoding by character encoding
>>>
>> That doesn't seem correct to me; the wording needs to indicate which
>> character encoding. Note that there are three occurrences of "applicable
>> associated character encoding"; I'm not sure which use you were referring
>> to.
>>
>
> Missed a word, sorry. I meant "associated character encoding"; "applicable
> associated" doesn't add anything. Maybe "the literal's associated
> encoding".
>
>>> - not sure replacing `\0` by null character is an improvement
>>>
>> It avoids a correction to state something like, "a '\0', L'\0', u8'\0',
>> u'\0', or U'\0' is appended ...". [lex.charset]p3
>> <http://eel.is/c++draft/lex.charset#3> defines *null character* (though
>> the definition there isn't perfect either, I think it is an improvement).
>>
> Good point
>
> Tom.
>
>>
>>>
>>> Corentin
>>>
>>> Tom.
>>>> --
>>>> SG16 mailing list
>>>> SG16_at_[hidden]
>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>
>>>
>>

Received on 2020-06-29 23:29:37