sg16: Re: [SG16] [isocpp-core] Updated draft revision: D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals)

From: Tom Honermann <tom_at_[hidden]>
Date: Sat, 18 Jul 2020 02:48:35 -0400

On 7/15/20 3:21 AM, Richard Smith wrote:
> On Tue, 14 Jul 2020, 22:56 Tom Honermann, <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
>
> On 7/14/20 3:23 AM, Richard Smith wrote:
>> On Mon, Jul 13, 2020 at 9:03 PM Tom Honermann via Core
>> <core_at_[hidden] <mailto:core_at_[hidden]>> wrote:
>>
>> On 7/8/20 1:54 PM, Tom Honermann wrote:
>>> On 7/8/20 6:43 AM, Alisdair Meredith wrote:
>>>> Minor nit: I dislike normatively stating that a null character is
>>>> appended after string concatenation in two places. I do like
>>>> the addition of this directly to the phase 6 wording, so suggest
>>>> that the original in [lex.string]p12 with its extra flowery language
>>>> be demoted to a note.
>>> That seems reasonable to me, I'll do so.
>>
>> After looking at this again, I elected to go in a different
>> direction.
>>
>> [lex.phases] describes at a high level what is to be done in
>> each phase and more-or-less defers to other sections for
>> elaboration. From this lens, changing the normative text in
>> [lex.string] into a note felt like the wrong direction.
>> Instead, I chose to update the wording in [lex.string] to
>> read a little nicer and to omit the flowery language. I then
>> updated [lex.phases] to be less precise and to explicitly
>> direct the reader to [lex.string] for details. I hope this
>> acceptably satisfies the (very reasonable) concern about the
>> previous normative duplication.
>>
>> This paper has now been submitted for the upcoming mailing
>> and can be found at
>> https://isocpp.org/files/papers/P2029R2.html. The previous
>> links to the draft will no longer work.
>>
>> Apologies for not looking through this earlier.
> No problem, thank you for the feedback!
>>
>> """
>> conditional-escape-sequence-char:
>> any member of the basic source character set other than u,
>> U, x, and the members of octal-digit and simple-escape-sequence-char
>> """
>>
>> I don't like talking about "members of" grammar productions. How
>> about:
>>
>> any member of the basic source character set that is not an
>> /octal-digit/, a /simple-escape-sequence-char/, or u, U, or x
> Ah, yes, that is better. I updated the paper and this change will
> be included in the mailing. Preview at
> https://isocpp.org/files/papers/P2029R2.html.
>>
>> 5.13.3/Z.2.1:
>> """
>> — If v does not exceed the range of the character-literal's type,
>> then the value is v.
>> """
>>
>> What does "the range of the character-literal's type" mean? Do
>> you mean the range of representable values? Or do you mean
>> [0,0xFFFF] for char16_t and [0,0x10FFFF] for char32_t?
> I meant the range of representable values. The general thinking
> is that numeric escape sequences are allowed to encode code unit
> values that are not valid according to the associated character
> encoding. Therefore, such sequences can encode 0xFF for UTF-8
> literals, 0xFFFF for UTF-8 literals (if the underlying type of
> char8_t is greater than 8 bits), 0x1FFFF for UTF-16 literals (if
> the underlying type of char16_t is greater than 16 bits), etc...
> While we could place more restrictions here, doing so would curb
> freedoms without, in my opinion, offering significant assurances
> of well-formed code unit sequences.
>
>
> Thanks. For what it's worth, I think "range of representable values"
> is the right rule to use. (But the wording should use that term.)
Agreed that the wording should use that term. Updated in a new draft
revision: https://rawgit.com/sg16-unicode/sg16/master/papers/d2029r3.html.
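
As a small illustration of the "range of representable values" reading (a sketch only, not wording from the paper): a numeric escape may name a code unit that is not valid in the literal's encoding, such as a lone UTF-16 surrogate, provided the value fits in the literal's type. The checks below assume a C++11 or later compiler, and the second one additionally assumes that char16_t is exactly 16 bits wide.

```cpp
#include <limits>

// A hex escape denotes a code unit value, not a code point, so a lone
// surrogate is accepted in a char16_t literal even though it is not a
// well-formed UTF-16 sequence on its own.
static_assert(u'\xD800' == 0xD800, "lone surrogate via numeric escape");

// The largest value a 16-bit char16_t can represent is also expressible.
// Assumption: char16_t has exactly 16 value bits on this implementation.
static_assert(u'\xFFFF' == std::numeric_limits<char16_t>::max(),
              "assumes char16_t is exactly 16 bits wide");
```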
>
>> 5.13.3/Z.2.2:
>> """
>> — Otherwise, if the character-literal's encoding-prefix is absent
>> or L, then the value is implementation-defined.
>> """
>>
>> I appreciate that your wording reflects the behavior of the prior
>> wording, but while we're here: do we really want '\xff' to have an
>> implementation-defined value rather than being required to be
>> (char)0xff (assuming 'char' is signed and 8-bit)? Now we
>> guarantee 2s complement, perhaps we should just say you always
>> get the result of converting the given value to char / wchar_t?
>> (Similarly in 5.13.5/Z.2.)
>
> That seems reasonable, and I believe matches existing practice,
> but I'm not sure how to word it. Would we address cases like
> '\xFFFF' (with the same sign/size assumptions) explicitly? I
> don't think we can defer to the integral conversion rules since
> the source value doesn't have a specific type (the wording states
> "an integer value v"). Perhaps we could steal the "type that is
> congruent to the source integer modulo 2^N" wording?
>
> """
> — Otherwise, if the character-literal's encoding-prefix is absent
> or L, then the value is the unique value of the
> /character-literal/'s type t that is congruent to v modulo 2^N,
> where N is the width of t.
> """
>
> Yes, that or something like it seems quite reasonable to me.
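
For concreteness, the "congruent modulo 2^N" rule suggested above could be sketched roughly as follows. The helper name congruent_value is hypothetical and is not part of the paper or of any library. It takes the escape's value as an unsigned long long and avoids relying on an out-of-range signed conversion, so it behaves the same before and after the two's complement guarantee.

```cpp
#include <limits>
#include <type_traits>

// Hypothetical helper: returns the unique value of integer type T that is
// congruent to v modulo 2^N, where N is the width of T.
template <typename T>
T congruent_value(unsigned long long v) {
    typedef typename std::make_unsigned<T>::type U;
    // Conversion to an unsigned type is defined as reduction modulo 2^N.
    const U low = static_cast<U>(v);
    const U max_t = static_cast<U>(std::numeric_limits<T>::max());
    if (low <= max_t) {
        return static_cast<T>(low);  // already representable in T
    }
    // Only reachable for signed T: low lies in (max, 2^N). Compute
    // low - 2^N as (low - max - 1) + min so no step leaves T's range.
    return static_cast<T>(static_cast<T>(low - max_t - 1) +
                          std::numeric_limits<T>::min());
}
```

Under this sketch, congruent_value<signed char>(0xFF) is -1 and congruent_value<signed char>(0x100) is 0 on an implementation with 8-bit char, which is what a '\x100' literal would mean if the modulo rule were adopted.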

I looked into this and found that gcc 10.1, Clang 10, Visual C++ 19.24,
and icc 19 all accept '\xff' and produce a value of -1 as expected. For
'\x100', however, gcc and icc emit a warning, while Clang and Visual C++
reject it: https://www.godbolt.org/z/6qa1b7. That leads me to believe this
should be considered more of an evolutionary change and addressed in a
different paper.
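
A minimal way to reproduce the '\xff' half of that observation locally is sketched below. The function names are made up for illustration, and the expected value assumes an 8-bit char. '\x100' is deliberately left out because, as noted, some compilers reject it outright.

```cpp
#include <limits>

// Value of '\xff' as produced by the compiler in use.
inline int xff_as_int() { return '\xff'; }

// What the surveyed compilers yield: -1 where plain char is signed,
// 255 where it is unsigned. Assumption: char is 8 bits wide.
inline int expected_xff() {
    return std::numeric_limits<char>::is_signed ? -1 : 255;
}
```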

Tom.

> Tom.
>
>>
>> Thanks!
>>
>> Tom.
>>
>>>> In the normative text, AFAICT, in C++20 wide multi character
>>>> literals must be supported, with an implementation-defined value,
>>>> but after this paper they will be conditionally supported. I don’t
>>>> see that design change addressed in the front matter. Same
>>>> applies to non-encodable wide characters.
>>>
>>> That is addressed in the "Proposed resolution overview"
>>> section. I can add a statement about this to the
>>> introduction if you like.
>>>
>>> I've been under the impression that the lack of
>>> conditionally-supported for these is an oversight. My
>>> understanding (and someone please correct me if I'm
>>> mistaken; I don't recall where I was informed of this) is
>>> that, in the C standard, implementation-defined includes an
>>> allowance for rejecting the code as ill-formed, but in the
>>> C++ standard, implementation-defined implies well-formed;
>>> hence the addition of conditionally-supported. If that
>>> understanding is correct, then the updated wording corrects
>>> alignment with the intent of the C standard.
>>>
>>>> (I thought this also applied to ordinary multi character literals,
>>>> but it turns out they are already conditionally supported.)
>>>
>>> Yup, in [lex.ccon]p1
>>> <http://eel.is/c++draft/lex.ccon#1.sentence-4>.
>>>
>>> Tom.
>>>
>>>> AlisdairM
>>>>
>>>>> On Jul 7, 2020, at 16:33, Tom Honermann via Core <core_at_[hidden] <mailto:core_at_[hidden]>> wrote:
>>>>>
>>>>> An update of D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals) is now available at https://rawgit.com/sg16-unicode/sg16/master/papers/d2029r2.html. This addresses the feedback provided on the core mailing list in the thread starting at https://lists.isocpp.org/core/2020/06/9455.php.
>>>>>
>>>>> Wording review feedback prior to the next Core issues processing teleconference would be much appreciated!
>>>>>
>>>>> Tom.
>>>>>
>>>>> _______________________________________________
>>>>> Core mailing list
>>>>> Core_at_[hidden] <mailto:Core_at_[hidden]>
>>>>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
>>>>> Link to this post: http://lists.isocpp.org/core/2020/07/9545.php
>>>
>>>
>>
>> _______________________________________________
>> Core mailing list
>> Core_at_[hidden] <mailto:Core_at_[hidden]>
>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
>> Link to this post: http://lists.isocpp.org/core/2020/07/9570.php
>>
>

Received on 2020-07-18 01:51:59