ISOCPP sg16 List: Re: Agenda for the 2024-04-24 SG16 meeting

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 19 Apr 2024 18:13:54 -0400

On 4/19/24 10:05 AM, Daveed Vandevoorde via SG16 wrote:
>
>
>> On Apr 19, 2024, at 9:52 AM, Jens Maurer <jens.maurer_at_[hidden]> wrote:
>>
>>
>>
>> On 19/04/2024 15.09, Daveed Vandevoorde wrote:
>>>
>>>
>>>> On Apr 19, 2024, at 2:15 AM, Jens Maurer <jens.maurer_at_[hidden]> wrote:
>>>>
>>>>
>>>>
>>>> On 19/04/2024 03.37, Daveed Vandevoorde via SG16 wrote:
>>>>>
>>>>>
>>>>>> On Apr 18, 2024, at 7:21 PM, Tom Honermann <tom_at_[hidden]> wrote:
>>>>>> ... The contents of the string_view consist of characters of
>>>>>> the basic source character set only (an implementation can map
>>>>>> other characters using universal character names).
>>>>>>
>>>>>
>>>>> Right. You mentioned in Tokyo that that doesn’t work. Can you
>>>>> elaborate on what the technical hiccup is?

I ought to know better than to use the words "doesn't work" by now. They
communicate nothing. The phrase simply doesn't work!

The thoughts Jens has already shared align well with my own, but I'll
elaborate further below.

>>>>
>>>> Universal-character-names are interpreted when transitioning
>>>> the lexical representation of a string-literal into an object
>>>> containing code units [lex.string], or when lexing tokens
>>>> outside of string-literals in translation phase 3 [lex.phases].
>>>>
>>>> Nothing will interpret universal-character-names (as such) in
>>>> a string_view, because it is already assumed to be a range of
>>>> code units.
>>>
>>> Right. But that’s mostly a UI issue, no? There is nothing that
>>> makes it “not work”. Only that if you want to get corresponding code
>>> units, some work will be needed on the consumer side.
>>> (Interestingly, the compiler already knows how to do the work for
>>> round-tripping support.)
>>
>> I thought you wanted a "make_class_definition(string_view)" function.
>>
>> Arguably, a string_view pointing to the code units "abc\u1234"
>> (no interpretation of the universal-character-name) should produce
>> an ill-formed class name (because a backslash is not allowed in
>> a class name).
>
> Thanks for bringing that example up.
>
> My point of view is that that should be valid if abc\u1234 is valid in
> source code as a class name.

The reflection facilities will obviously need to honor the identifier
syntax as adopted via P1949R7 (C++ Identifier Syntax using Unicode
Standard Annex 31) <https://wg21.link/p1949>.

For anyone wanting to play around or write tests, the set of characters
that are valid as an initial identifier character can be viewed here
<https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AXID_start%3A%5D_&g=&i=>.
Valid continuation characters are here
<https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AXID_continue%3A%5D&g=&i=>.

Here is a fun identifier that contains both combining characters and
right-to-left characters: "abcאב︠cba" (and no, the combination makes no
sense, but it does constitute a valid identifier).

>
>
>> If my literal encoding is UTF-8, "make_class_definition" should
>> support a use like this: make_class_definition("abc\u1234"), where
>> the universal-character-name is interpreted by the string-literal-to-
>> object conversion. Does that mean we get a double escape
>> interpretation?
>
> Yes.
>
>>
>> Consider:
>> make_class_definition("abc\\u1234")
>>
>> The string-literal-to-object conversion yields the code unit
>> sequence "abc\u1234" and then the consumer (i.e. make_class_definition)
>> interprets universal-character-names once more, and we get an
>> (ostensibly valid) class-name (assuming \u1234 is a valid identifier
>> character, which I don't know right now).
>>
>> That double interpretation feels surprising and wrong.
>
> Can you identify where that feeling comes from? It seems just right
> to me.

To my knowledge, there is nowhere else in the C or C++ standards where
such double escape interpretation occurs.
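
To make the double interpretation concrete, here is a minimal sketch.
interpret_ucns() is a hypothetical stand-in for whatever pass a consumer
such as make_class_definition() would have to run; it is not a proposed
API. It assumes a UTF-8 result and handles only the four-digit \uXXXX
form.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: value of a hex digit, or -1 if c is not one.
constexpr int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical consumer-side pass: replaces each \uXXXX found in the code
// units of `in` with the UTF-8 encoding of that code point. \UXXXXXXXX and
// surrogate checks are omitted to keep the sketch short.
std::string interpret_ucns(std::string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size();) {
        if (in[i] == '\\' && i + 6 <= in.size() && in[i + 1] == 'u') {
            unsigned cp = 0;
            bool ok = true;
            for (std::size_t k = 2; k < 6; ++k) {
                int h = hex_value(in[i + k]);
                if (h < 0) { ok = false; break; }
                cp = cp * 16 + static_cast<unsigned>(h);
            }
            if (ok) {
                if (cp < 0x80) {
                    out += static_cast<char>(cp);
                } else if (cp < 0x800) {
                    out += static_cast<char>(0xC0 | (cp >> 6));
                    out += static_cast<char>(0x80 | (cp & 0x3F));
                } else {
                    out += static_cast<char>(0xE0 | (cp >> 12));
                    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                    out += static_cast<char>(0x80 | (cp & 0x3F));
                }
                i += 6;
                continue;
            }
        }
        out += in[i++];  // not a \uXXXX sequence: copy the code unit as-is
    }
    return out;
}
```

The literal "abc\\u1234" has already had its escape sequence interpreted
once by the compiler, leaving the nine code units abc\u1234. Passing
those to interpret_ucns() interprets them a second time and yields "abc"
followed by the UTF-8 encoding of U+1234.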

It also creates potentially confusing (though not actually ambiguous)
situations for encodings like Shift-JIS that may include double-byte
character sequences in which one of the code units has the same value
as the encoding of '\' (ASCII 0x5C) but does not actually encode a '\'
character. For these encodings, correctly identifying an escape sequence
requires decoding; it isn't safe to just scan for a 0x5C code unit.
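
A rough sketch of the difference, assuming the common Shift-JIS lead byte
ranges 0x81-0x9F and 0xE0-0xFC; the function names are made up for
illustration. The byte pair 0x83 0x5C encodes the katakana character ソ,
so its second code unit looks like a '\' to a naive scan.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Assumption: lead bytes of double-byte characters in Shift-JIS.
constexpr bool is_sjis_lead(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Naive scan: reports the index of every 0x5C code unit.
std::vector<std::size_t> find_5c_naive(std::string_view s) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (static_cast<unsigned char>(s[i]) == 0x5C) hits.push_back(i);
    return hits;
}

// Decoding-aware scan: skips the trail byte of each double-byte character,
// so only single-byte 0x5C code units are reported.
std::vector<std::size_t> find_5c_decoded(std::string_view s) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (is_sjis_lead(b)) {
            ++i;  // step over the trail byte, whatever its value
            continue;
        }
        if (b == 0x5C) hits.push_back(i);
    }
    return hits;
}
```

For the code units 0x83 0x5C 'a' 0x5C 'u', the naive scan reports indexes
1 and 3, while the decoding-aware scan reports only index 3.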

From a user perspective, if the ordinary literal encoding is UTF-8, I
would expect to be able to pass, for example, "һèḷḷỏ" to
make_class_definition() and have the name correctly interpreted; the
user shouldn't be required to pass "\\u04BB\\u00E8\\u1E37\\u1E37\\u1ECF".

Ideally, as a portable syntax that is agnostic to both the source and
ordinary literal encodings, we would support
"\u04BB\u00E8\u1E37\u1E37\u1ECF" as a way to refer to an otherwise
unutterable identifier. That would be somewhere between really difficult
and impossible to do, though, since constant evaluation occurs after
string literals are initialized with their source strings converted to
the ordinary literal encoding (which yields an ill-formed program if any
of the characters are not representable;
[lex.string]p(10.1) <http://eel.is/c++draft/lex.string#10.1>).
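
A small illustration of that ordering, assuming an ordinary literal
encoding in which U+00E8 is representable (UTF-8, for example);
has_backslash() is a hypothetical helper. By the time a constant
expression sees the literal, the universal-character-name has already
been replaced by encoded code units, so its spelling can no longer be
recovered.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical helper: true if any code unit equals '\'.
constexpr bool has_backslash(std::string_view s) {
    return s.find('\\') != std::string_view::npos;
}

// The universal-character-name is interpreted when the literal is
// converted: the object holds the encoded character, not the spelling.
constexpr std::string_view converted = "h\u00E8llo";

// Escaped backslash: the object holds the spelling \u00E8 as six
// separate characters.
constexpr std::string_view spelled = "h\\u00E8llo";

static_assert(!has_backslash(converted));
static_assert(has_backslash(spelled));
static_assert(spelled.size() == 10);
```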

Tom.

>
> Daveed
>
>>
>>> Note that I’m not arguing that producing basic-source-character
>>> names/text is my preferred approach. I just want to understand if
>>> it has inherent implementation/semantic difficulties as a fallback
>>> if no other approach can be made to adequately work.
>>
>> Ditto.
>>
>> Jens
>
>

Received on 2024-04-19 22:13:56