Re: Agenda for the 2024-04-24 SG16 meeting

From: Jens Maurer <jens.maurer_at_[hidden]>
Date: Fri, 19 Apr 2024 16:45:37 +0200
On 19/04/2024 16.05, Daveed Vandevoorde wrote:
>
>
>> On Apr 19, 2024, at 9:52 AM, Jens Maurer <jens.maurer_at_[hidden]> wrote:

>> I thought you wanted a "make_class_definition(string_view)" function.
>>
>> Arguably, a string_view pointing to the code units "abc\u1234"
>> (no interpretation of the universal-character-name) should produce
>> an ill-formed class name (because a backslash is not allowed in
>> a class name).
>
> Thanks for bringing that example up.
>
> My point of view is that that should be valid if abc\u1234 is valid in source code as a class name.
>
>
>> If my literal encoding is UTF-8, "make_class_definition" should
>> support a use like this: make_class_definition("abc\u1234"), where
>> the universal-character-name is interpreted by the string-literal-to-
>> object conversion. Does that mean we get a double escape interpretation?
>
> Yes.
>
>>
>> Consider:
>> make_class_definition("abc\\u1234")
>>
>> The string-literal-to-object conversion yields the code unit
>> sequence "abc\u1234" and then the consumer (i.e. make_class_definition)
>> interprets universal-character-names once more, and we get an
>> (ostensibly valid) class-name (assuming \u1234 is a valid identifier
>> character, which I don't know right now).
>>
>> That double interpretation feels surprising and wrong.
>
> Can you identify where that feeling comes from? It seems just right to me.

The interpretation of universal-character-names is a source code transformation
(for tokens outside of string-literals, it's in phase 3, even before macros
are expanded).

Other than by line splicing (phase 2), we never create a character sequence
that is a universal-character-name and have it interpreted as such by the
processing in the phases of translation. [lex.string] p8 makes an extra
effort not to form universal-character-names by string-literal concatenation,
for example. With the double interpretation, we now can:

make_class_definition("abc\u005c" "u" "1234") // class-name is abc\u1234

I don't want phase 7 evaluation (including constant evaluation) to ever
"see" universal-character-names; this should only see characters or
code units.
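
To make the concatenation example concrete, here is a minimal sketch of the
code units that phase 7 sees for the argument string. It does not use
make_class_definition (which is only a proposed function); the names
"spliced" and "escaped" are made up for illustration. It assumes C++17 or
later for a constexpr std::string_view.

```cpp
#include <cassert>
#include <string_view>

// Each literal has its escapes processed before concatenation, so
// \u005c contributes exactly one backslash code unit to the object.
constexpr std::string_view spliced = "abc\u005c" "u" "1234";

// The same object contents, spelled with a simple escape instead.
constexpr std::string_view escaped = "abc\\u1234";

// Nine code units: a b c \ u 1 2 3 4. No universal-character-name was
// interpreted by the translator here; the sequence is only spelled out.
static_assert(spliced.size() == 9);
static_assert(spliced[3] == '\\');
static_assert(spliced == escaped);
```

A consumer that interprets universal-character-names again would treat
both spellings as naming the class abc followed by U+1234, which is the
double interpretation in question.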

If that means we have to specify the ordinary literal encoding to
be UTF-8 for compile-time evaluations, so be it. (I haven't thought about
whether we can clearly delineate the transition to runtime evaluation,
where we definitely need to support e.g. EBCDIC as the ordinary literal
encoding. Also, there are some rumors that certain EBCDIC things can't
round-trip through Unicode.)

Jens

Received on 2024-04-19 14:45:47