Re: Agenda for the 2024-04-24 SG16 meeting

From: Jens Maurer <jens.maurer_at_[hidden]>
Date: Fri, 19 Apr 2024 15:52:10 +0200
On 19/04/2024 15.09, Daveed Vandevoorde wrote:
>
>
>> On Apr 19, 2024, at 2:15 AM, Jens Maurer <jens.maurer_at_[hidden]> wrote:
>>
>>
>>
>> On 19/04/2024 03.37, Daveed Vandevoorde via SG16 wrote:
>>>
>>>
>>>> On Apr 18, 2024, at 7:21 PM, Tom Honermann <tom_at_[hidden]> wrote:
>>>> ... The contents of the string_view consist of characters of the basic source character set only (an implementation can map other characters using universal character names).
>>>>
>>>
>>> Right. You mentioned in Tokyo that that doesn’t work. Can you elaborate on what the technical hiccup is?
>>
>> Universal-character-names are interpreted when transitioning
>> the lexical representation of a string-literal into an object
>> containing code units [lex.string], or when lexing tokens
>> outside of string-literals in translation phase 3 [lex.phases].
>>
>> Nothing will interpret universal-character-names (as such) in
>> a string_view, because it is already assumed to be a range of
>> code units.
>
> Right. But that’s mostly a UI issue, no? There is nothing that makes it “not work”. Only that if you want to get corresponding code units, some work will be needed on the consumer side. (Interestingly, the compiler already knows how to do the work for round-tripping support.)

I thought you wanted a "make_class_definition(string_view)" function.

Arguably, a string_view pointing to the code units "abc\u1234"
(no interpretation of the universal-character-name) should produce
an ill-formed class name (because a backslash is not allowed in
a class name).
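
To make the uninterpreted case concrete, here is a minimal sketch. It uses
a raw string literal so that no universal-character-name interpretation
takes place. The helper "contains_backslash" is hypothetical: it stands in
for one small part of whatever validation a make_class_definition-like
function might perform, and is not a proposed API.

```cpp
#include <string_view>

// Hypothetical helper (not part of any proposal): a real identifier check
// would do far more; this only looks for the offending backslash.
constexpr bool contains_backslash(std::string_view s) {
    return s.find('\\') != std::string_view::npos;
}

// Raw string literal: the six characters \u1234 stay exactly as written.
constexpr std::string_view uninterpreted = R"(abc\u1234)";
static_assert(uninterpreted.size() == 9);
static_assert(contains_backslash(uninterpreted));

// u8 literal: the universal-character-name is interpreted during the
// literal-to-object conversion, giving 'a' 'b' 'c' plus the three UTF-8
// code units E1 88 B4 for U+1234 (sizeof also counts the terminating null).
static_assert(sizeof(u8"abc\u1234") - 1 == 6);
```

The u8 prefix is used only to make the sketch independent of the ordinary
literal encoding; with a UTF-8 ordinary literal encoding, "abc\u1234"
would contain the same six code units.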

If my literal encoding is UTF-8, "make_class_definition" should
support a use like this: make_class_definition("abc\u1234"), where
the universal-character-name is interpreted by the string-literal-to-
object conversion. Does that mean we get a double escape interpretation?

Consider:
make_class_definition("abc\\u1234")

The string-literal-to-object conversion yields the code unit
sequence "abc\u1234" and then the consumer (i.e. make_class_definition)
interprets universal-character-names once more, and we get an
(ostensibly valid) class-name (assuming \u1234 is a valid identifier
character, which I don't know right now).

That double interpretation feels surprising and wrong.
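
A minimal sketch of that double interpretation, under loud assumptions:
"interpret_ucns" is a hypothetical consumer-side step of the kind
make_class_definition would need, it handles only the \uXXXX form with
exactly four hex digits, it emits UTF-8, and it does not reject surrogate
code points. None of this is taken from a proposal.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Returns the value of a hex digit, or -1 if c is not one.
// Assumes the letters a-f and A-F are contiguous, as in ASCII.
inline int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Appends the UTF-8 encoding of a code point below 0x10000.
inline void append_utf8(std::string& out, unsigned cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical consumer-side pass: replaces each \uXXXX found in the code
// units with its UTF-8 encoding and copies everything else unchanged.
inline std::string interpret_ucns(std::string_view in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        bool is_ucn = in.size() - i >= 6 && in[i] == '\\' && in[i + 1] == 'u';
        unsigned cp = 0;
        for (std::size_t k = 2; is_ucn && k < 6; ++k) {
            int h = hex_value(in[i + k]);
            if (h < 0) is_ucn = false;
            else cp = cp * 16 + static_cast<unsigned>(h);
        }
        if (!is_ucn) { out += in[i++]; continue; }
        append_utf8(out, cp);
        i += 6;
    }
    return out;
}

inline void double_interpretation_demo() {
    // First interpretation (by the compiler): \\ becomes one backslash,
    // so the object holds the nine code units a b c \ u 1 2 3 4.
    std::string_view after_literal = "abc\\u1234";
    assert(after_literal.size() == 9);

    // Second interpretation (by the consumer): the backslash sequence is
    // now read as a universal-character-name and replaced by U+1234.
    std::string after_consumer = interpret_ucns(after_literal);
    assert(after_consumer == "abc\xE1\x88\xB4");
}
```

The author of the call escaped the backslash precisely to avoid naming
U+1234, yet the consumer-side pass names it anyway.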

> Note that I’m not arguing that producing basic-source-character names/text is my preferred approach. I just want to understand if it has inherent implementation/semantic difficulties as a fallback if no other approach can be made to adequately work.

Ditto.

Jens

Received on 2024-04-19 13:52:27