On Apr 19, 2024, at 9:52 AM, Jens Maurer <jens.maurer@gmx.net> wrote:

On 19/04/2024 15.09, Daveed Vandevoorde wrote:

On Apr 19, 2024, at 2:15 AM, Jens Maurer <jens.maurer@gmx.net> wrote:

On 19/04/2024 03.37, Daveed Vandevoorde via SG16 wrote:

On Apr 18, 2024, at 7:21 PM, Tom Honermann <tom@honermann.net> wrote:
... The contents of the string_view consist of characters of the basic source character set only (an implementation can map other characters using universal character names).

Right. You mentioned in Tokyo that that doesn’t work. Can you elaborate on what the technical hiccup is?

Universal-character-names are interpreted when transitioning
the lexical representation of a string-literal into an object
containing code units [lex.string], or when lexing tokens
outside of string-literals in translation phase 3 [lex.phases].

Nothing will interpret universal-character-names (as such) in
a string_view, because it is already assumed to be a range of
code units.
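To make the distinction concrete, here is a minimal sketch (not taken from the thread) that counts code units. The helper name code_units is hypothetical, and the u8 prefix is used only so that the result does not depend on the implementation's ordinary literal encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// The literal-to-object conversion interprets the universal-character-name:
// 'a', 'b', 'c', then U+1234 encoded as three UTF-8 code units.
constexpr std::size_t interpreted = code_units(u8"abc\u1234");
static_assert(interpreted == 6, "3 basic characters + 3 UTF-8 code units");

// With an escaped backslash, the object holds the spelling itself:
// 'a', 'b', 'c', '\\', 'u', '1', '2', '3', '4'.
constexpr std::size_t uninterpreted = code_units("abc\\u1234");
static_assert(uninterpreted == 9, "3 basic characters + 6-character spelling");

// A string_view over that object is just a range of code units; the
// backslash is still there and nothing has treated "\u1234" as a UCN.
constexpr std::string_view raw = "abc\\u1234";
constexpr bool has_backslash = raw.find('\\') != std::string_view::npos;
static_assert(has_backslash, "the backslash survives as an ordinary code unit");
```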

Right. But that’s mostly a UI issue, no? There is nothing that makes it “not work”. Only that if you want to get corresponding code units, some work will be needed on the consumer side. (Interestingly, the compiler already knows how to do the work for round-tripping support.)

I thought you wanted a "make_class_definition(string_view)" function.

Arguably, a string_view pointing to the code units "abc\u1234"
(no interpretation of the universal-character-name) should produce
an ill-formed class name (because a backslash is not allowed in
a class name).

Thanks for bringing that example up.

My view is that it should be valid if abc\u1234 is valid in source code as a class name.

If my literal encoding is UTF-8, "make_class_definition" should
support a use like this: make_class_definition("abc\u1234"), where
the universal-character-name is interpreted by the string-literal-to-
object conversion. Does that mean we get a double escape interpretation?

Yes.

Consider:
make_class_definition("abc\\u1234")

The string-literal-to-object conversion yields the code unit
sequence "abc\u1234" and then the consumer (i.e. make_class_definition)
interprets universal-character-names once more, and we get an
(ostensibly valid) class-name (assuming \u1234 is a valid identifier
character, which I don't know right now).
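A rough sketch of what that second interpretation on the consumer side could look like follows. Everything here is an assumption for illustration: decode_ucns and hex_value are hypothetical names, not part of any proposal, and the input is assumed to consist only of basic source characters in an ASCII-compatible encoding, so each non-backslash code unit maps directly to its code point. Whether the resulting code points form a valid identifier is a separate check that is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: value of one hexadecimal digit, or -1 if c is not one.
inline int hex_value(char c) {
    constexpr std::string_view lower = "0123456789abcdef";
    constexpr std::string_view upper = "0123456789ABCDEF";
    auto pos = lower.find(c);
    if (pos == std::string_view::npos) pos = upper.find(c);
    return pos == std::string_view::npos ? -1 : static_cast<int>(pos);
}

// Hypothetical consumer-side step: turn the spellings \uXXXX and \UXXXXXXXX
// found in a range of code units into code points. Returns nullopt for a
// backslash that does not start a well-formed universal-character-name.
inline std::optional<std::u32string> decode_ucns(std::string_view text) {
    std::u32string out;
    for (std::size_t i = 0; i < text.size();) {
        if (text[i] != '\\') {
            // Assumption: basic source character, ASCII-compatible encoding.
            out.push_back(static_cast<char32_t>(static_cast<unsigned char>(text[i])));
            ++i;
            continue;
        }
        if (i + 1 >= text.size()) return std::nullopt;
        const std::size_t digits = text[i + 1] == 'u' ? 4 : text[i + 1] == 'U' ? 8 : 0;
        if (digits == 0 || i + 2 + digits > text.size()) return std::nullopt;
        char32_t cp = 0;
        for (std::size_t k = 0; k < digits; ++k) {
            const int v = hex_value(text[i + 2 + k]);
            if (v < 0) return std::nullopt;
            cp = cp * 16 + static_cast<char32_t>(v);
        }
        out.push_back(cp);
        i += 2 + digits;
    }
    return out;
}
```

With this sketch, decode_ucns("abc\\u1234") yields the four code points 'a', 'b', 'c', U+1234: the literal conversion removes one level of escaping, and the consumer removes the second.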

That double interpretation feels surprising and wrong.

Can you identify where that feeling comes from? It seems just right to me.

Daveed

Note that I’m not arguing that producing basic-source-character names/text is my preferred approach. I just want to understand if it has inherent implementation/semantic difficulties as a fallback if no other approach can be made to adequately work.

Ditto.

Jens