On Apr 19, 2024, at 9:52 AM, Jens Maurer <jens.maurer@gmx.net> wrote:
On 19/04/2024 15.09, Daveed Vandevoorde wrote:
On Apr 19, 2024, at 2:15 AM, Jens Maurer <jens.maurer@gmx.net> wrote:
On 19/04/2024 03.37, Daveed Vandevoorde via SG16 wrote:
On Apr 18, 2024, at 7:21 PM, Tom Honermann <tom@honermann.net> wrote:
... The contents of the string_view consist of characters of the basic source character set only (an implementation can map other characters using universal character names).
Right. You mentioned in Tokyo that that doesn’t work. Can you elaborate on what the technical hiccup is?
I ought to know better than to use the words "doesn't work" by now. They communicate nothing. The phrase simply doesn't work!
The thoughts Jens has already shared align well with my own, but
I'll extrapolate further below.
Universal-character-names are interpreted when transitioning
the lexical representation of a string-literal into an object
containing code units [lex.string], or when lexing tokens
outside of string-literals in translation phase 3 [lex.phases].
Nothing will interpret universal-character-names (as such) in
a string_view, because it is already assumed to be a range of
code units.
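A minimal sketch of that difference, assuming a UTF-8 (`u8`) literal so the result does not depend on the ordinary literal encoding; the constant names are made up for illustration:

```cpp
#include <cstddef>

// In a string-literal, \u1234 is interpreted during literal-to-object
// conversion: U+1234 becomes the three UTF-8 code units E1 88 B4.
constexpr std::size_t interpreted_units = sizeof(u8"\u1234") - 1;

// With the backslash escaped, the object holds the six code units
// '\', 'u', '1', '2', '3', '4'. Nothing downstream treats them as a
// universal-character-name unless a consumer chooses to.
constexpr std::size_t uninterpreted_units = sizeof("\\u1234") - 1;

// First code unit of the interpreted literal (0xE1 for U+1234 in UTF-8).
constexpr unsigned char first_unit =
    static_cast<unsigned char>(u8"\u1234"[0]);

static_assert(interpreted_units == 3);
static_assert(uninterpreted_units == 6);
static_assert(first_unit == 0xE1);
```

A string_view over the second literal is just those six code units, which is the situation described above.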
Right. But that’s mostly a UI issue, no? There is nothing that makes it “not work”. Only that if you want to get corresponding code units, some work will be needed on the consumer side. (Interestingly, the compiler already knows how to do the work for round-tripping support.)
I thought you wanted a "make_class_definition(string_view)" function.
Arguably, a string_view pointing to the code units "abc\u1234"
(no interpretation of the universal-character-name) should produce
an ill-formed class name (because a backslash is not allowed in
a class name).
Thanks for bringing that example up.
My point of view is that that should be valid if abc\u1234 is valid in source code as a class name.
The reflection facilities will obviously need to honor the identifier syntax as adopted via P1949R7 (C++ Identifier Syntax using Unicode Standard Annex 31).
For anyone wanting to play around or write tests, the set of characters that are valid as an initial identifier character can be viewed here. Valid continuation characters are here.
Here is a fun identifier that contains both combining characters
and right-to-left characters. "abcאב︠cba" (and no, the combination
makes no sense but does constitute a valid identifier).
If my literal encoding is UTF-8, "make_class_definition" should
support a use like this: make_class_definition("abc\u1234"), where
the universal-character-name is interpreted by the string-literal-to-
object conversion. Does that mean we get a double escape interpretation?
Yes.
Consider:
make_class_definition("abc\\u1234")
The string-literal-to-object conversion yields the code unit
sequence "abc\u1234" and then the consumer (i.e. make_class_definition)
interprets universal-character-names once more, and we get an
(ostensibly valid) class-name (assuming \u1234 is a valid identifier
character, which I don't know right now).
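The second interpretation step could look roughly like the sketch below. `interpret_ucns` and `hex_value` are hypothetical helpers, not part of any reflection proposal; the sketch handles only the four-digit `\uXXXX` form and assumes the input is basic-character-set text in an ASCII-compatible encoding:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: value of one hexadecimal digit, or -1 if not a digit.
constexpr int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical consumer-side step: turn code units such as "abc\u1234"
// (with a literal backslash) into code points. Returns nullopt for a
// backslash that does not start a well-formed \uXXXX sequence.
std::optional<std::u32string> interpret_ucns(std::string_view text) {
    std::u32string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] != '\\') {
            // Assumption: non-escape code units are ASCII characters.
            out.push_back(static_cast<unsigned char>(text[i]));
            continue;
        }
        // Need 'u' plus four hex digits after the backslash.
        if (i + 5 >= text.size() || text[i + 1] != 'u') return std::nullopt;
        char32_t cp = 0;
        for (std::size_t k = 2; k <= 5; ++k) {
            int v = hex_value(text[i + k]);
            if (v < 0) return std::nullopt;
            cp = cp * 16 + static_cast<char32_t>(v);
        }
        out.push_back(cp);
        i += 5;  // skip "uXXXX"; the loop increment skips the backslash
    }
    return out;
}
```

With this, `interpret_ucns("abc\\u1234")` sees the code units `abc\u1234` after the literal conversion and yields the four code points a, b, c, U+1234. Whether U+1234 is then accepted in an identifier is a separate check.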
That double interpretation feels surprising and wrong.
Can you identify where that feeling comes from? It seems just right to me.
To my knowledge, there is nowhere else in the C or C++ standards where such double escape interpretation occurs.
It also creates potentially confusing (though not actually ambiguous) situations for encodings like Shift-JIS that may include double-byte character sequences in which one of the code units has the same value as the code unit for '\' (0x5C) but does not actually encode a '\' character. For these encodings, correctly identifying an escape sequence requires decoding; it isn't safe to just scan for a '\' (ASCII 0x5C) code unit.
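To make the Shift-JIS concern concrete, here is a rough sketch. The function names are invented for illustration, and the lead-byte ranges (0x81 to 0x9F, 0xE0 to 0xFC) are an assumption that matches common Shift-JIS variants such as CP932; a real implementation would use a proper decoder:

```cpp
#include <cstddef>
#include <string_view>

// Assumed lead-byte ranges for double-byte Shift-JIS sequences.
constexpr bool is_sjis_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Naive scan: reports the first 0x5C code unit, even if it is the
// trail byte of a double-byte character.
constexpr std::size_t find_backslash_naive(std::string_view s) {
    return s.find('\x5C');
}

// Decoding scan: skips the trail byte after every lead byte, so only a
// stand-alone 0x5C code unit is reported.
constexpr std::size_t find_backslash_decoding(std::string_view s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        auto b = static_cast<unsigned char>(s[i]);
        if (is_sjis_lead_byte(b)) {
            ++i;  // the next code unit belongs to this character
            continue;
        }
        if (b == 0x5C) return i;
    }
    return std::string_view::npos;
}
```

For example, the katakana character U+30BD is encoded as 0x83 0x5C in Shift-JIS. The naive scan reports a backslash at index 1 of that sequence, while the decoding scan reports none.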
From a user perspective, if the ordinary literal encoding is UTF-8, I would expect to be able to pass, for example, "һèḷḷỏ" to make_class_definition() and have the name correctly interpreted; the user shouldn't be required to pass "\\u04BB\\u00E8\\u1E37\\u1E37\\u1ECF".
Ideally, as a portable source and ordinary literal encoding
agnostic syntax, we would support "\u04BB\u00E8\u1E37\u1E37\u1ECF"
as a way to refer to an otherwise unutterable identifier but that
would be somewhere between really difficult and impossible to do
since constant evaluation occurs after string literals are
initialized with their source strings converted to the ordinary
literal encoding (which yields an ill-formed program if any of the
characters are not representable; [lex.string]p(10.1)).
Tom.
Daveed
Note that I’m not arguing that producing basic-source-character names/text is my preferred approach. I just want to understand if it has inherent implementation/semantic difficulties as a fallback if no other approach can be made to adequately work.
Ditto.
Jens