Re: Agenda for the 2024-04-24 SG16 meeting

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Sat, 20 Apr 2024 09:46:17 +0200
Getting a name from reflection:
   We can't know how the string will be used, so it needs to follow the
rules of C++: either it is a u8 string, encoded as UTF-8, or it is a
non-UTF string in the ordinary literal encoding (which might be EBCDIC,
etc.). Only UTF-8 (or another Unicode encoding) can represent all
identifiers.
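
   As a minimal sketch of the u8 option (not taken from any reflection
proposal; `name_utf8` is a hypothetical name used only for illustration),
a u8 literal spelled with universal-character-names has the same UTF-8
code units whatever the source or ordinary literal encoding is. The
identifier is the "һèḷḷỏ" example from the quoted discussion below:

```cpp
#include <cstddef>

// Hypothetical identifier, spelled with universal-character-names so the
// source text itself needs only basic characters.
constexpr auto& name_utf8 = u8"\u04BB\u00E8\u1E37\u1E37\u1ECF";

// 2 + 2 + 3 + 3 + 3 UTF-8 code units, plus the terminating null.
static_assert(sizeof(name_utf8) == 14);

// U+04BB encodes as the two code units 0xD2 0xBB.
static_assert(static_cast<unsigned char>(name_utf8[0]) == 0xD2);
static_assert(static_cast<unsigned char>(name_utf8[1]) == 0xBB);
```

   The same characters in an ordinary (non-u8) literal are only usable
if the ordinary literal encoding can represent them.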

Feeding a string to reflection:
    It can either be u8, or potentially be a narrow string. The issue there
is that the mapping from an arbitrary encoding to UTF-8 isn't necessarily
portable. There is a guarantee that a mapping exists but that mapping does
not need to be unique.
    This is an academic concern (the mapping from source to UTF-8 is
similarly implementation-defined and not unique, but we are not aware of
that causing portability issues).

Feeding escaped strings:
   We would have to justify that this solution is less cumbersome for users
than passing a u8 string (parsing would have to be done in the literal
encoding, some encodings cannot represent \, etc.).
   Generally, I have the same concerns as Jens/Tom.
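
   To illustrate why parsing escapes has to be done in the literal
encoding, here is a rough sketch for the Shift-JIS case Tom raises below.
The function names are hypothetical, and the lead-byte ranges are the
commonly used ones for Shift-JIS (0x81-0x9F and 0xE0-0xFC). It assumes
the katakana character SO is encoded as 0x83 0x5C, and that this snippet
itself is compiled with an ASCII-compatible ordinary literal encoding so
that '\\' is the code unit 0x5C:

```cpp
#include <cstddef>
#include <string_view>

// True if `b` starts a two-byte Shift-JIS character (assumed ranges).
constexpr bool is_sjis_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Naive scan: counts every 0x5C code unit, including trail bytes.
constexpr std::size_t count_0x5c_naive(std::string_view s) {
    std::size_t n = 0;
    for (char c : s)
        if (static_cast<unsigned char>(c) == 0x5C) ++n;
    return n;
}

// Decoding scan: skips the trail byte of each two-byte character, so only
// 0x5C code units that stand alone are counted.
constexpr std::size_t count_backslashes_decoded(std::string_view s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const auto b = static_cast<unsigned char>(s[i]);
        if (is_sjis_lead_byte(b)) {
            ++i;  // step over the trail byte, whatever its value
        } else if (b == 0x5C) {
            ++n;
        }
    }
    return n;
}

// Bytes: 0x83 0x5C (one two-byte character), then a real backslash
// followed by "u1234".
constexpr std::string_view sample = "\x83\x5C\\u1234";
static_assert(count_0x5c_naive(sample) == 2);           // one false hit
static_assert(count_backslashes_decoded(sample) == 1);  // only the real one
```

   A consumer that interprets escapes inside a narrow string would need
the second kind of scan, written for whatever the literal encoding
happens to be.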

On Sat, Apr 20, 2024 at 12:13 AM Tom Honermann via SG16 <
sg16_at_[hidden]> wrote:

> On 4/19/24 10:05 AM, Daveed Vandevoorde via SG16 wrote:
>
>
>
> On Apr 19, 2024, at 9:52 AM, Jens Maurer <jens.maurer_at_[hidden]> wrote:
>
>
>
> On 19/04/2024 15.09, Daveed Vandevoorde wrote:
>
>
>
> On Apr 19, 2024, at 2:15 AM, Jens Maurer <jens.maurer_at_[hidden]> wrote:
>
>
>
> On 19/04/2024 03.37, Daveed Vandevoorde via SG16 wrote:
>
>
>
> On Apr 18, 2024, at 7:21 PM, Tom Honermann <tom_at_[hidden]> wrote:
> ... The contents of the string_view consist of characters of the basic
> source character set only (an implementation can map other characters using
> universal character names).
>
>
> Right. You mentioned in Tokyo that that doesn’t work. Can you elaborate
> on what the technical hiccup is?
>
> I ought to know better than to use the words "doesn't work" by now. They
> communicate nothing. The phrase simply doesn't work!
>
> The thoughts Jens has already shared align well with my own, but I'll
> extrapolate further below.
>
>
> Universal-character-names are interpreted when transitioning
> the lexical representation of a string-literal into an object
> containing code units [lex.string], or when lexing tokens
> outside of string-literals in translation phase 3 [lex.phases].
>
> Nothing will interpret universal-character-names (as such) in
> a string_view, because it is already assumed to be a range of
> code units.
>
>
> Right. But that’s mostly a UI issue, no? There is nothing that makes it
> “not work”. Only that if you want to get corresponding code units, some
> work will be needed on the consumer side. (Interestingly, the compiler
> already knows how to do the work for round-tripping support.)
>
>
> I thought you wanted a "make_class_definition(string_view)" function.
>
> Arguably, a string_view pointing to the code units "abc\u1234"
> (no interpretation of the universal-character-name) should produce
> an ill-formed class name (because a backslash is not allowed in
> a class name).
>
>
> Thanks for bringing that example up.
>
> My point of view is that that should be valid if abc\u1234 is valid in
> source code as a class name.
>
> The reflection facilities will obviously need to honor the identifier
> syntax as adopted via P1949R7 (C++ Identifier Syntax using Unicode
> Standard Annex 31) <https://wg21.link/p1949>.
>
> For anyone wanting to play around or write tests, the set of characters
> that are valid as an initial identifier character can be viewed here
> <https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AXID_start%3A%5D_&g=&i=>.
> Valid continuation characters are here
> <https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AXID_continue%3A%5D&g=&i=>.
>
> Here is a fun identifier that contains both combining characters and
> right-to-left characters. "abcאב︠cba" (and no, the combination makes no
> sense but does constitute a valid identifier).
>
>
>
> If my literal encoding is UTF-8, "make_class_definition" should
> support a use like this: make_class_definition("abc\u1234"), where
> the universal-character-name is interpreted by the string-literal-to-
> object conversion. Does that mean we get a double escape interpretation?
>
>
> Yes.
>
>
> Consider:
> make_class_definition("abc\\u1234")
>
> The string-literal-to-object conversion yields the code unit
> sequence "abc\u1234" and then the consumer (i.e. make_class_definition)
> interprets universal-character-names once more, and we get an
> (ostensibly valid) class-name (assuming \u1234 is a valid identifier
> character, which I don't know right now).
>
> That double interpretation feels surprising and wrong.
>
>
> Can you identify where that feeling comes from? It seems just right to me.
>
> To my knowledge, there is nowhere else in the C or C++ standards where
> such double escape interpretation occurs.
>
> It also creates potentially confusing (though not actually ambiguous)
> situations for encodings like Shift-JIS that may include double-byte
> character sequences in which one of the code units matches the code point
> for '\' but does not actually encode a '\' character. For these encodings,
> correctly identifying an escape sequence requires decoding; it isn't safe
> to just scan for a '\' (ASCII 0x5C) code unit.
>
> From a user perspective, if the ordinary literal encoding is UTF-8, I
> would expect to be able to pass, for example, "һèḷḷỏ" to
> make_class_definition() and have the name correctly interpreted; the user
> shouldn't be required to pass "\\u04BB\\u00E8\\u1E37\\u1E37\\u1ECF".
>
> Ideally, as a portable syntax that is agnostic to both the source and
> ordinary literal encodings, we would support
> "\u04BB\u00E8\u1E37\u1E37\u1ECF" as a way to refer to an otherwise
> unutterable identifier. But that would be somewhere between really
> difficult and impossible to do, since constant evaluation occurs after
> string literals are initialized with their source strings converted to
> the ordinary literal encoding (which yields an ill-formed program if any
> of the characters are not representable;
> [lex.string]p(10.1) <http://eel.is/c++draft/lex.string#10.1>).
>
> Tom.
>
>
> Daveed
>
>
> Note that I’m not arguing that producing basic-source-character names/text
> is my preferred approach. I just want to understand if it has inherent
> implementation/semantic difficulties as a fallback if no other approach can
> be made to adequately work.
>
>
> Ditto.
>
> Jens
>
>
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2024-04-20 07:46:38