ISOCPP sg16 List: Re: [isocpp-sg16] Agenda for the 2024-05-08 SG16 meeting

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Tue, 7 May 2024 19:13:54 +0200

On Tue, May 7, 2024 at 5:57 PM Tom Honermann <tom_at_[hidden]> wrote:

> On 5/7/24 3:19 AM, Corentin Jabot wrote:
>
>
>
> On Mon, May 6, 2024 at 8:34 PM Tom Honermann via SG16 <
> sg16_at_[hidden]> wrote:
>
>> On 5/6/24 1:22 PM, Tom Honermann via SG16 wrote:
>>
>> SG16 will hold a meeting on Wednesday, May 8th, at 19:30 UTC (timezone
>> conversion
>> <https://www.timeanddate.com/worldclock/converter.html?iso=20240508T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>
>> ).
>>
>> The agenda follows.
>>
>> - D3258R0: Formatting of charN_t <https://wg21.link/d3258r0>.
>> - P2996R2: Reflection for C++26 <http://wg21.link/p2996r2>.
>>
>> D3258R0 was hastily produced by Corentin following the review of P2996R2
>> during the 2024-04-24 SG16 meeting
>> <https://github.com/sg16-unicode/sg16-meetings/#april-24th-2024> with
>> the goal of providing a convenient solution for printing UTF-8 text held in
>> char8_t-based storage. It proposes extending std::format() and
>> std::print() to support formatting arguments of Unicode character type
>> (characters and strings of char8_t, char16_t, or char32_t type). It does
>> not propose a solution for iostreams. We won't poll this paper during this
>> meeting for two reasons: 1) the paper is hot off the press and I don't
>> expect everyone to have already read it and internalized all the
>> implications, and 2) I'm going to limit discussion of it to the first half
>> of the meeting so that we continue to make progress on P2996. The intent in
>> discussing it, particularly with the P2996 authors present, is to build a
>> sense of whether it suffices to at least minimally address the printing
>> requirements posed by the P2996 authors; we may take a poll on that point.
>>
>> Our recent review of P2996R2 was constructive but not conclusive. We'll
>> continue discussion with a goal of establishing consensus on the following
>> points. Please review the meeting summary from the last review
>> <https://github.com/sg16-unicode/sg16-meetings/#april-24th-2024> as well
>> as the ensuing "Follow up on SG16 review of P2996R2" discussion on the
>> SG16 mailing list <https://lists.isocpp.org/sg16/2024/04/index.php>
>> prior to the meeting.
>>
>> 1. The character type(s) and encoding(s) used for names produced and
>> consumed by reflection interfaces. My sense is that we're leaning in the
>> following direction (not unanimously though):
>> 1. Names will be produced and consumed in both the ordinary
>> literal encoding via type char and UTF-8 via type char8_t.
>> 2. Production of names that contain characters that are not
>> representable in the ordinary literal encoding will produce a string that
>> contains a UCN-like escape sequence for such characters.
>> 3. Consumption of names in the ordinary literal encoding will
>> accept a UCN-like escape sequence for characters not in the basic literal
>> character set that may lack representation in the ordinary literal encoding.
>> 2. The use of a distinct type for names (e.g., a type that stores
>> names in an internal representation and exposes them via char and
>> char8_t interfaces).
>> 3. Unicode NFC requirements (see below).
>>
>> We briefly discussed Unicode normalization form C (NFC) last time.
>> Following adoption of P1949R7 (C++ Identifier Syntax using Unicode
>> Standard Annex 31) <https://wg21.link/p1949r7> as a DR for C++23,
>> identifiers are required to be written in NFC. Conversion to the ordinary
>> literal encoding could result in names that are not in NFC. It will
>> presumably be necessary for P2996 to specify that, for round-trip purposes,
>> conversion to the ordinary literal encoding will not perform character
>> substitutions (e.g., UCN-like escape sequences will be generated instead).
>> Likewise, it will be necessary to specify how names that do not conform to
>> NFC will be handled by reflection interfaces that consume user provided
>> names. Note that current compiler releases exhibit implementation
>> divergence with respect to enforcement of the NFC requirement (
>> https://godbolt.org/z/E35r1K7hE; gcc does diagnose, Clang and EDG do
>> not, MSVC does not yet implement P1949R7).
>>
>>
> No.
>
> Maybe! :)
>
> We discussed that we cannot guarantee round tripping through arbitrary
> encoding as there is no spec guaranteeing a mapping and Unicode has
> duplicate representations of the same abstract characters.
>
> That was the motivation for the "It will presumably be necessary ..."
> statement above.
>
> This is no different than the mapping that happens in phase 1.
>
> There is a difference in that translation phase 1 is unidirectional. In
> this case, we have a round-trip requirement.
>
> We observed that a lot of duplicate characters normalize to the same
> thing, making it less of a concern.
> I think we agreed (or at least that's where I wanted to get to), that
> while we cannot promise round tripping in all cases, it's not enough of a
> concern to worry about and ought not to impact the design.
>
> That doesn't match my recollection. I had stated that more investigation
> and analysis is needed and that we weren't going to be able to resolve such
> questions during the last meeting.
>
>
> Whether a character can be represented at all in a non-Unicode encoding is
> a much more prevalent question than whether duplicates round trip portably.
> I remain strongly opposed to any form of invention along the lines of
> novel escape sequences, as this greatly reduces the portability of C++
> programs and puts an undue burden on users and the ecosystem.
>
> I understand the resistance. We have several choices:
>
> 1. Make it impossible to name some identifiers in char-based
> interfaces when the ordinary literal encoding is not UTF-8.
> 2. Enable the ability to name all identifiers in char-based interfaces
> regardless of the choice of ordinary literal encoding by using some form of
> escape sequences.
> 3. Not support char-based interfaces at all.
>
> I would argue that a UCN-like escape sequence is not novel considering
> that std::format() already produces such escape sequences in
> [format.string.escaped] <http://eel.is/c++draft/format.string.escaped>.
> I'm confused by your statement that use of such escape sequences would harm
> portability. It seems to me that an escape sequence actually enables a way
> to write more portable programs. Perhaps we have a different understanding
> of the scope of what we're considering here.
>
The design needs to be driven by use cases (and ideally not by
theoretical concerns, which should inform the specification but not
necessarily the interface).

There are 2 broad categories of use cases:
1/ Displaying the identifiers, for debugging, diagnostics and arguably
documentation, i.e. use cases for which losing information is somewhat
acceptable and somewhat unavoidable, depending on the scenario.

2/ Using the identifier to generate some code. That can be through reflection
features, i.e. data_member_spec (although if you want to round-trip there, you
don't actually care about the string, you just want to preserve identity;
you would care if you wanted to do something like
data_member_spec(std::format("{}_foo",
name_of(^bar))) ).
But there are lots of other cases where you would use the identifier to
produce code: for example, runtime reflection, Python bindings, or
arbitrary language bindings. Maybe JSON serialization or whatnot.

All of these scenarios will have the same challenges:

Python will not understand our custom escape sequences, so it will just
randomly fail on some identifiers until someone does the work to implement
an "escaped C++ identifiers to UTF-8" conversion in their Python framework,
or JS framework, or any of the hundreds of tools that will exist.
The same goes for runtime reflection. Nothing would understand the escape
sequences, with the exception of some magic functions, and you would have to
unescape them in any scenario that interacts with users or external systems.
The same goes for JSON serialization, databases, network protocols, etc.
Either C++ has to unescape (and unescaping forces us to answer the question
that escaping was supposed to avoid), or external systems are burdened with
these C++ oddities. Either way, users have to do extra work.
That is because, however you look at it, escaping is just an encoding
mechanism.
Now we have N+1 problems.
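
To make the "escaping is just an encoding" point concrete, here is a minimal
sketch of what such a scheme could look like. It is not taken from P2996 or
any other paper. The function names (decode_one, append_utf8, escape_name,
unescape_name) are hypothetical, the \u{...} spelling is borrowed from
[format.string.escaped] purely for illustration, and the input is assumed to
be well-formed UTF-8 (no validation is done).

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: decode one code point from well-formed UTF-8,
// advancing i past it. No validation (assumption: input is valid).
inline char32_t decode_one(std::string_view s, std::size_t& i) {
    unsigned char b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80) return b0;
    int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;
    char32_t cp = b0 & (0x3F >> extra);  // payload bits of the lead byte
    while (extra-- > 0)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Hypothetical helper: append a code point to out as UTF-8.
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Replace every non-ASCII code point with \u{hex} (lower-case, shortest form).
inline std::string escape_name(std::string_view utf8) {
    std::string out;
    for (std::size_t i = 0; i < utf8.size();) {
        char32_t cp = decode_one(utf8, i);
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else {
            char buf[16];
            std::snprintf(buf, sizeof buf, "\\u{%x}", static_cast<unsigned>(cp));
            out += buf;
        }
    }
    return out;
}

// The decoder that every consumer (Python, JSON tooling, ...) would need.
// Assumes each \u{...} holds at least one hex digit; std::stoul throws otherwise.
inline std::string unescape_name(std::string_view s) {
    std::string out;
    for (std::size_t i = 0; i < s.size();) {
        if (s.substr(i, 3) == "\\u{") {
            std::size_t close = s.find('}', i + 3);
            if (close != std::string_view::npos) {
                std::string hex(s.substr(i + 3, close - i - 3));
                append_utf8(out, static_cast<char32_t>(std::stoul(hex, nullptr, 16)));
                i = close + 1;
                continue;
            }
        }
        out += s[i++];
    }
    return out;
}
```

Under these assumptions, escape_name applied to the UTF-8 bytes of "café"
yields the ASCII string caf\u{e9}, and only code that knows to call
unescape_name (or a port of it to its own language) gets the original name
back. Everything else sees the escaped spelling as a different identifier.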

Ultimately, I agree with Victor. If we really want to optimize for round
tripping, and support both narrow encodings and UTF-8 semi-transparently, a
magic object is not the worst idea.
But if we are trying to find a text encoding scheme that does not lose
data, I would suggest we use the one we already have :)

>
> (And yes, as discussed previously clang does not enforce normalization
> yet.)
>
> Hmm, perhaps Clang should be claiming partial support for P1949R7 here
> <https://clang.llvm.org/cxx_status.html#cxx23> then (with a footnote).
>
> Tom.
>
>
>
>> Thank you to Robin for pointing out an error in my use of Compiler
>> Explorer linked above; I neglected to add the /source-charset:utf-8
>> option for MSVC, so the source code wasn't interpreted correctly. Corrected
>> at https://godbolt.org/z/x1nxGfrYq; MSVC does not diagnose. According to the MSVC
>> documentation
>> <https://learn.microsoft.com/en-us/cpp/overview/visual-cpp-language-conformance>,
>> P1949R7 is not yet implemented (Clang and EDG both document it as
>> implemented, but fail to diagnose).
>>
>> Tom.
>>
>> Finally, and as a separable issue that can be discussed at another time,
>> I think we should discuss differentiating between names and identifiers in
>> the reflection interfaces. This isn't an issue for data_member_spec()
>> since data members are always identifiers (or are unnamed; that is another
>> interesting case, but isn't an SG16 concern), but could be an issue for a
>> hypothetical function_spec() or member_function_spec() interface used
>> for named functions, constructors and destructors, overloaded operators,
>> conversion operators, user-defined literals, etc.... Distinguishing between
>> names and identifiers would avoid the need to parse, e.g., operator bool
>> or ""_udl, when consuming names.
>>
>> Tom.
>>
>

Received on 2024-05-07 17:14:14