Re: [isocpp-sg16] Agenda for the 2024-05-08 SG16 meeting

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 7 May 2024 14:48:14 -0400
On 5/7/24 1:13 PM, Corentin Jabot wrote:
>
>
> On Tue, May 7, 2024 at 5:57 PM Tom Honermann <tom_at_[hidden]> wrote:
>
> On 5/7/24 3:19 AM, Corentin Jabot wrote:
>>
>>
>> On Mon, May 6, 2024 at 8:34 PM Tom Honermann via SG16
>> <sg16_at_[hidden]> wrote:
>>
>> On 5/6/24 1:22 PM, Tom Honermann via SG16 wrote:
>>>
>>> SG16 will hold a meeting on Wednesday, May 8th, at 19:30 UTC
>>> (timezone conversion
>>> <https://www.timeanddate.com/worldclock/converter.html?iso=20240508T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>).
>>>
>>> The agenda follows.
>>>
>>> * D3258R0: Formatting of charN_t <https://wg21.link/d3258r0>.
>>> * P2996R2: Reflection for C++26 <http://wg21.link/p2996r2>.
>>>
>>> D3258R0 was hastily produced by Corentin following the
>>> review of P2996R2 during the 2024-04-24 SG16 meeting
>>> <https://github.com/sg16-unicode/sg16-meetings/#april-24th-2024>
>>> with the goal of providing a convenient solution for
>>> printing UTF-8 text held in char8_t-based storage. It
>>> proposes extending std::format() and std::print() to support
>>> formatting arguments of Unicode character type (characters
>>> and strings of char8_t, char16_t, or char32_t type). It does
>>> not propose a solution for iostreams. We won't poll this
>>> paper during this meeting for two reasons: 1) the paper is
>>> hot off the press and I don't expect everyone to have
>>> already read it and internalized all the implications, and
>>> 2) I'm going to limit discussion of it to the first half of
>>> the meeting so that we continue to make progress on P2996.
>>> The intent in discussing it, particularly with the P2996
>>> authors present, is to build a sense of whether it suffices
>>> to at least minimally address the printing requirements
>>> posed by the P2996 authors; we may take a poll on that point.
>>>
>>> Our recent review of P2996R2 was constructive but not
>>> conclusive. We'll continue discussion with a goal of
>>> establishing consensus on the following points. Please
>>> review the meeting summary from the last review
>>> <https://github.com/sg16-unicode/sg16-meetings/#april-24th-2024>
>>> as well as the ensuing "Follow up on SG16 review of P2996R2"
>>> discussion on the SG16 mailing list
>>> <https://lists.isocpp.org/sg16/2024/04/index.php> prior to
>>> the meeting.
>>>
>>> 1. The character type(s) and encoding(s) used for names
>>> produced and consumed by reflection interfaces. My sense
>>> is that we're leaning in the following direction (not
>>> unanimously though):
>>> 1. Names will be produced and consumed in both the
>>> ordinary literal encoding via type char and UTF-8
>>> via type char8_t.
>>> 2. Production of names that contain characters that are
>>> not representable in the ordinary literal encoding
>>> will produce a string that contains a UCN-like
>>> escape sequence for such characters.
>>> 3. Consumption of names in the ordinary literal
>>> encoding will accept a UCN-like escape sequence for
>>> characters not in the basic literal character set
>>> that may lack representation in the ordinary literal
>>> encoding.
>>> 2. The use of a distinct type for names (e.g., a type that
>>> stores names in an internal representation and exposes
>>> them via char and char8_t interfaces).
>>> 3. Unicode NFC requirements (see below).
>>>
>>> We briefly discussed Unicode normalization form C (NFC) last
>>> time. Following adoption of P1949R7 (C++ Identifier Syntax
>>> using Unicode Standard Annex 31) <https://wg21.link/p1949r7>
>>> as a DR for C++23, identifiers are required to be written in
>>> NFC. Conversion to the ordinary literal encoding could
>>> result in names that are not in NFC. It will presumably be
>>> necessary for P2996 to specify that, for round-trip
>>> purposes, conversion to the ordinary literal encoding will
>>> not perform character substitutions (e.g., UCN-like escape
>>> sequences will be generated instead). Likewise, it will be
>>> necessary to specify how names that do not conform to NFC
>>> will be handled by reflection interfaces that consume user
>>> provided names. Note that current compiler releases exhibit
>>> implementation divergence with respect to enforcement of the
>>> NFC requirement (https://godbolt.org/z/E35r1K7hE; gcc does
>>> diagnose, Clang and EDG do not, MSVC does not yet implement
>>> P1949R7).
>>>
>>
>> No.
> Maybe! :)
>> We discussed that we cannot guarantee round tripping through
>> arbitrary encoding as there is no spec guaranteeing a mapping and
>> Unicode has duplicate representations of the same abstract
>> characters.
> That was the motivation for the "It will presumably be necessary
> ..." statement above.
>> This is no different than the mapping that happens in phase 1.
> There is a difference in that translation phase 1 is
> unidirectional. In this case, we have a round-trip requirement.
>> We observed that a lot of duplicate characters normalize to the
>> same thing, making it less of a concern.
>> I think we agreed (or at least that's where I wanted to get at),
>> that while we cannot promise round tripping in all cases, it's
>> not enough of a concern to worry about and ought not to impact
>> the design.
> That doesn't match my recollection. I had stated that more
> investigation and analysis is needed and that we weren't going to
> be able to resolve such questions during the last meeting.
>>
>> Whether a character can be represented at all in a non-Unicode
>> encoding is a much more prevalent question than whether
>> duplicates round trip portably.
>> I remain strongly opposed to any form of invention along the
>> lines of novel escape sequences, as this greatly reduces the
>> portability of C++ programs and puts an undue burden on users and
>> the ecosystem.
>
> I understand the resistance. We have several choices:
>
> 1. Make it impossible to name some identifiers in char-based
> interfaces when the ordinary literal encoding is not UTF-8.
> 2. Enable the ability to name all identifiers in char-based
> interfaces regardless of the choice of ordinary literal
> encoding by using some form of escape sequences.
> 3. Not support char-based interfaces at all.
>
> I would argue that a UCN-like escape sequence is not novel
> considering that std::format() already produces such escape
> sequences in [format.string.escaped]
> <http://eel.is/c++draft/format.string.escaped>. I'm confused by
> your statement that use of such escape sequences would harm
> portability. It seems to me that an escape sequence actually
> enables a way to write more portable programs. Perhaps we have a
> different understanding of the scope of what we're considering here.
>
> The design needs to be driven by use cases (and ideally not by
> theoretical concerns that should inform the specification but not
> necessarily the interface).
>
> There are 2 broad categories of use cases:
> 1/ Display the identifiers, for debugging, diagnostics and arguably
> documentation, i.e. use cases for which losing information is somewhat
> acceptable and somewhat unavoidable depending on scenario.
>
> 2/ Using the identifier to generate some code. That can be using
> reflection features, i.e., data_member_spec (although if you want to
> round-trip there you don't actually care about the string, you just
> want to preserve identity; you would care if you wanted to do
> something like data_member_spec(std::format("{}_foo", name_of(^bar)))).
> But there are lots of other cases where you would use the identifier
> to produce code. For example, runtime reflection, Python bindings, or
> arbitrary language bindings. Maybe JSON serialization or whatnot.
>
> All of these scenarios will have the same challenges:
>
> Python will not understand our custom escape sequence, so it will just
> randomly fail on some identifiers until someone does the work to
> implement an "escaped C++ identifiers to UTF-8" conversion in their
> Python framework, or JS framework, or any of the 100s of tools that
> will exist.
> Same for runtime reflection. Nothing would understand the escape
> sequences, with the exception of some magic functions. And you would
> have to unescape them in any scenario that interacts with users or
> external systems.
> Same thing for JSON serialization, databases, network protocols, etc.
> Either C++ has to unescape (and unescaping forces us to answer the
> question that escaping was supposed to avoid), or external systems are
> burdened with that C++ oddity. Either way the users have to do extra
> work.
> Because ultimately, however you look at it, escaping is just an
> encoding mechanism.
> Now we have N+1 problems.
>
> Ultimately, I agree with Victor. If we really want to optimize for
> round tripping, and support both a narrow encoding and UTF-8
> semi-transparently, a magic object is not the worst idea.
> But if we are trying to find a text encoding scheme that does not lose
> data, I would suggest we use the one we already have :)

Thank you, Corentin, that perspective is most helpful to inform the
discussion.
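
For anyone who wants something concrete to poke at before the meeting,
here is a minimal sketch of the escaping option (choice 2 above). It is
not taken from P2996 or D3258. The function names escape_name and
unescape_name are hypothetical, the ordinary literal encoding is
assumed to be plain ASCII, input is assumed to be well-formed UTF-8,
and the \u{...} spelling is just one possible UCN-like form.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>

// Decode the code point starting at s[i] and advance i past it.
// Assumes s is well-formed UTF-8; no validation is performed.
char32_t decode_utf8(std::string_view s, std::size_t& i) {
    unsigned char b = static_cast<unsigned char>(s[i++]);
    if (b < 0x80) return b;
    int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;
    char32_t cp = b & (0x3F >> extra);  // payload bits of the lead byte
    while (extra-- > 0)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Append the UTF-8 encoding of cp to out.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical: produce an ASCII-only spelling of a UTF-8 name, writing
// every non-ASCII code point as \u{hex} instead of substituting it.
std::string escape_name(std::string_view utf8) {
    std::string out;
    for (std::size_t i = 0; i < utf8.size();) {
        char32_t cp = decode_utf8(utf8, i);
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else {
            char buf[16];
            std::snprintf(buf, sizeof buf, "\\u{%x}",
                          static_cast<unsigned>(cp));
            out += buf;
        }
    }
    return out;
}

// Hypothetical inverse: turn each \u{hex} back into UTF-8.
// Assumes every "\u{" has a matching '}' and valid hex digits.
std::string unescape_name(std::string_view escaped) {
    std::string out;
    for (std::size_t i = 0; i < escaped.size();) {
        if (escaped.substr(i, 3) == "\\u{") {
            std::size_t close = escaped.find('}', i);
            assert(close != std::string_view::npos);
            std::string hex(escaped.substr(i + 3, close - (i + 3)));
            append_utf8(out, static_cast<char32_t>(std::stoul(hex, nullptr, 16)));
            i = close + 1;
        } else {
            out += escaped[i++];
        }
    }
    return out;
}
```

Under those assumptions, unescape_name(escape_name(name)) gives back
the original bytes, so the code points (and hence NFC-ness) of the name
survive the trip through the char-based form. It also illustrates your
point: any consumer outside C++ would need its own unescape_name.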

Tom.

>
>>
>> (And yes, as discussed previously clang does not enforce
>> normalization yet.)
>
> Hmm, perhaps Clang should be claiming partial support for P1949R7
> here <https://clang.llvm.org/cxx_status.html#cxx23> then (with a
> footnote).
>
> Tom.
>
>> Thank you to Robin for pointing out an error in my use of
>> Compiler Explorer linked above; I neglected to add the
>> /source-charset:utf-8 option for MSVC, so the source code
>> wasn't interpreted correctly. Corrected at
>> https://godbolt.org/z/x1nxGfrYq; MSVC does not diagnose.
>> According to the MSVC documentation
>> <https://learn.microsoft.com/en-us/cpp/overview/visual-cpp-language-conformance>,
>> P1949R7 is not yet implemented (Clang and EDG both document
>> it as implemented, but fail to diagnose).
>>
>> Tom.
>>
>>> Finally, and as a separable issue that can be discussed at
>>> another time, I think we should discuss differentiating
>>> between names and identifiers in the reflection interfaces.
>>> This isn't an issue for data_member_spec() since data
>>> members are always identifiers (or are unnamed; that is
>>> another interesting case, but isn't an SG16 concern), but
>>> could be an issue for a hypothetical function_spec() or
>>> member_function_spec() interface used for named functions,
>>> constructors and destructors, overloaded operators,
>>> conversion operators, user-defined literals, etc.
>>> Distinguishing between names and identifiers would avoid the
>>> need to parse, e.g., operator bool or ""_udl, when consuming
>>> names.
>>>
>>> Tom.
>>>
>>>

Received on 2024-05-07 18:48:17