Maybe! :)
On Mon, May 6, 2024 at 8:34 PM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:
On 5/6/24 1:22 PM, Tom Honermann via SG16 wrote:
SG16 will hold a meeting on Wednesday, May 8th, at 19:30 UTC (timezone conversion).
The agenda follows.
D3258R0 was hastily produced by Corentin following the review of P2996R2 during the 2024-04-24 SG16 meeting with the goal of providing a convenient solution for printing UTF-8 text held in char8_t-based storage. It proposes extending std::format() and std::print() to support formatting arguments of Unicode character type (characters and strings of char8_t, char16_t, or char32_t type). It does not propose a solution for iostreams. We won't poll this paper during this meeting for two reasons: 1) the paper is hot off the press and I don't expect everyone to have already read it and internalized all the implications, and 2) I'm going to limit discussion of it to the first half of the meeting so that we continue to make progress on P2996. The intent in discussing it, particularly with the P2996 authors present, is to build a sense of whether it suffices to at least minimally address the printing requirements posed by the P2996 authors; we may take a poll on that point.
Our recent review of P2996R2 was constructive but not conclusive. We'll continue discussion with a goal of establishing consensus on the following points. Please review the meeting summary from the last review as well as the ensuing "Follow up on SG16 review of P2996R2" discussion on the SG16 mailing list prior to the meeting.
- The character type(s) and encoding(s) used for names produced and consumed by reflection interfaces. My sense is that we're leaning in the following direction (not unanimously though):
- Names will be produced and consumed in both the ordinary literal encoding via type char and UTF-8 via type char8_t.
- Production of names that contain characters that are not representable in the ordinary literal encoding will produce a string that contains a UCN-like escape sequence for such characters.
- Consumption of names in the ordinary literal encoding will accept a UCN-like escape sequence for characters not in the basic literal character set that may lack representation in the ordinary literal encoding.
- The use of a distinct type for names (e.g., a type that stores names in an internal representation and exposes them via char and char8_t interfaces).
- Unicode NFC requirements (see below).
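To make the production and consumption bullets above concrete, here is a minimal sketch of a round trip through a UCN-like escape. The `\u{...}` spelling, the function names `escape_name`, `unescape_name`, and `append_utf8`, and the choice to escape every non-ASCII code point (rather than only those lacking representation in the ordinary literal encoding) are all assumptions for illustration; none of them come from P2996. Input is assumed to be well-formed UTF-8, and escapes are assumed to be well-formed.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <string_view>

// Append the UTF-8 encoding of cp (assumed to be a valid Unicode scalar value).
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical producer: keep ASCII as-is and write every other code point
// as \u{X...}. Assumes well-formed UTF-8; no validation is attempted.
std::string escape_name(std::string_view utf8) {
    std::string out;
    for (std::size_t i = 0; i < utf8.size();) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80) {
            out += static_cast<char>(lead);
            ++i;
            continue;
        }
        // The leading byte determines the sequence length (2, 3, or 4 bytes).
        int len = (lead >= 0xF0) ? 4 : (lead >= 0xE0) ? 3 : 2;
        std::uint32_t cp = lead & (0xFF >> (len + 1));
        for (int k = 1; k < len && i + k < utf8.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        char buf[16];
        std::snprintf(buf, sizeof buf, "\\u{%X}", static_cast<unsigned>(cp));
        out += buf;
        i += len;
    }
    return out;
}

// Hypothetical consumer: replace each \u{X...} with the UTF-8 encoding of
// that code point. Assumes the hex digits between the braces are valid.
std::string unescape_name(std::string_view escaped) {
    std::string out;
    for (std::size_t i = 0; i < escaped.size();) {
        std::size_t close = std::string_view::npos;
        if (escaped.compare(i, 3, "\\u{") == 0) {
            close = escaped.find('}', i + 3);
        }
        if (close == std::string_view::npos) {
            out += escaped[i++];  // not an escape; copy through unchanged
            continue;
        }
        std::string hex(escaped.substr(i + 3, close - (i + 3)));
        append_utf8(out, static_cast<std::uint32_t>(std::stoul(hex, nullptr, 16)));
        i = close + 1;
    }
    return out;
}
```

Because the escape carries the exact code point, no character substitution takes place and the original spelling is recovered on consumption, which is the round-trip property under discussion. Whether the consumed name conforms to NFC is a separate check that this sketch does not perform.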
We briefly discussed Unicode normalization form C (NFC) last time. Following adoption of P1949R7 (C++ Identifier Syntax using Unicode Standard Annex 31) as a DR for C++23, identifiers are required to be written in NFC. Conversion to the ordinary literal encoding could result in names that are not in NFC. It will presumably be necessary for P2996 to specify that, for round-trip purposes, conversion to the ordinary literal encoding will not perform character substitutions (e.g., UCN-like escape sequences will be generated instead). Likewise, it will be necessary to specify how names that do not conform to NFC will be handled by reflection interfaces that consume user provided names. Note that current compiler releases exhibit implementation divergence with respect to enforcement of the NFC requirement (https://godbolt.org/z/E35r1K7hE; gcc does diagnose, Clang and EDG do not, MSVC does not yet implement P1949R7).
No.
That was the motivation for the "It will presumably be necessary ..." statement above.
We discussed that we cannot guarantee round tripping through arbitrary encoding as there is no spec guaranteeing a mapping and Unicode has duplicate representations of the same abstract characters.
There is a difference in that translation phase 1 is unidirectional. In this case, we have a round-trip requirement.
This is no different than the mapping that happens in phase 1.
That doesn't match my recollection. I had stated that more investigation and analysis is needed and that we weren't going to be able to resolve such questions during the last meeting.
We observed that a lot of duplicate characters normalize to the same thing, making it less of a concern.
I think we agreed (or at least that's where I wanted to get at), that while we cannot promise round tripping in all cases, it's not enough of a concern to worry about and ought not to impact the design.
Whether a character can be represented at all in a non-Unicode encoding is a much more prevalent question than whether duplicates round trip portably.
I remain strongly opposed to any form of invention along the lines of novel escape sequences as this greatly reduces the portability of C++ programs and puts an undue burden on users/the ecosystem.
I understand the resistance. We have several choices:
I would argue that a UCN-like escape sequence is not novel considering that std::format() already produces such escape sequences in [format.string.escaped].
I'm confused by your statement that use of such escape sequences would harm portability. It seems to me that an escape sequence actually enables a way to write more portable programs. Perhaps we have a different understanding of the scope of what we're considering here.
(And yes, as discussed previously clang does not enforce normalization yet.)
Hmm, perhaps Clang should be claiming partial support for P1949R7 here then (with a footnote).
Tom.
Thank you to Robin for pointing out an error in my use of Compiler Explorer linked above; I neglected to add the /source-charset:utf-8 option for MSVC, so the source code wasn't interpreted correctly. Corrected at https://godbolt.org/z/x1nxGfrYq; MSVC does not diagnose. According to MSVC documentation, P1949R7 is not yet implemented (Clang and EDG both document it as implemented, but fail to diagnose).
Tom.
Finally, and as a separable issue that can be discussed at another time, I think we should discuss differentiating between names and identifiers in the reflection interfaces. This isn't an issue for data_member_spec() since data members are always identifiers (or are unnamed; that is another interesting case, but isn't an SG16 concern), but could be an issue for a hypothetical function_spec() or member_function_spec() interface used for named functions, constructors and destructors, overloaded operators, conversion operators, user-defined literals, etc. Distinguishing between names and identifiers would avoid the need to parse, e.g., operator bool or ""_udl, when consuming names.
Tom.
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16