Date: Tue, 7 May 2024 09:19:24 +0200
On Mon, May 6, 2024 at 8:34 PM Tom Honermann via SG16 <sg16_at_[hidden]>
wrote:
> On 5/6/24 1:22 PM, Tom Honermann via SG16 wrote:
>
> SG16 will hold a meeting on Wednesday, May 8th, at 19:30 UTC (timezone
> conversion
> <https://www.timeanddate.com/worldclock/converter.html?iso=20240508T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>
> ).
>
> The agenda follows.
>
> - D3258R0: Formatting of charN_t <https://wg21.link/d3258r0>.
> - P2996R2: Reflection for C++26 <http://wg21.link/p2996r2>.
>
> D3258R0 was hastily produced by Corentin following the review of P2996R2
> during the 2024-04-24 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings/#april-24th-2024> with the
> goal of providing a convenient solution for printing UTF-8 text held in
> char8_t-based storage. It proposes extending std::format() and
> std::print() to support formatting arguments of Unicode character type
> (characters and strings of char8_t, char16_t, or char32_t type). It does
> not propose a solution for iostreams. We won't poll this paper during this
> meeting for two reasons: 1) the paper is hot off the press and I don't
> expect everyone to have already read it and internalized all the
> implications, and 2) I'm going to limit discussion of it to the first half
> of the meeting so that we continue to make progress on P2996. The intent in
> discussing it, particularly with the P2996 authors present, is to build a
> sense of whether it suffices to at least minimally address the printing
> requirements posed by the P2996 authors; we may take a poll on that point.
>
> Our recent review of P2996R2 was constructive but not conclusive. We'll
> continue discussion with a goal of establishing consensus on the following
> points. Please review the meeting summary from the last review
> <https://github.com/sg16-unicode/sg16-meetings/#april-24th-2024> as well
> as the ensuing "Follow up on SG16 review of P2996R2" discussion on the
> SG16 mailing list <https://lists.isocpp.org/sg16/2024/04/index.php> prior
> to the meeting.
>
> 1. The character type(s) and encoding(s) used for names produced and
> consumed by reflection interfaces. My sense is that we're leaning in the
> following direction (not unanimously though):
> 1. Names will be produced and consumed in both the ordinary literal
> encoding via type char and UTF-8 via type char8_t.
> 2. Production of names that contain characters that are not
> representable in the ordinary literal encoding will produce a string that
> contains a UCN-like escape sequence for such characters.
> 3. Consumption of names in the ordinary literal encoding will
> accept a UCN-like escape sequence for characters not in the basic literal
> character set that may lack representation in the ordinary literal encoding.
> 2. The use of a distinct type for names (e.g., a type that stores
> names in an internal representation and exposes them via char and
> char8_t interfaces).
> 3. Unicode NFC requirements (see below).
>
> We briefly discussed Unicode normalization form C (NFC) last time.
> Following adoption of P1949R7 (C++ Identifier Syntax using Unicode
> Standard Annex 31) <https://wg21.link/p1949r7> as a DR for C++23,
> identifiers are required to be written in NFC. Conversion to the ordinary
> literal encoding could result in names that are not in NFC. It will
> presumably be necessary for P2996 to specify that, for round-trip purposes,
> conversion to the ordinary literal encoding will not perform character
> substitutions (e.g., UCN-like escape sequences will be generated instead).
> Likewise, it will be necessary to specify how names that do not conform to
> NFC will be handled by reflection interfaces that consume user provided
> names. Note that current compiler releases exhibit implementation
> divergence with respect to enforcement of the NFC requirement (
> https://godbolt.org/z/E35r1K7hE; gcc does diagnose, Clang and EDG do not,
> MSVC does not yet implement P1949R7).
>
>
No.
We discussed that we cannot guarantee round tripping through arbitrary
encoding as there is no spec guaranteeing a mapping and Unicode has
duplicate representations of the same abstract characters.
This is no different than the mapping that happens in phase 1.
We observed that a lot of duplicate characters normalize to the same thing,
making it less of a concern.
I think we agreed (or at least that's where I wanted to get to) that, while
we cannot promise round tripping in all cases, it's not enough of a concern
to worry about and ought not to impact the design.
Whether a character can be represented at all in a non-Unicode encoding is
a much more prevalent question than whether duplicates round trip portably.
I remain strongly opposed to any form of invention along the lines of novel
escape sequences, as this greatly reduces the portability of C++ programs and
puts an undue burden on users and the ecosystem.
(And yes, as discussed previously clang does not enforce normalization yet.)
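
To illustrate what "duplicate representations" means here, a minimal sketch
(the choice of "é" and all names below are assumptions for illustration, not
taken from P2996 or P1949R7): the same abstract character can be spelled as
one precomposed code point or as a base letter plus a combining mark, and the
two spellings do not compare equal unless something normalizes them first.

```cpp
#include <string_view>

// Precomposed spelling: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
// This is the NFC form of the character.
inline constexpr std::u32string_view precomposed = U"\u00E9";

// Decomposed spelling: U+0065 'e' followed by U+0301 COMBINING ACUTE ACCENT.
// This is the NFD form; it is canonically equivalent to the one above.
inline constexpr std::u32string_view decomposed = U"e\u0301";

// Hypothetical helper: a plain code point comparison, with no normalization.
constexpr bool same_code_points(std::u32string_view a, std::u32string_view b) {
    return a == b;
}

// The two spellings differ in length and content at the code point level.
static_assert(precomposed.size() == 1);
static_assert(decomposed.size() == 2);
static_assert(!same_code_points(precomposed, decomposed));
```

A conversion that maps both spellings to a single character in some
non-Unicode encoding cannot tell which one it started from when converting
back, which is the round-tripping limitation described above.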
> Thank you to Robin for pointing out an error in my use of Compiler
> Explorer linked above; I neglected to add the /source-charset:utf-8
> option for MSVC, so the source code wasn't interpreted correctly. Corrected
> at https://godbolt.org/z/x1nxGfrYq; MSVC does not diagnose. According to MSVC
> documentation
> <https://learn.microsoft.com/en-us/cpp/overview/visual-cpp-language-conformance>,
> P1949R7 is not yet implemented (Clang and EDG both document it as
> implemented, but fail to diagnose).
>
> Tom.
>
> Finally, and as a separable issue that can be discussed at another time, I
> think we should discuss differentiating between names and identifiers in
> the reflection interfaces. This isn't an issue for data_member_spec()
> since data members are always identifiers (or are unnamed; that is another
> interesting case, but isn't an SG16 concern), but could be an issue for a
> hypothetical function_spec() or member_function_spec() interface used for
> named functions, constructors and destructors, overloaded operators,
> conversion operators, user-defined literals, etc. Distinguishing between
> names and identifiers would avoid the need to parse, e.g., operator bool
> or ""_udl, when consuming names.
>
> Tom.
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
Received on 2024-05-07 07:19:44