Date: Tue, 7 May 2024 11:57:24 -0400
On 5/7/24 3:19 AM, Corentin Jabot wrote:
>
>
> On Mon, May 6, 2024 at 8:34 PM Tom Honermann via SG16
> <sg16_at_[hidden]> wrote:
>
> On 5/6/24 1:22 PM, Tom Honermann via SG16 wrote:
>>
>> SG16 will hold a meeting on Wednesday, May 8th, at 19:30 UTC
>> (timezone conversion
>> <https://www.timeanddate.com/worldclock/converter.html?iso=20240508T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>).
>>
>> The agenda follows.
>>
>> * D3258R0: Formatting of charN_t <https://wg21.link/d3258r0>.
>> * P2996R2: Reflection for C++26 <http://wg21.link/p2996r2>.
>>
>> D3258R0 was hastily produced by Corentin following the review of
>> P2996R2 during the 2024-04-24 SG16 meeting
>> <https://github.com/sg16-unicode/sg16-meetings/#april-24th-2024>
>> with the goal of providing a convenient solution for printing
>> UTF-8 text held in char8_t-based storage. It proposes extending
>> std::format() and std::print() to support formatting arguments of
>> Unicode character type (characters and strings of char8_t,
>> char16_t, or char32_t type). It does not propose a solution for
>> iostreams. We won't poll this paper during this meeting for two
>> reasons: 1) the paper is hot off the press and I don't expect
>> everyone to have already read it and internalized all the
>> implications, and 2) I'm going to limit discussion of it to the
>> first half of the meeting so that we continue to make progress on
>> P2996. The intent in discussing it, particularly with the P2996
>> authors present, is to build a sense of whether it suffices to at
>> least minimally address the printing requirements posed by the
>> P2996 authors; we may take a poll on that point.
>>
>> Our recent review of P2996R2 was constructive but not conclusive.
>> We'll continue discussion with a goal of establishing consensus
>> on the following points. Please review the meeting summary from
>> the last review
>> <https://github.com/sg16-unicode/sg16-meetings/#april-24th-2024>
>> as well as the ensuing "Follow up on SG16 review of P2996R2"
>> discussion on the SG16 mailing list
>> <https://lists.isocpp.org/sg16/2024/04/index.php> prior to the
>> meeting.
>>
>> 1. The character type(s) and encoding(s) used for names produced
>> and consumed by reflection interfaces. My sense is that we're
>> leaning in the following direction (not unanimously though):
>> 1. Names will be produced and consumed in both the ordinary
>> literal encoding via type char and UTF-8 via type char8_t.
>> 2. Production of names that contain characters that are not
>> representable in the ordinary literal encoding will
>> produce a string that contains a UCN-like escape sequence
>> for such characters.
>> 3. Consumption of names in the ordinary literal encoding
>> will accept a UCN-like escape sequence for characters not
>> in the basic literal character set that may lack
>> representation in the ordinary literal encoding.
>> 2. The use of a distinct type for names (e.g., a type that
>> stores names in an internal representation and exposes them
>> via char and char8_t interfaces).
>> 3. Unicode NFC requirements (see below).
>>
>> We briefly discussed Unicode normalization form C (NFC) last
>> time. Following adoption of P1949R7 (C++ Identifier Syntax using
>> Unicode Standard Annex 31) <https://wg21.link/p1949r7> as a DR
>> for C++23, identifiers are required to be written in NFC.
>> Conversion to the ordinary literal encoding could result in names
>> that are not in NFC. It will presumably be necessary for P2996 to
>> specify that, for round-trip purposes, conversion to the ordinary
>> literal encoding will not perform character substitutions (e.g.,
>> UCN-like escape sequences will be generated instead). Likewise,
>> it will be necessary to specify how names that do not conform to
>> NFC will be handled by reflection interfaces that consume user
>> provided names. Note that current compiler releases exhibit
>> implementation divergence with respect to enforcement of the NFC
>> requirement (https://godbolt.org/z/E35r1K7hE; gcc does diagnose,
>> Clang and EDG do not, MSVC does not yet implement P1949R7).
>>
>
> No.
Maybe! :)
> We discussed that we cannot guarantee round tripping through arbitrary
> encoding as there is no spec guaranteeing a mapping and Unicode has
> duplicate representations of the same abstract characters.
That was the motivation for the "It will presumably be necessary ..."
statement above.
> This is no different than the mapping that happens in phase 1.
There is a difference in that translation phase 1 is unidirectional. In
this case, we have a round-trip requirement.
> We observed that a lot of duplicate characters normalize to the same
> thing, making it less of a concern.
> I think we agreed (or at least that's where I wanted to get at), that
> while we cannot promise round tripping in all cases, it's not enough
> of a concern to worry about and ought not to impact the design.
That doesn't match my recollection. I had stated that more investigation
and analysis is needed and that we weren't going to be able to resolve
such questions during the last meeting.
>
> Whether a character can be represented at all in a non-unicode
> encoding is a much more prevalent question than whether duplicates
> round trip portably.
> I remain strongly opposed to any form of invention along the lines of
> novel escape sequences as this greatly reduces the portability of C++
> programs and puts an undue burden on users/the ecosystem.
I understand the resistance. We have several choices:
1. Make it impossible to name some identifiers in char-based interfaces
when the ordinary literal encoding is not UTF-8.
2. Enable the ability to name all identifiers in char-based interfaces
regardless of the choice of ordinary literal encoding by using some
form of escape sequences.
3. Not support char-based interfaces at all.
I would argue that a UCN-like escape sequence is not novel considering
that std::format() already produces such escape sequences in
[format.string.escaped] <http://eel.is/c++draft/format.string.escaped>.
I'm confused by your statement that use of such escape sequences would
harm portability. It seems to me that an escape sequence actually
enables a way to write more portable programs. Perhaps we have a
different understanding of the scope of what we're considering here.
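To make choice 2 concrete, here is a minimal sketch of what a UCN-like
round trip could look like. It is an illustration only, not anything
proposed in P2996: escape_name() and unescape_name() are hypothetical
helpers, the ordinary literal encoding is assumed to represent only ASCII,
and the \u{...} spelling is borrowed from [format.string.escaped]. Input
validation is omitted.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: render a name (given as UTF-32 code points) as a
// char string. ASSUMPTION: only ASCII is representable in the ordinary
// literal encoding; everything else becomes a \u{hex} escape sequence.
std::string escape_name(std::u32string_view name) {
    std::string out;
    for (char32_t c : name) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            char buf[16];
            std::snprintf(buf, sizeof buf, "\\u{%lx}",
                          static_cast<unsigned long>(c));
            out += buf;
        }
    }
    return out;
}

// Hypothetical inverse: turn \u{hex} escape sequences back into code points.
// No validation is done; a malformed escape such as \u{} makes std::stoul
// throw, and an unterminated escape is copied through unchanged.
std::u32string unescape_name(std::string_view text) {
    std::u32string out;
    std::size_t i = 0;
    while (i < text.size()) {
        if (text.compare(i, 3, "\\u{") == 0) {
            std::size_t close = text.find('}', i + 3);
            if (close != std::string_view::npos) {
                std::string hex(text.substr(i + 3, close - (i + 3)));
                out.push_back(
                    static_cast<char32_t>(std::stoul(hex, nullptr, 16)));
                i = close + 1;
                continue;
            }
        }
        out.push_back(static_cast<unsigned char>(text[i]));
        ++i;
    }
    return out;
}
```

Under these assumptions, escape_name(U"caf\u00e9") yields the char string
caf\u{e9}, and unescape_name() maps it back to the original code points.
No character substitution takes place, so an NFC name stays in NFC after
the round trip.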
>
> (And yes, as discussed previously clang does not enforce normalization
> yet.)
Hmm, perhaps Clang should be claiming partial support for P1949R7 here
<https://clang.llvm.org/cxx_status.html#cxx23> then (with a footnote).
Tom.
> Thank you to Robin for pointing out an error in my use of Compiler
> Explorer linked above; I neglected to add the
> /source-charset:utf-8 option for MSVC, so the source code wasn't
> interpreted correctly. Corrected at
> https://godbolt.org/z/x1nxGfrYq; MSVC does not diagnose. According
> to MSVC documentation
> <https://learn.microsoft.com/en-us/cpp/overview/visual-cpp-language-conformance>,
> P1949R7 is not yet implemented (Clang and EDG both document it as
> implemented, but fail to diagnose).
>
> Tom.
>
>> Finally, and as a separable issue that can be discussed at
>> another time, I think we should discuss differentiating between
>> names and identifiers in the reflection interfaces. This isn't an
>> issue for data_member_spec() since data members are always
>> identifiers (or are unnamed; that is another interesting case,
>> but isn't an SG16 concern), but could be an issue for a
>> hypothetical function_spec() or member_function_spec() interface
>> used for named functions, constructors and destructors,
>> overloaded operators, conversion operators, user-defined
>> literals, etc. Distinguishing between names and identifiers
>> would avoid the need to parse, e.g., operator bool or ""_udl,
>> when consuming names.
>>
>> Tom.
>>
>>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
Received on 2024-05-07 15:57:27