Date: Tue, 20 Feb 2024 09:10:44 +0100
On Tue, Feb 20, 2024 at 4:56 AM Tom Honermann via SG16 <
sg16_at_[hidden]> wrote:
> SG16 will hold a meeting on Wednesday, February 21st, at 19:30 UTC (timezone
> conversion
> <https://www.timeanddate.com/worldclock/converter.html?iso=20240221T193000&p1=1440&p2=tz_pst&p3=tz_mst&p4=tz_cst&p5=tz_est&p6=tz_cet>
> ).
>
> The agenda follows.
>
> - CWG 2843: Undated reference to Unicode makes C++ a moving target
> <https://cplusplus.github.io/CWG/issues/2843.html>
> - Identify updates needed for UAX #31 changes in Unicode 15.1.0.
> - LWG 4043: "ASCII" is not a registered character encoding
> <https://wg21.link/lwg4043>
> - LWG 4044: Confusing requirements for std::print on POSIX platforms
> <https://wg21.link/lwg4044>
>
> We reached consensus to recommend Unicode 15.1.0 as the minimum Unicode
> version and normative reference during the 2024-02-07 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings?tab=readme-ov-file#february-7th-2024>.
> I thought this last discussion brought this issue to a conclusion for us,
> but an email sent to the WG14 mailing list by Joseph Myers (on 2024-02-13
> with subject "D.2.1 and UAX#31 revision 39") reminded me of an earlier email
> Corentin sent to the SG16 mailing list
> <https://lists.isocpp.org/sg16/2024/01/4041.php> (on 2024-01-06 with
> subject "UAX Profiles"). Changes made to UAX #31 (Unicode Identifiers and
> Syntax) <https://unicode.org/reports/tr31/> for Unicode 15.1.0 will
> require us to make a decision regarding accepting new character allowances
> in identifiers or adopting a profile to retain the Unicode 15.0.0
> allowances. In either case, changes to annex E (Conformance with UAX #31)
> <http://eel.is/c++draft/uaxid> will be required to reflect that rule
> UAX31-R1a (Restricted Format Characters) has been removed
> <https://www.unicode.org/reports/tr31/tr31-39.html#R1a>.
>
> A summary of the UAX #31 changes for Unicode 15.1.0 is provided in the "Modifications"
> section <https://www.unicode.org/reports/tr31/tr31-39.html#Modifications>.
> A diff of the changes relative to 15.0.0
> <https://www.unicode.org/reports/tr31/tr31-38.html> is also available. My
> understanding of the changes is that U+200C (ZERO WIDTH NON-JOINER) and
> U+200D (ZERO WIDTH JOINER) have been added to XID_continue
> <https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AXID_continue%3A%5D&g=&i=>
> to allow for characters that native speakers of some languages (e.g.,
> Persian) would expect to be able to use in identifiers. Spoofing concerns,
> including those that depend on the presence of the ZWNJ and ZWJ characters,
> remain the subject matter of UTS #39 (Unicode Security Mechanisms)
> <https://unicode.org/reports/tr39/>. I expect that Robin, Corentin, and
> Steve will be able to provide more details of the change and its
> motivation. As I understand things, our choices will be to:
>
> 1. Accept the changes to XID_continue, or
>
For what it's worth, this is what we ended up doing in Clang.
I think the Unicode expectation is that tooling (a vague term here: it
could mean compilers, static analyzers, or IDEs) would implement TR55 to
detect the problematic use cases (whether or not they currently do),
and ultimately deviating from Unicode (after spending so much time and
effort synchronizing with it) seemed unwise.
> 2. Reject the changes to XID_continue by adjusting the profile
> specified in [uaxid.def.general]
> <http://eel.is/c++draft/uaxid.def.general>, possibly by including the Default-Ignorable
> Exclusion Profile
> <https://www.unicode.org/reports/tr31/tr31-39.html#Default_Ignorable_Exclusion_Profile>,
> though that would exclude many code points
> <https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AXID_continue%3A%5D%26%5B%3ADefault_Ignorable_Code_Point%3A%5D&g=&i=>
> beyond ZWNJ and ZWJ.
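To make the two options concrete, here is a minimal sketch of the check an
implementation would apply to an identifier-continue code point. All names
are hypothetical, and `in_xid_continue` stands in for a lookup against the
Unicode 15.1.0 XID_Continue data, which is not reproduced here. The second
function assumes a narrow profile that subtracts only ZWNJ and ZWJ, which is
not the same thing as the Default-Ignorable Exclusion Profile.

```cpp
// Hypothetical helper: the two join controls discussed above.
constexpr bool is_join_control(char32_t c) {
    return c == 0x200C   // ZERO WIDTH NON-JOINER
        || c == 0x200D;  // ZERO WIDTH JOINER
}

// Option 1: accept XID_Continue as published for Unicode 15.1.0.
constexpr bool continue_ok_accept(char32_t /*c*/, bool in_xid_continue) {
    return in_xid_continue;
}

// Option 2 (assumed narrow variant): keep the 15.0.0 allowances by
// removing only ZWNJ and ZWJ from the 15.1.0 set.
constexpr bool continue_ok_exclude_joiners(char32_t c, bool in_xid_continue) {
    return in_xid_continue && !is_join_control(c);
}
```

The only observable difference between the two is the treatment of U+200C
and U+200D; every other code point is decided by the property lookup alone.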
>
> Once consensus for a direction is established, a volunteer will be needed
> to draft wording changes for [uaxid] <http://eel.is/c++draft/uaxid>.
>
> LWG 4043 was recently filed by Jonathan Wakely. It reports a
> straightforward concern; that the set of encodings recognized by
> std::text_encoding does not include "ASCII" despite that name being
> unambiguous and recognized by common encoding libraries. The proposed
> resolution is to add "ASCII" to the set of aliases for that IANA specified
> "US-ASCII" encoding despite the fact that the IANA character set registry
> <https://www.iana.org/assignments/character-sets/character-sets.xhtml>
> does not do so.
>
Ship it
> LWG 4044 was also recently filed by Jonathan Wakely while working to
> implement std::print() in libstdc++. Jonathan's initial implementation
> attempted to do what the C++ standard wording stated and detected
> ill-formed code units written to a stream that is directed to a terminal so
> that they could be diagnosed. He found that the overhead of calling
> isatty() on Linux to determine if a stream is directed to a terminal was
> prohibitively expensive and started questioning why the standard was
> directing him to do this. In private correspondence, it was clarified that
> the intent of the "native Unicode API" terminology was to generically refer
> to the Windows WriteConsoleW() function and that there is no need to do
> anything special on POSIX systems. That discussion also questioned what it
> means to diagnose invalid code units written to a console at run-time.
> Jonathan has been kind enough to draft a proposed resolution to clarify the
> intent.
>
I feel like I'm repeating myself here, but any remotely decent terminal
already has logic to render invalid UTF-8 (which can be observed with
echo -ne "\x80\n").
We do not have to take on the responsibilities of the terminal on
non-Windows platforms.
So "If invoking the native Unicode API does not require transcoding,
implementations are encouraged to diagnose invalid code units." asks for
duplicated work, with additional performance cost and complexity, for an
outcome that is, at best, identical to doing nothing.
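For a sense of what that duplicated work looks like, here is a rough sketch
of the scan an implementation would need to run over every string before
writing it, purely to find ill-formed code units. The helper name is
hypothetical; it is not part of the standard library or of the proposed
resolution for LWG 4044, and a real implementation may well do this
differently.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper. Returns true if `s` is well-formed UTF-8 according to
// the Unicode Standard's table of well-formed UTF-8 byte sequences (Table 3-7).
constexpr bool is_well_formed_utf8(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const auto b0 = static_cast<unsigned char>(s[i]);
        if (b0 <= 0x7F) { ++i; continue; }            // single-byte (ASCII)

        std::size_t len = 0;                          // total sequence length
        unsigned char lo = 0x80, hi = 0xBF;           // allowed range for byte 2
        if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; }
        else if (b0 == 0xE0) { len = 3; lo = 0xA0; }  // reject overlong forms
        else if (b0 >= 0xE1 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xED) hi = 0x9F;                // reject surrogates
        }
        else if (b0 == 0xF0) { len = 4; lo = 0x90; }  // reject overlong forms
        else if (b0 >= 0xF1 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF4) hi = 0x8F;                // reject > U+10FFFF
        }
        else return false;                            // 0x80..0xC1, 0xF5..0xFF

        if (s.size() - i < len) return false;         // truncated sequence
        const auto b1 = static_cast<unsigned char>(s[i + 1]);
        if (b1 < lo || b1 > hi) return false;
        for (std::size_t k = 2; k < len; ++k) {
            const auto b = static_cast<unsigned char>(s[i + k]);
            if (b < 0x80 || b > 0xBF) return false;   // not a continuation byte
        }
        i += len;
    }
    return true;
}

static_assert(is_well_formed_utf8("plain ASCII"));
static_assert(is_well_formed_utf8("\xC3\xA9"));       // U+00E9
static_assert(!is_well_formed_utf8("\x80"));          // lone continuation byte
```

On a POSIX system the terminal performs an equivalent check itself when it
renders the bytes, so running this in the library first only adds a pass over
the data.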
> Tom.
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
Received on 2024-02-20 08:11:04