ISOCPP sg16 List: Re: Agenda for the 2024-01-10 SG16 meeting

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 9 Jan 2024 16:27:07 -0500

This is your friendly reminder that this meeting is taking place tomorrow.

Tom.

On 1/8/24 7:29 PM, Tom Honermann via SG16 wrote:
>
> Happy New Year! Time to get back to work...
>
> SG16 will hold a meeting on Wednesday, January 10th, at 19:30 UTC
> (timezone conversion
> <https://www.timeanddate.com/worldclock/converter.html?iso=20240110T193000&p1=1440&p2=tz_pst&p3=tz_mst&p4=tz_cst&p5=tz_est&p6=tz_cet>).
>
> The agenda follows.
>
> * CWG 2843: Undated reference to Unicode makes C++ a moving target
> <https://cplusplus.github.io/CWG/issues/2843.html>
> * P2626R0: charN_t incremental adoption: Casting pointers of UTF
> character types <https://wg21.link/p2626r0>
>
> These are both big topics and I don't expect us to exhaust discussion
> of either one during this meeting. We'll strive to limit discussion to
> 45 minutes for each.
>
> CWG 2843 was recently created following Jonathan Wakely's post to the
> SG16 mailing list <https://lists.isocpp.org/sg16/2024/01/4032.php>
> regarding observable behavioral changes imposed by Unicode 15.1
> relative to Unicode 15.0. The initial concern Jonathan reported is
> that Unicode 15.1 changes the grapheme cluster segmentation rules in a
> way that will impact the field width estimation performed by
> std::format (see [format.string.std]p13
> <https://eel.is/c++draft/format.string.std#13>). Some code point
> sequences containing Indic text will be assigned a shorter field width
> estimation as a result. The UTC decision is recorded in section 5.5
> (Grapheme clusters for Indic scripts) of L2/23-079
> <https://www.unicode.org/L2/L2023/23079-utc175-properties-recs.pdf>
> and the changes are noted in the Unicode 15.1 release announcement
> <https://www.unicode.org/versions/Unicode15.1.0/>. The email
> discussion has mainly focused on the ramifications of an undated
> normative reference to the Unicode Standard as adopted for C++23 via
> P2736R2 (Referencing The Unicode Standard)
> <https://wg21.link/p2736r2>. That adoption followed discussion of
> P2736 (Referencing The Unicode Standard) <https://wg21.link/p2736> and
> the C++23 FR-010-133
> <https://github.com/cplusplus/nbballot/issues/412> and FR-021-013
> <https://github.com/cplusplus/nbballot/issues/423> NB comments during
> the following SG16 meetings.
>
> * 2022-11-02 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2022.md#november-2nd-2022>
> * 2022-11-30 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2022.md#november-30th-2022>
> * 2022-12-14 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2022.md#december-14th-2022>
> * 2023-01-11 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings#january-11th-2023>
> * 2023-01-25 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings#january-25th-2023>
>
> The undated reference to the Unicode Standard presumably provides
> implementors a license to use whichever version of the Unicode
> Standard is preferred and to change that version for their C++23
> conformance modes over time. This flexibility provides some benefits,
> but that comes at the expense of portability guarantees. We'll discuss
> the status quo and whether a change (presumably proposed as a C++23
> DR) is warranted. Possible changes include:
>
> * Mandating use of a particular Unicode Standard version for each
> C++ standard. For example, C++23 might be specified to use Unicode
> 15 and C++26 to use Unicode 19.
> * Mandating use of a minimum Unicode Standard version for each C++
> standard. In this case, the Unicode Standard version being used
> would be implementation-defined and implementors could choose to
> change it over time. For example, gcc 14.1 might use Unicode 15.1
> in its C++23 conformance mode and gcc 17.1 might use Unicode 18.0
> for both its C++23 and C++26 conformance modes.
> * Retaining the status quo.
>
> The C++ standard does not currently acknowledge the potential for
> different Unicode Standard versions to be used for core language
> support, compile-time standard library features, and run-time standard
> library features. Coordinating versions across the complete
> implementation may not lie within the control of a single implementor,
> particularly as we add new Unicode features that implementors might
> prefer to provide via delegation to a platform Unicode support library.
>
> P2626 was last discussed during the 2022-08-24 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2022.md#august-24th-2022>.
> A couple of requests were made during that meeting that are yet to be
> addressed in a new revision:
>
> * Victor requested that the paper be updated to explicitly state
> early in the paper what properties of the types must match for the
> operations to be well-formed.
> * Jens asked if the paper includes examples that are reflective of
> how this facility would be used in something like real world code.
> (I'm interpreting this as a request for such examples; the
> examples in the "Tony table" section of the paper are minimal.)
>
> See this SG16 email thread with subject "An alternative interface for
> P2626R0 ..." <https://lists.isocpp.org/sg16/2022/09/3389.php> from
> September, 2022 for some alternative considerations.
>
> There are two primary design questions that I would like to see us
> make progress on.
>
> 1. How is the duration of access by one type vs. another managed,
> and how should it be?
> 2. Should the ability to cast between underlying types be decoupled
> from the UTF concerns?
>
> The proposed cast utilities return a pointer to the first element of
> the array for the newly anointed type. Once one of those utilities is
> invoked, access to the array and its elements must be performed via
> the returned pointer; the utilities effectively opt in to a pointer
> provenance model of access and access via the original object becomes
> UB. Since no facility is provided to undo those effects, there does
> not appear to be a way to restore access through the original object
> declaration. The following example illustrates a potentially desirable
> use case: to provide a char8_t-based wrapper around an existing
> function that processes UTF-8 text in char-based storage.
>
> void process_as_utf8(const char *p, size_t N) { ... }
> void process_as_utf8(const char8_t *p, size_t N) {
>     process_as_utf8(cast_as_utf_unchecked(p, N), N);
> }
> void f() {
>     char8_t text[] = u8"Zoom";
>     process_as_utf8(text, sizeof(text));
>     text[0] = u8'B'; // UB? How can access be restored?
> }
>
> Use of these utilities in real world use cases will, I think, require
> that the duration of their effects be precisely specified or that a
> facility is provided to restore access via an existing variable
> declaration. Since these utilities are intended to be used in constant
> evaluation, implementations will be required to diagnose cases like
> the above (during constant evaluation). As is, examples similar to the
> one above can demonstrate surprising results since there is no defined
> point at which mutations are committed; see <https://godbolt.org/z/MGbfWKWb7>.
>
> There are use cases for a facility like the one that is proposed to
> enable access to an object via an underlying type relationship. For
> example, to load/store an object of enumeration type via an underlying
> integer type. Decoupling the cast capabilities from the UTF concerns
> would enable additional use cases. Given the existence of functions
> that have a wide contract with respect to well-formed UTF input, is it
> desirable for the cast facility to be concerned with encoding matters
> at all? Does providing two cast operations (cast_as_utf, cast_utf_to)
> help to prevent programming mistakes?
>
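
To make the enumeration use case concrete, here is a sketch of the status quo under the assumption that an existing API produces a value through a pointer to the underlying integer type. The names Color, read_raw, and read_color are hypothetical. Without a cast facility, the enumeration object cannot be handed to the API directly, so the value goes through a temporary of the underlying type.

```cpp
#include <cstdint>
#include <type_traits>

enum class Color : std::int32_t { red = 1, green = 2, blue = 3 };

// Hypothetical existing API that writes a raw value through an integer pointer.
void read_raw(std::int32_t* out) { *out = 2; }

// Status quo: read into a temporary of the underlying type, then convert.
// A Color* cannot be passed to read_raw without some form of cast.
Color read_color() {
    std::underlying_type_t<Color> raw{};
    read_raw(&raw);
    return static_cast<Color>(raw);
}
```

A cast facility decoupled from UTF concerns would, as I read the question above, let read_raw store directly into a Color object instead.
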
> Tom.
>
>

Received on 2024-01-09 21:27:09