ISOCPP sg16 List: Re: Agenda for the 2024-01-10 SG16 meeting

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Tue, 9 Jan 2024 09:37:53 +0100

On Tue, Jan 9, 2024 at 1:29 AM Tom Honermann via SG16 <sg16_at_[hidden]>
wrote:

> Happy New Year! Time to get back to work...
>
> SG16 will hold a meeting on Wednesday, January 10th, at 19:30 UTC (timezone
> conversion
> <https://www.timeanddate.com/worldclock/converter.html?iso=20240110T193000&p1=1440&p2=tz_pst&p3=tz_mst&p4=tz_cst&p5=tz_est&p6=tz_cet>
> ).
>
> The agenda follows.
>
> - CWG 2843: Undated reference to Unicode makes C++ a moving target
> <https://cplusplus.github.io/CWG/issues/2843.html>
> - P2626R0: charN_t incremental adoption: Casting pointers of UTF
> character types <https://wg21.link/p2626r0>
>
> These are both big topics and I don't expect us to exhaust discussion of
> either one during this meeting. We'll strive to limit discussion to 45
> minutes for each.
>
> CWG 2843 was recently created following Jonathan Wakely's post to the
> SG16 mailing list <https://lists.isocpp.org/sg16/2024/01/4032.php>
> regarding observable behavioral changes imposed by Unicode 15.1 relative to
> Unicode 15.0. The initial concern Jonathan reported is that Unicode 15.1
> changes the grapheme cluster segmentation rules in a way that will impact
> the field width estimation performed by std::format (see
> [format.string.std]p13 <https://eel.is/c++draft/format.string.std#13>).
> Some code point sequences containing Indic text will be assigned a shorter
> field width estimation as a result.
>
> The UTC decision is recorded in section 5.5 (Grapheme clusters for Indic
> scripts) of L2/23-079
> <https://www.unicode.org/L2/L2023/23079-utc175-properties-recs.pdf> and
> the changes are noted in the Unicode 15.1 release announcement
> <https://www.unicode.org/versions/Unicode15.1.0/>. The email discussion
> has mainly focused on the ramifications of an undated normative reference
> to the Unicode Standard as adopted for C++23 via P2736R2 (Referencing The
> Unicode Standard) <https://wg21.link/p2736r2>. That adoption followed
> discussion of P2736 (Referencing The Unicode Standard)
> <https://wg21.link/p2736> and the C++23 FR-010-133
> <https://github.com/cplusplus/nbballot/issues/412> and FR-021-013
> <https://github.com/cplusplus/nbballot/issues/423> NB comments during the
> following SG16 meetings.
>
> - 2022-11-02 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2022.md#november-2nd-2022>
> - 2022-11-30 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2022.md#november-30th-2022>
> - 2022-12-14 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2022.md#december-14th-2022>
> - 2023-01-11 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings#january-11th-2023>
> - 2023-01-25 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings#january-25th-2023>
>
> The undated reference to the Unicode Standard presumably provides
> implementors a license to use whichever version of the Unicode Standard is
> preferred and to change that version for their C++23 conformance modes over
> time. This flexibility provides some benefits, but that comes at the
> expense of portability guarantees. We'll discuss the status quo and whether
> a change (presumably proposed as a C++23 DR) is warranted. Possible changes
> include:
>
> - Mandating use of a particular Unicode Standard version for each C++
> standard. For example, C++23 might be specified to use Unicode 15 and C++26
> to use Unicode 19.
>
>
> - Mandating use of a minimum Unicode Standard version for each C++
> standard. In this case, the Unicode Standard version being used would be
> implementation-defined and implementors could choose to change it over
> time. For example, gcc 14.1 might use Unicode 15.1 in its C++23 conformance
> mode and gcc 17.1 might use Unicode 18.0 for both its C++23 and C++26
> conformance modes.
>
As far as I can tell, this was the intent; I think it just got lost in
translation.
I do not expect implementers to be on board with having to support multiple
Unicode versions (I am certainly not); the burden is just too great (in
terms of work, data size and/or performance, and making it
harder/impossible to use ICU, ICU4X or another library).
We never got to the end of the discussion about the guarantees we make,
though. But really, we should not make more guarantees than Unicode
provides, including for width estimation.
For UAX, the stability concerns were discussed at length.

> - Retaining the status quo.
>
> The C++ standard does not currently acknowledge the potential for
> different Unicode Standard versions to be used for core language support,
> compile-time standard library features, and run-time standard library
> features. Coordinating versions across the complete implementation may not
> lie within the control of a single implementor; particularly as we add new
> Unicode features that implementors might prefer to provide via delegation
> to a platform Unicode support library.
>
> P2626 was last discussed during the 2022-08-24 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2022.md#august-24th-2022>.
> A couple of requests were made during that meeting that are yet to be
> addressed in a new revision:
>
> - Victor requested that the paper be updated to explicitly state early
> in the paper what properties of the types must match for the operations to
> be well-formed.
> - Jens asked if the paper includes examples that are reflective of how
> this facility would be used in something like real world code.
> (I'm interpreting this as a request for such examples; the examples in
> the "Tony table" section of the paper are minimal)
>
> See this SG16 email thread with subject "An alternative interface for
> P2626R0 ..." <https://lists.isocpp.org/sg16/2022/09/3389.php> from
> September, 2022 for some alternative considerations.
>
> There are two primary design questions that I would like to see us make
> progress on.
>
> 1. How is (or should) the duration of access by one type vs another be
> managed?
> 2. Should the ability to cast between underlying types be decoupled
> from the UTF concerns?
>
> The proposed cast utilities return a pointer to the first element of the
> array for the newly anointed type. Once one of those utilities is invoked,
> access to the array and its elements must be performed via the returned
> pointer; the utilities effectively opt-in to a pointer provenance model of
> access and access via the original object becomes UB. Since no facility is
> provided to undo those effects, there does not appear to be a way to
> restore access through the original object declaration. The following
> example illustrates a potentially desirable use case; to provide a char8_t-based
> wrapper around an existing function that processes UTF-8 text in char-based
> storage.
>
Performing the reverse operation does "reverse the effect".
The reason no RAII wrapper is provided (that's what you are suggesting,
right?) is that, should the caller take a copy of the input pointer,
accessing it after the function returns could still lead to an undesirable
outcome.
I keep repeating this, but in the absence of borrowing, this is a very
sharp-edged interface. It's in the same category as launder:
unapologetically expert-friendly.
It is both very dangerous and very necessary.
I do expect it to be used by ICU (to replace their current workarounds, which
we know can be defeated), std::format and Unicode facilities. Not in user code.

Any attempt to provide a safe interface boils down to trying to implement some
kind of borrowing and exclusive ownership.
There is room to implement a RAII wrapper on top of this interface if
someone needs that. And arguably, it might be the primary use case. But I'm
concerned that hiding the sharpness might do more harm than good.

> void process_as_utf8(const char *p, size_t N) { ... }
> void process_as_utf8(const char8_t *p, size_t N) {
> process_as_utf8(cast_as_utf_unchecked(p, N));
> }
> void f() {
> char8_t text[] = u8"Zoom";
> process_as_utf8(text, sizeof(text));
> text[0] = u8'B'; // UB? How can access be restored?
> }
>
> Use of these utilities in real world use cases will, I think, require that
> the duration of their effects be precisely specified or that a facility is
> provided to restore access via an existing variable declaration. Since
> these utilities are intended to be used in constant evaluation,
> implementations will be required to diagnose cases like the above (during
> constant evaluation). As is, examples similar to the one above can
> demonstrate surprising results since there is no defined point at which
> mutations are committed. https://godbolt.org/z/MGbfWKWb7.
>
> There are use cases for a facility like the one that is proposed to enable
> access to an object via an underlying type relationship. For example, to
> load/store an object of enumeration type via an underlying integer type.
> Decoupling the cast capabilities from the UTF concerns would enable
> additional use cases. Given the existence of functions that have a wide
> contract with respect to well-formed UTF input, is it desirable for the
> cast facility to be concerned with encoding matters at all? Does providing
> two cast operations (cast_as_utf, cast_utf_to) help to prevent
> programming mistakes?
>
The motivation for this facility is one or more of the following:
  - We can't afford a copy (otherwise making a copy is preferable)
  - We want to retain contiguous ranges (otherwise a random access range
doing a bit_cast is preferable)
  - We need to retain a pointer for compat with existing interfaces
consuming pointers.

Note that the second point has limited applicability. For all the range
interfaces, random access to UTF-8 data is not needed.
Someone who is using some SIMD library might need that, though.

So I think this is quite specific.
Besides that, this operation also changes the domain of the values, and I
think it's worth conveying that in the interface; it's one less footgun.

> Tom.
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2024-01-09 08:38:13