Re: Agenda for the 2024-01-10 SG16 meeting

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 9 Jan 2024 16:21:58 -0500
On 1/9/24 3:37 AM, Corentin Jabot wrote:
>
>
> On Tue, Jan 9, 2024 at 1:29 AM Tom Honermann via SG16
> <sg16_at_[hidden]> wrote:
>
> Happy New Year! Time to get back to work...
>
> SG16 will hold a meeting on Wednesday, January 10th, at 19:30 UTC
> (timezone conversion
> <https://www.timeanddate.com/worldclock/converter.html?iso=20240110T193000&p1=1440&p2=tz_pst&p3=tz_mst&p4=tz_cst&p5=tz_est&p6=tz_cet>).
>
> The agenda follows.
>
> * CWG 2843: Undated reference to Unicode makes C++ a moving
> target <https://cplusplus.github.io/CWG/issues/2843.html>
> * P2626R0: charN_t incremental adoption: Casting pointers of UTF
> character types <https://wg21.link/p2626r0>
>
> These are both big topics and I don't expect us to exhaust
> discussion of either one during this meeting. We'll strive to
> limit discussion to 45 minutes for each.
>
> CWG 2843 was recently created following Jonathan Wakely's post to
> the SG16 mailing list
> <https://lists.isocpp.org/sg16/2024/01/4032.php> regarding
> observable behavioral changes imposed by Unicode 15.1 relative to
> Unicode 15.0. The initial concern Jonathan reported is that
> Unicode 15.1 changes the grapheme cluster segmentation rules in a
> way that will impact the field width estimation performed by
> std::format (see [format.string.std]p13
> <https://eel.is/c++draft/format.string.std#13>). Some code point
> sequences containing Indic text will be assigned a shorter field
> width estimation as a result.
>
> The UTC decision is recorded in section 5.5 (Grapheme clusters for
> Indic scripts) of L2/23-079
> <https://www.unicode.org/L2/L2023/23079-utc175-properties-recs.pdf>
> and the changes are noted in the Unicode 15.1 release announcement
> <https://www.unicode.org/versions/Unicode15.1.0/>. The email
> discussion has mainly focused on the ramifications of an undated
> normative reference to the Unicode Standard as adopted for C++23
> via P2736R2 (Referencing The Unicode Standard)
> <https://wg21.link/p2736r2>. That adoption followed discussion of
> P2736 (Referencing The Unicode Standard) <https://wg21.link/p2736>
> and the C++23 FR-010-133
> <https://github.com/cplusplus/nbballot/issues/412> and FR-021-013
> <https://github.com/cplusplus/nbballot/issues/423> NB comments
> during the following SG16 meetings.
>
> * 2022-11-02 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2022.md#november-2nd-2022>
> * 2022-11-30 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2022.md#november-30th-2022>
> * 2022-12-14 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2022.md#december-14th-2022>
> * 2023-01-11 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings#january-11th-2023>
> * 2023-01-25 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings#january-25th-2023>
>
> The undated reference to the Unicode Standard presumably provides
> implementors a license to use whichever version of the Unicode
> Standard is preferred and to change that version for their C++23
> conformance modes over time. This flexibility provides some
> benefits, but that comes at the expense of portability guarantees.
> We'll discuss the status quo and whether a change (presumably
> proposed as a C++23 DR) is warranted. Possible changes include:
>
> * Mandating use of a particular Unicode Standard version for
> each C++ standard. For example, C++23 might be specified to
> use Unicode 15 and C++26 to use Unicode 19.
>
> * Mandating use of a minimum Unicode Standard version for each
> C++ standard. In this case, the Unicode Standard version being
> used would be implementation-defined and implementors could
> choose to change it over time. For example, gcc 14.1 might use
> Unicode 15.1 in its C++23 conformance mode and gcc 17.1 might
> use Unicode 18.0 for both its C++23 and C++26 conformance modes.
>
> As far as I can tell, this was the intent; I think it just got lost in
> translation.
> I do not expect implementers to be on board with having to support
> multiple Unicode versions (I am certainly not); the burden is just too
> great (in terms of work, data size and/or performance, and making it
> harder or impossible to use ICU, ICU4X, or another library).
> We never got to the end of the discussion about the guarantees we make,
> though. But really, we should not make more guarantees than what
> Unicode provides, including for width estimation.
> For the UAXes, the stability concerns were discussed at length.
This matches my recollection and intuition as well.
>
> * Retaining the status quo.
>
> The C++ standard does not currently acknowledge the potential for
> different Unicode Standard versions to be used for core language
> support, compile-time standard library features, and run-time
> standard library features. Coordinating versions across the
> complete implementation may not lie within the control of a single
> implementor; particularly as we add new Unicode features that
> implementors might prefer to provide via delegation to a platform
> Unicode support library.
>
> P2626 was last discussed during the 2022-08-24 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2022.md#august-24th-2022>.
> A couple of requests were made during that meeting that are yet to
> be addressed in a new revision:
>
> * Victor requested that the paper be updated to explicitly state
> early in the paper what properties of the types must match for
> the operations to be well-formed.
> * Jens asked if the paper includes examples that are reflective
> of how this facility would be used in something like real
> world code.
> (I'm interpreting this as a request for such examples; the
> examples in the "Tony table" section of the paper are minimal)
>
> See this SG16 email thread with subject "An alternative interface
> for P2626R0 ..." <https://lists.isocpp.org/sg16/2022/09/3389.php>
> from September, 2022 for some alternative considerations.
>
> There are two primary design questions that I would like to see us
> make progress on.
>
> 1. How is (or should) the duration of access by one type vs
> another be managed?
> 2. Should the ability to cast between underlying types be
> decoupled from the UTF concerns?
>
> The proposed cast utilities return a pointer to the first element
> of the array for the newly anointed type. Once one of those
> utilities is invoked, access to the array and its elements must be
> performed via the returned pointer; the utilities effectively
> opt-in to a pointer provenance model of access and access via the
> original object becomes UB. Since no facility is provided to undo
> those effects, there does not appear to be a way to restore access
> through the original object declaration. The following example
> illustrates a potentially desirable use case; to provide a
> char8_t-based wrapper around an existing function that processes
> UTF-8 text in char-based storage.
>
> Performing the reverse operation does "reverse the effect".
Can you show how you would modify the example below to perform that
reversing operation? My understanding is that the call to
cast_as_utf_unchecked() ends the lifetime of the char8_t array object
associated with f()::text (this is what the proposed wording states) and
creates a new sequence of char objects at the same location. Adding a
call to cast_utf_to() would then end the lifetime of the sequence of
char objects and create a new sequence of char8_t objects (again) in
that same location, but I don't think this suffices to satisfy the
transparently replaceable rules ([basic.life]p8
<http://eel.is/c++draft/basic.life#8>) since the storage was reused.
Perhaps you are envisioning an extension to those rules to allow this?
> The reason no RAII wrapper is provided (that's what you are
> suggesting, right?) is because, should the caller take a copy of the
> input pointer, accessing it after the function returns could still
> lead to an undesirable outcome.

I'm not necessarily advocating for an RAII wrapper and I can see use
cases for explicitly managing the duration of valid access by the type
of the cast.

What I am concerned about is how TBAA is preserved and at what point
changes made via one type are reflected for aliasing purposes by the
other type.
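
For comparison, a copy-based wrapper sidesteps the question, at the cost the
proposal is trying to eliminate. The sketch below is only an illustration:
legacy_upcase_ascii() is a hypothetical stand-in for an existing interface, and
char16_t/std::uint_least16_t stand in for any pair of types that share an
underlying type (chosen so the code compiles as C++17). The two memcpy() calls
are the explicit points at which changes made via one type become visible to
the other.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical legacy routine operating on uint_least16_t code units.
// It upper-cases ASCII letters in place.
void legacy_upcase_ascii(std::uint_least16_t* p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (p[i] >= u'a' && p[i] <= u'z') {
            p[i] = static_cast<std::uint_least16_t>(p[i] - (u'a' - u'A'));
        }
    }
}

// char16_t wrapper that never aliases: copy in, call, copy back.
void upcase_ascii(char16_t* p, std::size_t n) {
    assert(p != nullptr || n == 0);
    static_assert(sizeof(char16_t) == sizeof(std::uint_least16_t),
                  "char16_t shares size with its underlying type");
    if (n == 0) {
        return;
    }
    std::vector<std::uint_least16_t> tmp(n);
    // Changes previously made via char16_t become visible to the callee here.
    std::memcpy(tmp.data(), p, n * sizeof(char16_t));
    legacy_upcase_ascii(tmp.data(), n);
    // Changes made via uint_least16_t are reflected in the original here.
    std::memcpy(p, tmp.data(), n * sizeof(char16_t));
}
```

Because no object lifetime is ended, the caller's array remains accessible
through its original declaration after upcase_ascii() returns. What the cast
facility needs is an equally clear definition of those two points without the
copies.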

> I keep repeating this, but in the absence of borrowing, this is a very
> sharp-edged interface. It's in the same category as launder:
> unapologetically expert friendly.
> It is both very dangerous and very necessary.
I agree; I would like to dull the edges as much as possible.
> I do expect it to be used by ICU (to replace their current
> workarounds, which we know can be defeated), std::format, and Unicode
> facilities. Not in user code.

I think it would be helpful to include in the paper what the changes to
ICU would look like.

I also think this facility is needed for user code so that authors of
existing libraries that use char/wchar_t can provide wrapping interfaces
for char8_t/charN_t.

>
> Any attempt to provide a safe interface boils down to trying to implement
> some kind of borrowing and exclusive ownership.
> There is room to implement an RAII wrapper on top of this interface if
> someone needs that. And arguably, it might be the primary use case.
> But I'm concerned that hiding the sharpness might do more harm than good.
>
>
> void process_as_utf8(const char *p, size_t N) { ... }
> void process_as_utf8(const char8_t *p, size_t N) {
>   // Forward both the cast pointer and the length to the char overload.
>   process_as_utf8(cast_as_utf_unchecked(p, N), N);
> }
> void f() {
>   char8_t text[] = u8"Zoom";
>   process_as_utf8(text, sizeof(text));
>   text[0] = u8'B'; // UB? How can access be restored?
> }
>
> Use of these utilities in real world use cases will, I think,
> require that the duration of their effects be precisely specified
> or that a facility is provided to restore access via an existing
> variable declaration. Since these utilities are intended to be
> used in constant evaluation, implementations will be required to
> diagnose cases like the above (during constant evaluation). As is,
> examples similar to the one above can demonstrate surprising
> results since there is no defined point at which mutations are
> committed. https://godbolt.org/z/MGbfWKWb7.
>
> There are use cases for a facility like the one that is proposed
> to enable access to an object via an underlying type relationship.
> For example, to load/store an object of enumeration type via an
> underlying integer type. Decoupling the cast capabilities from the
> UTF concerns would enable additional use cases. Given the
> existence of functions that have a wide contract with respect to
> well-formed UTF input, is it desirable for the cast facility to be
> concerned with encoding matters at all? Does providing two cast
> operations (cast_as_utf, cast_utf_to) help to prevent programming
> mistakes?
>
> The motivation for this facility is that either
> - We can't afford a copy (otherwise making a copy is preferable)
> - We want to retain contiguous ranges (otherwise a random access
> range doing a bit_cast is preferable)
> - We need to retain a pointer for compat with existing interfaces
> consuming pointers.
>
> Note that the second point has limited applicability. For all the
> range interfaces, random access to UTF-8 data is not needed.
> Someone who is using a SIMD library might need that, though.
>
> So I think this is quite specific.
> Besides that, this operation also changes the domain of the values, and I
> think that is worth conveying in the interface; it's one less footgun.

That last sentence is the one that doesn't resonate with me. I don't see
how tying these interfaces to encoding concerns reduces footguns. I find
the names unintuitive since the data is always UTF encoded and the
ordinary and wide character encodings might be UTF encodings; it is UTF
text on both sides of each operation and the operations are not
sensitive to the encodings or actual values. I only see the need for one
interface; one that converts between types that share an underlying type.
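
As an illustration of the non-UTF case, one well-defined way today to load and
store an enumeration through its underlying type is to copy the object
representation. This is a minimal sketch only: load_underlying(),
store_underlying(), and the enumeration color are hypothetical names invented
for the example and are not part of P2626.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

// Hypothetical enumeration with a fixed underlying type.
enum class color : unsigned short { red = 1, green = 2, blue = 3 };

// Read an enumeration object as a value of its underlying type.
template <class E>
std::underlying_type_t<E> load_underlying(const E& e) {
    static_assert(std::is_enum<E>::value, "E must be an enumeration type");
    std::underlying_type_t<E> u;
    std::memcpy(&u, &e, sizeof u);
    return u;
}

// Write a value of the underlying type into an enumeration object.
template <class E>
void store_underlying(E& e, std::underlying_type_t<E> u) {
    static_assert(std::is_enum<E>::value, "E must be an enumeration type");
    std::memcpy(&e, &u, sizeof u);
}
```

Nothing here depends on encoding. A single cast interface keyed on the
underlying type relationship would cover this case and the charN_t cases
alike, without the copies.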

Tom.

Received on 2024-01-09 21:22:02