On 1/9/24 3:37 AM, Corentin Jabot wrote:


On Tue, Jan 9, 2024 at 1:29 AM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

Happy New Year! Time to get back to work...

SG16 will hold a meeting on Wednesday, January 10th, at 19:30 UTC (timezone conversion).

The agenda follows.

These are both big topics and I don't expect us to exhaust discussion of either one during this meeting. We'll strive to limit discussion to 45 minutes for each.

CWG 2843 was recently created following Jonathan Wakely's post to the SG16 mailing list regarding observable behavioral changes imposed by Unicode 15.1 relative to Unicode 15.0. The initial concern Jonathan reported is that Unicode 15.1 changes the grapheme cluster segmentation rules in a way that will impact the field width estimation performed by std::format (see [format.string.std]p13). Some code point sequences containing Indic text will be assigned a shorter field width estimation as a result.
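
To make the concern concrete, here is a minimal sketch of how the estimate is formed: the width of a string is the sum, over its extended grapheme clusters, of the width of each cluster's first code point. The helper names (`estimated_width`, `field_width`, `ksha_unicode_15_0`, `ksha_unicode_15_1`) are hypothetical, the list of wide ranges is deliberately abbreviated, and the clusters are segmented by hand rather than computed, so treat this as an illustration and not as what any standard library does.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for the width rule in [format.string.std]: a code point
// counts as 2 if it falls in a wide range, otherwise 1. Only a small,
// abbreviated subset of the wide ranges is listed here (assumption).
constexpr int estimated_width(char32_t cp) {
    return ((cp >= 0x1100 && cp <= 0x115F)       // Hangul Jamo
         || (cp >= 0x4E00 && cp <= 0x9FFF)       // CJK Unified Ideographs
         || (cp >= 0x1F300 && cp <= 0x1F5FF))    // pictographs
        ? 2 : 1;
}

// Width of text that has already been split into grapheme clusters:
// the sum of the widths of the first code point of each cluster.
int field_width(const std::vector<std::u32string>& clusters) {
    int width = 0;
    for (const auto& cluster : clusters)
        if (!cluster.empty())
            width += estimated_width(cluster.front());
    return width;
}

// U+0915 U+094D U+0937 (KA, VIRAMA, SSA), segmented by hand.
// Unicode 15.0 rules: the virama extends KA; SSA starts a new cluster.
std::vector<std::u32string> ksha_unicode_15_0() {
    return {U"\u0915\u094D", U"\u0937"};
}

// Unicode 15.1 rules (GB9c): the whole conjunct stays in one cluster.
std::vector<std::u32string> ksha_unicode_15_1() {
    return {U"\u0915\u094D\u0937"};
}
```

Under these assumptions the same three code points are estimated at width 2 with the 15.0 segmentation and width 1 with the 15.1 segmentation, which is the kind of observable change being discussed.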

The UTC decision is recorded in section 5.5 (Grapheme clusters for Indic scripts) of L2/23-079 and the changes are noted in the Unicode 15.1 release announcement. The email discussion has mainly focused on the ramifications of an undated normative reference to the Unicode Standard as adopted for C++23 via P2736R2 (Referencing The Unicode Standard). That adoption followed discussion of P2736 (Referencing The Unicode Standard) and the C++23 FR-010-133 and FR-021-013 NB comments during the following SG16 meetings.

The undated reference to the Unicode Standard presumably provides implementors a license to use whichever version of the Unicode Standard is preferred and to change that version for their C++23 conformance modes over time. This flexibility provides some benefits, but that comes at the expense of portability guarantees. We'll discuss the status quo and whether a change (presumably proposed as a C++23 DR) is warranted. Possible changes include:

  • Mandating use of a particular Unicode Standard version for each C++ standard. For example, C++23 might be specified to use Unicode 15 and C++26 to use Unicode 19.
  • Mandating use of a minimum Unicode Standard version for each C++ standard. In this case, the Unicode Standard version being used would be implementation-defined and implementors could choose to change it over time. For example, gcc 14.1 might use Unicode 15.1 in its C++23 conformance mode and gcc 17.1 might use Unicode 18.0 for both its C++23 and C++26 conformance modes.
As far as I can tell, this was the intent; I think it just got lost in translation.
I do not expect implementers to be on board with having to support multiple Unicode versions (I am certainly not); the burden is just too great (in terms of work, data size and/or performance, and making it harder or impossible to use ICU, ICU4X, or another library).
We never got to the end of the discussion of the guarantees we make, though. But really, we should not make more guarantees than what Unicode provides, including for width estimation.
For UAX, the stability concerns were discussed at length.
This matches my recollection and intuition as well.

  • Retaining the status quo.

The C++ standard does not currently acknowledge the potential for different Unicode Standard versions to be used for core language support, compile-time standard library features, and run-time standard library features. Coordinating versions across the complete implementation may not lie within the control of a single implementor; particularly as we add new Unicode features that implementors might prefer to provide via delegation to a platform Unicode support library.

P2626 was last discussed during the 2022-08-24 SG16 meeting. A couple of requests were made during that meeting that are yet to be addressed in a new revision:

  • Victor requested that the paper be updated to explicitly state early in the paper what properties of the types must match for the operations to be well-formed.
  • Jens asked if the paper includes examples that are reflective of how this facility would be used in something like real world code.
    (I'm interpreting this as a request for such examples; the examples in the "Tony table" section of the paper are minimal)

See this SG16 email thread with subject "An alternative interface for P2626R0 ..." from September, 2022 for some alternative considerations.

There are two primary design questions that I would like to see us make progress on.

  1. How is the duration of access by one type versus another managed, or how should it be?
  2. Should the ability to cast between underlying types be decoupled from the UTF concerns?

The proposed cast utilities return a pointer to the first element of the array for the newly anointed type. Once one of those utilities is invoked, access to the array and its elements must be performed via the returned pointer; the utilities effectively opt-in to a pointer provenance model of access and access via the original object becomes UB. Since no facility is provided to undo those effects, there does not appear to be a way to restore access through the original object declaration. The following example illustrates a potentially desirable use case; to provide a char8_t-based wrapper around an existing function that processes UTF-8 text in char-based storage.

Performing the reverse operation does "reverse the effect".
Can you show how you would modify the example below to perform that reversing operation? My understanding is that the call to cast_as_utf_unchecked() ends the lifetime of the char8_t array object associated with f()::text (this is what the proposed wording states) and creates a new sequence of char objects at the same location. Adding a call to cast_utf_to() would then end the lifetime of the sequence of char objects and create a new sequence of char8_t objects (again) in that same location, but I don't think this suffices to satisfy the transparently replaceable rules ([basic.life]p8) since the storage was reused. Perhaps you are envisioning an extension to those rules to allow this?
The reason no RAII wrapper is provided (that's what you are suggesting, right?) is because, should the caller take a copy of the input pointer, accessing it after the function returns could still lead to undesirable outcome.

I'm not necessarily advocating for an RAII wrapper and I can see use cases for explicitly managing the duration of valid access by the type of the cast.

What I am concerned about is how TBAA is preserved and at what point changes made via one type are reflected for aliasing purposes by the other type.

I keep repeating this, but in the absence of borrowing, this is a very sharp-edged interface. It's in the same category as launder: unapologetically expert-friendly.
It is both very dangerous and very necessary.
I agree; I would like to dull the edges as much as possible.
I do expect it to be used by ICU (to replace their current workarounds, which we know can be defeated), std::format, and Unicode facilities. Not in user code.

I think it would be helpful to include in the paper what the changes to ICU would look like.

I also think this facility is needed for user code so that authors of existing libraries that use char/wchar_t can provide wrapping interfaces for char8_t/charN_t.


Any attempt to provide a safe interface boils down to trying to implement some kind of borrowing and exclusive ownership.
There is room to implement an RAII wrapper on top of this interface if someone needs that. And arguably, it might be the primary use case. But I'm concerned that hiding the sharpness might do more harm than good.


void process_as_utf8(const char *p, size_t N) { ... }
void process_as_utf8(const char8_t *p, size_t N) {
  process_as_utf8(cast_as_utf_unchecked(p, N), N); // forward the length to the char overload
}
void f() {
  char8_t text[] = u8"Zoom";
  process_as_utf8(text, sizeof(text));
  text[0] = u8'B'; // UB? How can access be restored?
}

Use of these utilities in real world use cases will, I think, require that the duration of their effects be precisely specified or that a facility is provided to restore access via an existing variable declaration. Since these utilities are intended to be used in constant evaluation, implementations will be required to diagnose cases like the above (during constant evaluation). As is, examples similar to the one above can demonstrate surprising results since there is no defined point at which mutations are committed. https://godbolt.org/z/MGbfWKWb7.

There are use cases for a facility like the one that is proposed to enable access to an object via an underlying type relationship. For example, to load/store an object of enumeration type via an underlying integer type. Decoupling the cast capabilities from the UTF concerns would enable additional use cases. Given the existence of functions that have a wide contract with respect to well-formed UTF input, is it desirable for the cast facility to be concerned with encoding matters at all? Does providing two cast operations (cast_as_utf, cast_utf_to) help to prevent programming mistakes?
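
For comparison, here is a minimal sketch of how an enumeration object can be loaded and stored through its underlying integer type today without any aliasing, by copying the value. The `Channel` enumeration and the `load`/`store` helpers are hypothetical names used only for illustration; a decoupled cast facility would aim to allow the same access through a pointer to the underlying type instead of through a copy.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical enumeration with a fixed underlying type.
enum class Channel : std::uint8_t { Red = 1, Green = 2, Blue = 3 };

// Load: copy the value out as the underlying integer type.
constexpr std::underlying_type_t<Channel> load(const Channel& c) {
    return static_cast<std::underlying_type_t<Channel>>(c);
}

// Store: convert an integer back and assign it. The enumeration object is
// only ever accessed through its own type, so no aliasing is involved.
constexpr void store(Channel& c, std::underlying_type_t<Channel> v) {
    c = static_cast<Channel>(v);
}
```

This works for a single object, but it does not help when an existing interface expects a pointer to a contiguous array of the underlying type, which is where a cast facility would matter.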

The motivation for this facility is that either
  - We can't afford a copy (otherwise making a copy is preferable)
  - We want to retain contiguous ranges (otherwise a random access range doing a bit_cast is preferable)
  - We need to retain a pointer for compat with existing interfaces consuming pointers.

Note that the second point has limited applicability. For all the range interfaces, random access to UTF-8 data is not needed.
Someone who is using a SIMD library might need that, though.

So I think this is quite specific.
Besides that, this operation also changes the domain of the values, and I think that is worth conveying in the interface; it's one less footgun.

That last sentence is the one that doesn't resonate with me. I don't see how tying these interfaces to encoding concerns reduces footguns. I find the names unintuitive since the data is always UTF encoded and the ordinary and wide character encodings might be UTF encodings; it is UTF text on both sides of each operation and the operations are not sensitive to the encodings or actual values. I only see the need for one interface; one that converts between types that share an underlying type.

Tom.