This is your friendly reminder that this meeting is taking place tomorrow.
Tom.
Happy New Year! Time to get back to work...
SG16 will hold a meeting on Wednesday, January 10th, at 19:30 UTC (timezone conversion).
The agenda follows.
- CWG 2843: Undated reference to Unicode makes C++ a moving target
- P2626R0: charN_t incremental adoption: Casting pointers of UTF character types
These are both big topics and I don't expect us to exhaust discussion of either one during this meeting. We'll strive to limit discussion to 45 minutes for each.
CWG 2843 was recently created following Jonathan Wakely's post to the SG16 mailing list regarding observable behavioral changes imposed by Unicode 15.1 relative to Unicode 15.0. The initial concern Jonathan reported is that Unicode 15.1 changes the grapheme cluster segmentation rules in a way that will impact the field width estimation performed by std::format (see [format.string.std]p13). Some code point sequences containing Indic text will be assigned a shorter field width estimation as a result. The UTC decision is recorded in section 5.5 (Grapheme clusters for Indic scripts) of L2/23-079 and the changes are noted in the Unicode 15.1 release announcement. The email discussion has mainly focused on the ramifications of an undated normative reference to the Unicode Standard as adopted for C++23 via P2736R2 (Referencing The Unicode Standard). That adoption followed discussion of P2736 (Referencing The Unicode Standard) and the C++23 FR-010-133 and FR-021-013 NB comments during the following SG16 meetings.
- 2022-11-02 SG16 meeting
- 2022-11-30 SG16 meeting
- 2022-12-14 SG16 meeting
- 2023-01-11 SG16 meeting
- 2023-01-25 SG16 meeting
The undated reference to the Unicode Standard presumably provides implementors a license to use whichever version of the Unicode Standard is preferred and to change that version for their C++23 conformance modes over time. This flexibility provides some benefits, but that comes at the expense of portability guarantees. We'll discuss the status quo and whether a change (presumably proposed as a C++23 DR) is warranted. Possible changes include:
- Mandating use of a particular Unicode Standard version for each C++ standard. For example, C++23 might be specified to use Unicode 15 and C++26 to use Unicode 19.
- Mandating use of a minimum Unicode Standard version for each C++ standard. In this case, the Unicode Standard version being used would be implementation-defined and implementors could choose to change it over time. For example, gcc 14.1 might use Unicode 15.1 in its C++23 conformance mode and gcc 17.1 might use Unicode 18.0 for both its C++23 and C++26 conformance modes.
- Retaining the status quo.
The C++ standard does not currently acknowledge the potential for different Unicode Standard versions to be used for core language support, compile-time standard library features, and run-time standard library features. Coordinating versions across the complete implementation may not lie within the control of a single implementor, particularly as we add new Unicode features that implementors might prefer to provide via delegation to a platform Unicode support library.
P2626 was last discussed during the 2022-08-24 SG16 meeting. A couple of requests were made during that meeting that are yet to be addressed in a new revision:
- Victor requested that the paper be updated to explicitly state early in the paper what properties of the types must match for the operations to be well-formed.
- Jens asked if the paper includes examples that are reflective of how this facility would be used in something like real world code.
(I'm interpreting this as a request for such examples; the examples in the "Tony table" section of the paper are minimal)
See this SG16 email thread with subject "An alternative interface for P2626R0 ..." from September 2022 for some alternative considerations.
There are two primary design questions that I would like to see us make progress on.
- How is the duration of access by one type vs another managed, and how should it be?
- Should the ability to cast between underlying types be decoupled from the UTF concerns?
The proposed cast utilities return a pointer to the first element of the array for the newly anointed type. Once one of those utilities is invoked, access to the array and its elements must be performed via the returned pointer; the utilities effectively opt in to a pointer provenance model of access, and access via the original object becomes UB. Since no facility is provided to undo those effects, there does not appear to be a way to restore access through the original object declaration. The following example illustrates a potentially desirable use case: providing a char8_t-based wrapper around an existing function that processes UTF-8 text in char-based storage.
void process_as_utf8(const char *p, size_t N) { ... }
void process_as_utf8(const char8_t *p, size_t N) {
  // Forward both the cast pointer and the length to the char-based overload.
  process_as_utf8(cast_as_utf_unchecked(p, N), N);
}
void f() {
  char8_t text[] = u8"Zoom";
  process_as_utf8(text, sizeof(text));
  text[0] = u8'B'; // UB? How can access be restored?
}
Use of these utilities in real world use cases will, I think, require that the duration of their effects be precisely specified or that a facility be provided to restore access via an existing variable declaration. Since these utilities are intended to be used in constant evaluation, implementations will be required to diagnose cases like the above (during constant evaluation). As is, examples similar to the one above can demonstrate surprising results since there is no defined point at which mutations are committed; see https://godbolt.org/z/MGbfWKWb7.
There are use cases for a facility like the one that is proposed to enable access to an object via an underlying type relationship. For example, to load/store an object of enumeration type via an underlying integer type. Decoupling the cast capabilities from the UTF concerns would enable additional use cases. Given the existence of functions that have a wide contract with respect to well-formed UTF input, is it desirable for the cast facility to be concerned with encoding matters at all? Does providing two cast operations (cast_as_utf, cast_utf_to) help to prevent programming mistakes?
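As a point of reference for the enumeration use case, this is roughly what is available today: conversion by value through the underlying type rather than access to the same object through a pointer to a different type. The enumeration Color and the helpers store and load are hypothetical names used only for illustration.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical enumeration with a fixed underlying type.
enum class Color : std::uint8_t { red = 1, green = 2, blue = 3 };

// Store an enumerator into integer storage by converting its value.
inline void store(Color c, std::uint8_t& slot) {
    slot = static_cast<std::underlying_type_t<Color>>(c);
}

// Load an enumerator back from integer storage by converting its value.
inline Color load(const std::uint8_t& slot) {
    return static_cast<Color>(slot);
}
```

This works element by element; a decoupled cast facility would presumably allow an existing array of one type to be accessed as the other without a copy.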
Tom.