Agenda for the 2024-01-10 SG16 meeting

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 8 Jan 2024 19:29:15 -0500
Happy New Year! Time to get back to work...

SG16 will hold a meeting on Wednesday, January 10th, at 19:30 UTC
(timezone conversion
<https://www.timeanddate.com/worldclock/converter.html?iso=20240110T193000&p1=1440&p2=tz_pst&p3=tz_mst&p4=tz_cst&p5=tz_est&p6=tz_cet>).

The agenda follows.

  * CWG 2843: Undated reference to Unicode makes C++ a moving target
    <https://cplusplus.github.io/CWG/issues/2843.html>
  * P2626R0: charN_t incremental adoption: Casting pointers of UTF
    character types <https://wg21.link/p2626r0>

These are both big topics and I don't expect us to exhaust discussion of
either one during this meeting. We'll strive to limit discussion to 45
minutes for each.

CWG 2843 was recently created following Jonathan Wakely's post to the
SG16 mailing list <https://lists.isocpp.org/sg16/2024/01/4032.php>
regarding observable behavioral changes imposed by Unicode 15.1 relative
to Unicode 15.0. The initial concern Jonathan reported is that Unicode
15.1 changes the grapheme cluster segmentation rules in a way that will
impact the field width estimation performed by std::format (see
[format.string.std]p13 <https://eel.is/c++draft/format.string.std#13>).
Some code point sequences containing Indic text will be assigned a
shorter field width estimation as a result. The UTC decision is recorded
in section 5.5 (Grapheme clusters for Indic scripts) of L2/23-079
<https://www.unicode.org/L2/L2023/23079-utc175-properties-recs.pdf> and
the changes are noted in the Unicode 15.1 release announcement
<https://www.unicode.org/versions/Unicode15.1.0/>. The email discussion
has mainly focused on the ramifications of an undated normative
reference to the Unicode Standard as adopted for C++23 via P2736R2
(Referencing The Unicode Standard) <https://wg21.link/p2736r2>. That
adoption followed discussion of P2736 (Referencing The Unicode Standard)
<https://wg21.link/p2736> and the C++23 FR-010-133
<https://github.com/cplusplus/nbballot/issues/412> and FR-021-013
<https://github.com/cplusplus/nbballot/issues/423> NB comments during
the following SG16 meetings.

  * 2022-11-02 SG16 meeting
    <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2022.md#november-2nd-2022>
  * 2022-11-30 SG16 meeting
    <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2022.md#november-30th-2022>
  * 2022-12-14 SG16 meeting
    <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2022.md#december-14th-2022>
  * 2023-01-11 SG16 meeting
    <https://github.com/sg16-unicode/sg16-meetings#january-11th-2023>
  * 2023-01-25 SG16 meeting
    <https://github.com/sg16-unicode/sg16-meetings#january-25th-2023>
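To make the field width concern described above concrete, the following
is a toy sketch and not the algorithm any implementation uses. It assumes
a hypothetical three-entry classification table (real implementations
consult the Unicode Character Database) and a drastically simplified
stand-in for the conjunct rule added in Unicode 15.1. The names classify,
count_clusters, and kKaViramaSsa are invented for illustration. Each
cluster in this example has an estimated width of 1, so the cluster count
equals the estimated field width.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical, minimal classification covering only the code points
// used in this example.
enum class Kind { Consonant, Virama, Other };

Kind classify(char32_t c) {
    if (c == U'\x0915' || c == U'\x0937') return Kind::Consonant; // KA, SSA
    if (c == U'\x094D') return Kind::Virama;                      // VIRAMA
    return Kind::Other;
}

// Counts clusters under a simplified model:
//  - a virama always attaches to the preceding code point;
//  - when join_conjuncts is true, a consonant that directly follows
//    "consonant, virama" also attaches (a rough stand-in for the
//    Unicode 15.1 behavior; the real rule is more general).
std::size_t count_clusters(const std::u32string& s, bool join_conjuncts) {
    std::size_t clusters = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const Kind k = classify(s[i]);
        bool attaches = false;
        if (i > 0 && k == Kind::Virama) {
            attaches = true;
        } else if (join_conjuncts && i >= 2 && k == Kind::Consonant &&
                   classify(s[i - 1]) == Kind::Virama &&
                   classify(s[i - 2]) == Kind::Consonant) {
            attaches = true;
        }
        if (!attaches) ++clusters;
    }
    return clusters;
}

// DEVANAGARI LETTER KA, SIGN VIRAMA, LETTER SSA.
const std::u32string kKaViramaSsa = {U'\x0915', U'\x094D', U'\x0937'};

void check_example() {
    assert(count_clusters(kKaViramaSsa, false) == 2); // 15.0-like model
    assert(count_clusters(kKaViramaSsa, true) == 1);  // 15.1-like model
}
```

Under this model, the same three code points occupy two columns with the
older rules and one column with the newer rules, so a format
specification with a fixed minimum width would emit a different amount of
padding depending on which Unicode version the implementation uses.
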

The undated reference to the Unicode Standard presumably gives
implementors license to use whichever version of the Unicode Standard
they prefer and to change that version for their C++23 conformance
modes over time. This flexibility provides some benefits, but it comes
at the expense of portability guarantees. We'll discuss the status quo
and whether a change (presumably proposed as a C++23 DR) is warranted.
Possible changes include:

  * Mandating use of a particular Unicode Standard version for each C++
    standard. For example, C++23 might be specified to use Unicode 15
    and C++26 to use Unicode 19.
  * Mandating use of a minimum Unicode Standard version for each C++
    standard. In this case, the Unicode Standard version being used
    would be implementation-defined and implementors could choose to
    change it over time. For example, gcc 14.1 might use Unicode 15.1 in
    its C++23 conformance mode and gcc 17.1 might use Unicode 18.0 for
    both its C++23 and C++26 conformance modes.
  * Retaining the status quo.

The C++ standard does not currently acknowledge the potential for
different Unicode Standard versions to be used for core language
support, compile-time standard library features, and run-time standard
library features. Coordinating versions across the complete
implementation may not lie within the control of a single implementor,
particularly as we add new Unicode features that implementors might
prefer to provide via delegation to a platform Unicode support library.

P2626 was last discussed during the 2022-08-24 SG16 meeting
<https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2022.md#august-24th-2022>.
A couple of requests were made during that meeting that are yet to be
addressed in a new revision:

  * Victor requested that the paper be updated to explicitly state early
    in the paper what properties of the types must match for the
    operations to be well-formed.
  * Jens asked if the paper includes examples that are reflective of how
    this facility would be used in something like real-world code.
    (I'm interpreting this as a request for such examples; the examples
    in the "Tony table" section of the paper are minimal.)

See this SG16 email thread with subject "An alternative interface for
P2626R0 ..." <https://lists.isocpp.org/sg16/2022/09/3389.php> from
September, 2022 for some alternative considerations.

There are two primary design questions that I would like to see us make
progress on.

 1. How is the duration of access via one type versus another managed,
    or how should it be?
 2. Should the ability to cast between underlying types be decoupled
    from the UTF concerns?

The proposed cast utilities return a pointer to the first element of the
array for the newly anointed type. Once one of those utilities is
invoked, access to the array and its elements must be performed via the
returned pointer; the utilities effectively opt in to a pointer
provenance model of access and access via the original object becomes
UB. Since no facility is provided to undo those effects, there does not
appear to be a way to restore access through the original object
declaration. The following example illustrates a potentially desirable
use case: providing a char8_t-based wrapper around an existing function
that processes UTF-8 text in char-based storage.

    void process_as_utf8(const char *p, size_t N) { ... }
    void process_as_utf8(const char8_t *p, size_t N) {
       process_as_utf8(cast_as_utf_unchecked(p, N), N);
    }
    void f() {
       char8_t text[] = u8"Zoom";
       process_as_utf8(text, sizeof(text));
       text[0] = u8'B'; // UB? How can access be restored?
    }

Use of these utilities in real-world use cases will, I think, require
that the duration of their effects be precisely specified or that a
facility be provided to restore access via an existing variable
declaration. Since these utilities are intended to be used in constant
evaluation, implementations will be required to diagnose cases like the
above (during constant evaluation). As is, examples similar to the one
above can demonstrate surprising results since there is no defined point
at which mutations are committed (see <https://godbolt.org/z/MGbfWKWb7>).

There are use cases for a facility like the one that is proposed to
enable access to an object via an underlying type relationship. For
example, to load/store an object of enumeration type via an underlying
integer type. Decoupling the cast capabilities from the UTF concerns
would enable additional use cases. Given the existence of functions that
have a wide contract with respect to well-formed UTF input, is it
desirable for the cast facility to be concerned with encoding matters at
all? Does providing two cast operations (cast_as_utf, cast_utf_to) help
to prevent programming mistakes?
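
For comparison, here is a minimal sketch of how the enumeration use case
mentioned above can be handled today without any new cast facility, by
copying the object representation with std::memcpy. The names Color,
load_underlying, and store_underlying are hypothetical and do not come
from P2626. A copy sidesteps the duration-of-access question, but it
does not give in-place access to an array of elements, which is what the
proposed casts are after.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical enumeration with a fixed underlying type.
enum class Color : std::uint8_t { Red = 1, Green = 2, Blue = 3 };

// Reads the value of an enumeration object as its underlying type by
// copying its object representation.
template <class E>
std::underlying_type_t<E> load_underlying(const E& e) {
    static_assert(std::is_enum_v<E>, "E must be an enumeration type");
    std::underlying_type_t<E> v;
    std::memcpy(&v, &e, sizeof v);
    return v;
}

// Writes a value of the underlying type into an enumeration object.
// The caller is responsible for passing a value that is valid for E.
template <class E>
void store_underlying(E& e, std::underlying_type_t<E> v) {
    static_assert(std::is_enum_v<E>, "E must be an enumeration type");
    std::memcpy(&e, &v, sizeof v);
}

void enum_round_trip_example() {
    Color c = Color::Red;
    store_underlying(c, std::uint8_t{3});
    assert(c == Color::Blue);
    assert(load_underlying(c) == 3);
}
```
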

Tom.

Received on 2024-01-09 00:29:20