Oops, I failed to send my normal reminder yesterday, so this is your friendly reminder that this meeting is happening TODAY, in about 2 1/2 hours. See you soon!

Tom.

On 5/18/24 1:12 PM, Tom Honermann via SG16 wrote:

SG16 will hold a meeting on Wednesday, May 22nd, at 19:30 UTC (timezone conversion).

The agenda follows.

The results of the 2024 C++ Developer Survey were recently posted (summary results, detailed results). Question 6, "Which of these do you find frustrating about C++ development?", included a new response category this year, "Unicode, internationalization, and localization". Of the 17 categories, this one ranked 12th. The responses broke down as follows:

Approximately 46% of respondents claimed this category as a pain point. Not that we weren't already aware, but we clearly have work to do. 

I audited the write-in responses that mentioned SG16-related terminology (Unicode, character, encoding, UTF, charN_t, text). The relevant comments follow; the portions in bold are comments with realistic and clearly actionable complaints, requests, or suggestions.

Most of these comments aren't particularly actionable; what exactly does providing "better" or "full" Unicode support entail? Others, like those related to UTF-8-ing all the things, aren't feasible. My interpretation of the above is that we can make concrete improvements by doing the following:

  1. Add support for encoding conversions.
  2. Add support for charN_t in std::from_chars() and std::to_chars().
  3. Add support for Unicode-aware case conversions and case-insensitive comparisons.

Much of the following is copy/paste from the agenda sent for the 2024-01-10 SG16 meeting, where I had planned for us to discuss P2626R0; we didn't get to it due to time constraints.

P2626R0 was last discussed during the 2022-08-24 SG16 meeting. A few requests were made during that meeting and since that are yet to be addressed in a new revision:

See this SG16 email thread with subject "An alternative interface for P2626R0 ..." from September, 2022 for some alternative considerations.

There are two primary design questions that I would like to see us make progress on.

  1. How is the duration of access by one type vs another managed (or how should it be)?
  2. Should the ability to cast between underlying types be decoupled from the UTF concerns?

Consider the following example that illustrates a potentially desirable use case: providing a char8_t-based wrapper around an existing function that processes UTF-8 text in char-based storage. The use of cast_utf_to<char>(), according to the proposed wording, ends the lifetime of the range of code units in the text array and returns a pointer to a new set of objects constructed in their place (with object representation preserved). Following that cast, access to the array and its elements must be performed via the returned pointer and access via the original object becomes UB.

void process_as_utf8(char *p, size_t N);
inline void process_as_utf8(char8_t *p, size_t N) {
  // Cast from the UTF type to char, then forward to the char-based overload.
  process_as_utf8(cast_utf_to<char>(p, N), N);
}
void f() {
  char8_t text[] = u8"Zoom";
  process_as_utf8(text, sizeof(text));
  CHECK(text[0] == u8'B'); // UB.
}

The paper does not propose an explicit "undo" operation, so it is unclear (at least to me) how valid access through the original object declaration can be restored. Perhaps the intent is that programmers do something like the following to undo a previous cast operation?

inline void process_as_utf8(char8_t *p, size_t N) {
  char *p_as_char = cast_utf_to<char>(p, N);
  process_as_utf8(p_as_char, N);
  // Cast back to the UTF type to "undo" the earlier cast.
  (void)cast_as_utf_unchecked(p_as_char, N);
}

What (I think) is missing is any connection to the original declaration of text; I am uncertain that the transparently replaceable rules ([basic.life]p8) suffice to cover this situation. I am concerned about how TBAA is preserved and at what point modifications made via one type are reflected for aliasing purposes by the other type; consider the case of the char overload of process_as_utf8() mutating the string with p[0] = 'B' as the CHECK() operation expects. We may have to seek guidance from CWG for these concerns.

Use of these utilities in real-world use cases will, I think, require that the duration of their effects be precisely specified. Since these utilities are intended to be used in constant evaluation, implementations will be required to diagnose UB in cases like the above (during constant evaluation). As is, examples similar to the one above can demonstrate surprising results since, I think, there is no defined point at which mutations are committed; see https://godbolt.org/z/MGbfWKWb7 (unfortunately, that fork of Clang is broken at the moment).

There are use cases for a facility like the one that is proposed to enable access to an object via an underlying type relationship. For example, to load/store an object of enumeration type via an underlying integer type. Decoupling the cast capabilities from the UTF concerns would enable additional use cases. Given the existence of functions that have a wide contract with respect to well-formed UTF input, is it desirable for the cast facility to be concerned with encoding matters at all? Does providing two cast operations (cast_as_utf, cast_utf_to) help to prevent programming mistakes?
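
As a point of reference for the enumeration use case, here is a sketch of what is available today: load and store by value conversion through the enum-typed object. The names Color, load_underlying, and store_underlying are hypothetical. A decoupled cast facility would instead, as I understand it, allow access through a pointer to the underlying type.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical enumeration with a fixed underlying type.
enum class Color : std::uint8_t { red = 1, green = 2, blue = 3 };
using ColorRep = std::underlying_type_t<Color>;

// Load: convert the enum value to its underlying integer type.
constexpr ColorRep load_underlying(const Color& c) {
    return static_cast<ColorRep>(c);
}

// Store: convert an underlying-type value back and assign it.
constexpr void store_underlying(Color& c, ColorRep raw) {
    c = static_cast<Color>(raw);
}

// Usage sketch.
inline void underlying_access_demo() {
    Color c = Color::red;
    store_underlying(c, 3);
    assert(c == Color::blue);
    assert(load_underlying(c) == 3);
}
```

Both functions still go through an lvalue of type Color; neither accesses the object via a ColorRep lvalue, which is the capability in question.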

Tom.