ISOCPP sg16 List: [isocpp-sg16] Agenda for the 2024-05-22 SG16 meeting

From: Tom Honermann <tom_at_[hidden]>
Date: Sat, 18 May 2024 13:12:44 -0400

SG16 will hold a meeting on Wednesday, May 22nd, at 19:30 UTC (timezone
conversion
<https://www.timeanddate.com/worldclock/converter.html?iso=20240522T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>).

The agenda follows.

  * Fraser to report on the May 3rd Text Terminal WG meeting.
  * Review results of the 2024 C++ Developer Survey.
  * P2626R0: charN_t incremental adoption: Casting pointers of UTF
    character types <https://wg21.link/p2626r0>.

The results of the 2024 C++ Developer Survey were recently posted
(summary results
<https://wiki.edg.com/pub/Wg21tokyo2024/Documents/CppDevSurvey-2024-summary.pdf>,
detailed results
<https://wiki.edg.com/pub/Wg21tokyo2024/Documents/CppDevSurvey-2024-writeins.pdf>).
Question 6, "Which of these do you find frustrating about C++
development?", included a new response category this year, "Unicode,
internationalization, and localization". Of the 17 categories, this one
ranked 12th. The responses broke down as follows:

  * _Major pain point_
    16.56%, 205 respondents.
  * _Minor pain point_
    29.32%, 363 respondents.
  * _Not a significant issue_
    54.12%, 670 respondents.

Approximately 46% of respondents claimed this category as a pain point.
Not that we weren't already aware, but we clearly have work to do.

I audited the write-in responses that mentioned SG16-related terminology
(Unicode, character, encoding, UTF, char/N/_t, text). The relevant
comments follow; the portions in bold are comments with realistic and
clearly actionable complaints, requests, or suggestions.

  * Unicode seems to be progressing nicely.
  * Unicode support. It should be standard in C++11, let alone <current
    year>.
  * Lack of Unicode - *no (clean and efficient) std functionality for
    converting from UTF-8 (char[8_t])/UTF-16 (char16_t)/UTF-32
    (char32_t) to any of the other types* *Lack of other character
    support: - like [to/from]_chars only supporting char, not char16_t
    or char32_t* even though it is based on implementations that do
    support all char types (at least the from_chars).
  * Some of the features I expected have not come out (network, Unicode
    support and so on) whereas it's already part of other languages
    standard library
  * Removed unicode support
  * Valuable committee time is wasted in discussing such facilities
    while there is STILL no reasonable Unicode support (we are talking
    about text, simple text!).
  * Basic things like networking and unicode helpers are not present in
    standard library.
  * I guess first class utf8/unicode support not improving as fast as
    I'd like it to. In 2024 it's still not very easy to write
    Unicode-aware apps which seamlessly deal with encodings, conversions
    between encodings - none of this is a first-class citizen of the
    language.
  * Internationalization and unicode support.
  * Better Unicode support in STL.
  * Full Unicode support.
  * Unicode friendly stl support for localization.
  * Proper Unicode support. In MS Windows development, virtually all
    user input is UTF-16LE in the form of wchar_t and variants. I
    convert that to UTF-8 via wrapper functions that use third-party
    Unicode libraries (uni-algo in my case) that (can) use std::string.
    *Things that should be simple but aren't in Unicode, like case
    conversion and case-insensitive comparison, should be provided for.*
    This would reduce the pain point of third-party libraries.
  * Unicode very important.
  * I would change the way characters and strings are represented. The
    Rust model is so much better. In practice, that means the character
    type is not integral, there are no null terminators, and everything
    is UTF-8 by default.
  * STL: missing basic components (filesystem / network / UTF-8
    encodings), not specified implementation of e.g. std::string (e.g.
    Implicit Sharing).
  * std::text_encoding.
  * utf8 support across platforms.
  * char8_t and breaking change of u8"" string literals I've been using
    a relaxed variant of the "UTF8 everywhere" manifesto in my Windows
    app with zero problems for over a decade, so std::string rules the
    roost for UTF-8 with me. C++20 char8_t and breaking u8"" behavior
    gets in the way. Need to use non-portable techniques of naked UTF-8
    string literals via MSVC /utf-8 option.
  * make utf8 the one and only type of string in the entire universe!
  * remove wide strings since they are not wide enoough on some
    platforms and just use std::strings as utf8.
  * reconsider cases where std:: can do things, but it's a horrible mess
    (like ASCII->UTF-8) to be more development and readability focussed.
    deprecate std::Xstream << operator overloading - it's horribly
    unreasonable for young devs to learn about operator overloading in
    their hello world apps, and there's 1000 more things wrong with
    those streams...
  * Utf8 std::string.
  * Ditching char8_t.
  * Whole char8_t fiasco (introduction of this type is a mistake).

Most of these comments aren't particularly actionable; what exactly does
providing "better" or "full" Unicode support entail? Others, like those
related to UTF-8-ing all the things, aren't feasible. My interpretation
of the above is that we can make concrete improvements by doing the
following:

1. Add support for encoding conversions.
2. Add support for char/N/_t in std::from_chars() and std::to_chars().
3. Add support for Unicode-aware case conversions and case-insensitive
    comparisons.
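
To make item 2 concrete, here is a minimal sketch of the kind of
workaround that is needed today because std::from_chars() accepts only
char. The helper name parse_int_utf16 is hypothetical and is not taken
from any paper or library; it assumes that only ASCII code units can
appear in a valid decimal integer.

```cpp
#include <cassert>
#include <charconv>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>

// Hypothetical helper (not a standard or proposed API): parse a decimal
// int from UTF-16 text by narrowing to char first, since std::from_chars
// has no char16_t overload.
inline std::optional<int> parse_int_utf16(std::u16string_view text) {
    std::string narrow;
    narrow.reserve(text.size());
    for (char16_t cu : text) {
        if (cu > 0x7F) {
            return std::nullopt;  // non-ASCII cannot be part of the number
        }
        narrow.push_back(static_cast<char>(cu));
    }
    int value = 0;
    const char* first = narrow.data();
    const char* last = first + narrow.size();
    auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{} || ptr != last) {
        return std::nullopt;  // parse error or trailing characters
    }
    return value;
}
```

The extra copy and the ASCII check are exactly the overhead that direct
char/N/_t overloads would remove.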

Much of the following is copy/paste from the agenda sent for the
2024-01-10 SG16 meeting <https://lists.isocpp.org/sg16/2024/01/4080.php>
where I had planned for us to discuss P2626R0; we did not get to it due
to time constraints.

P2626R0 <https://wg21.link/p2626r0> was last discussed during the
2022-08-24 SG16 meeting
<https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2022.md#august-24th-2022>.
A few requests made during that meeting, and since, have yet to be
addressed in a new revision:

  * Victor requested that the paper be updated to explicitly state early
    in the paper what properties of the types must match for the
    operations to be well-formed.
  * Jens asked if the paper includes examples that are reflective of how
    this facility would be used in something like real world code.
    (I'm interpreting this as a request for such examples; the examples
    in the "Tony table" section of the paper are minimal)
  * Tom requested that the paper include an example of changes that
    might be made to ICU to use the proposed facilities. E.g., how
    U_ALIASING_BARRIER
    <https://github.com/unicode-org/icu/blob/0ef4da943c1cfc694e84fcb85cee5c78bae89d71/icu4c/source/common/unicode/char16ptr.h#L30-L36>
    and its uses would be changed.

See this SG16 email thread with subject "An alternative interface for
P2626R0 ..." <https://lists.isocpp.org/sg16/2022/09/3389.php> from
September, 2022 for some alternative considerations.

There are two primary design questions that I would like to see us make
progress on.

1. How is the duration of access by one type vs another managed (or
    how should it be)?
2. Should the ability to cast between underlying types be decoupled
    from the UTF concerns?

Consider the following example that illustrates a potentially desirable
use case: providing a char8_t-based wrapper around an existing function
that processes UTF-8 text in char-based storage. The use of
cast_as_utf_unchecked(), according to the proposed wording, ends the
lifetime of the range of code units in the text array and returns a
pointer to a new set of objects constructed in their place (with object
representation preserved). Following that cast, access to the array and
its elements must be performed via the returned pointer and access via
the original object becomes UB.

    void process_as_utf8(char *p, size_t N);
    inline void process_as_utf8(char8_t *p, size_t N) {
        // Forward to the char-based overload on the same storage.
        process_as_utf8(cast_as_utf_unchecked(p, N), N);
    }
    void f() {
        char8_t text[] = u8"Zoom";
        process_as_utf8(text, sizeof(text));
        CHECK(text[0] == u8'B'); // UB.
    }

The paper does not propose an explicit "undo" operation, so it is
unclear (at least to me) how valid access through the original object
declaration can be restored. Perhaps the intent is that programmers do
something like the following to undo a previous cast operation?

    inline void process_as_utf8(char8_t *p, size_t N) {
        char *p_as_char = cast_as_utf_unchecked(p, N);
        process_as_utf8(p_as_char, N);
        // Cast again, discarding the result, to undo the earlier cast.
        (void)cast_utf_to<char>(p_as_char, N);
    }

What (I think) is missing is any connection to the original declaration
of text; I am uncertain that the transparently replaceable rules
([basic.life]p8 <http://eel.is/c++draft/basic.life#8>) suffice to cover
this situation. I am concerned about how TBAA is preserved and at what
point modifications made via one type are reflected for aliasing
purposes by the other type; consider the case of the char overload of
process_as_utf8() mutating the string with p[0] = 'B' as the CHECK()
operation expects. We may have to seek guidance from CWG for these concerns.

Use of these utilities in real world use cases will, I think, require
that the duration of their effects be precisely specified. Since these
utilities are intended to be used in constant evaluation,
implementations will be required to diagnose UB in cases like the above
(during constant evaluation). As is, examples similar to the one above
can demonstrate surprising results since, I think, there is no defined
point at which mutations are commuted. See
<https://godbolt.org/z/MGbfWKWb7> (unfortunately, that fork of Clang is
broken at the moment).

There are use cases for a facility like the one that is proposed to
enable access to an object via an underlying type relationship. For
example, to load/store an object of enumeration type via an underlying
integer type. Decoupling the cast capabilities from the UTF concerns
would enable additional use cases. Given the existence of functions that
have a wide contract with respect to well-formed UTF input, is it
desirable for the cast facility to be concerned with encoding matters at
all? Does providing two cast operations (cast_as_utf, cast_utf_to) help
to prevent programming mistakes?
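
The enumeration use case mentioned above can be sketched as follows.
Status, g_register, read_register, and write_register are hypothetical
names standing in for an integer-only interface. Today the portable
approach is to copy through a temporary of the underlying type, because
accessing the enumeration object directly through a std::int32_t pointer
would violate the aliasing rules. A cast facility decoupled from UTF
concerns could presumably remove the need for that temporary.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

enum class Status : std::int32_t { Ok = 0, Busy = 1, Failed = 2 };

// Hypothetical integer-only interface (e.g., a legacy or device API).
inline std::int32_t g_register = 1;
inline void read_register(std::int32_t* out) { *out = g_register; }
inline void write_register(const std::int32_t* in) { g_register = *in; }

// Load: read into a temporary of the underlying type, then convert.
inline Status load_status() {
    std::underlying_type_t<Status> raw{};
    read_register(&raw);
    return static_cast<Status>(raw);
}

// Store: convert to the underlying type, then pass its address.
inline void store_status(Status s) {
    const auto raw = static_cast<std::underlying_type_t<Status>>(s);
    write_register(&raw);
}
```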

Tom.

Received on 2024-05-18 17:12:48