ISOCPP sg16 List: Re: [isocpp-sg16] Agenda for the 2024-05-22 SG16 meeting

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 22 May 2024 12:54:33 -0400

Oops, I failed to send my normal reminder yesterday, so this is your
friendly reminder that this meeting is happening *TODAY*, in about 2 1/2
hours. See you soon!

Tom.

On 5/18/24 1:12 PM, Tom Honermann via SG16 wrote:
>
> SG16 will hold a meeting on Wednesday, May 22nd, at 19:30 UTC
> (timezone conversion
> <https://www.timeanddate.com/worldclock/converter.html?iso=20240522T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>).
>
> The agenda follows.
>
> * Fraser to report on the May 3rd Text Terminal WG meeting.
> * Review results of the 2024 C++ Developer Survey.
> * P2626R0: charN_t incremental adoption: Casting pointers of UTF
> character types <https://wg21.link/p2626r0>.
>
> The results of the 2024 C++ Developer Survey were recently posted
> (summary results
> <https://wiki.edg.com/pub/Wg21tokyo2024/Documents/CppDevSurvey-2024-summary.pdf>,
> detailed results
> <https://wiki.edg.com/pub/Wg21tokyo2024/Documents/CppDevSurvey-2024-writeins.pdf>).
> Question 6, "Which of these do you find frustrating about C++
> development?", included a new response category this year, "Unicode,
> internationalization, and localization". Of the 17 categories, this
> one ranked 12th. The responses broke down as follows:
>
> * _Major pain point_: 16.56%, 205 respondents.
> * _Minor pain point_: 29.32%, 363 respondents.
> * _Not a significant issue_: 54.12%, 670 respondents.
>
> Approximately 46% of respondents claimed this category as a pain
> point. Not that we weren't already aware, but we clearly have work to do.
>
> I audited the write-in responses that mentioned SG16-related
> terminology (Unicode, character, encoding, UTF, charN_t, text). The
> relevant comments follow; the portions in bold are comments with
> realistic and clearly actionable complaints, requests, or suggestions.
>
> * Unicode seems to be progressing nicely.
> * Unicode support. It should be standard in C++11, let alone
> <current year>.
> * Lack of Unicode - *no (clean and efficient) std functionality for
> converting from UTF-8 (char[8_t])/UTF-16 (char16_t)/UTF-32
> (char32_t) to any of the other types* *Lack of other character
> support: - like [to/from]_chars only supporting char, not char16_t
> or char32_t* even though it is based on implementations that do
> support all char types (at least the from_chars).
> * Some of the features I expected have not come out (network,
> Unicode support and so on) whereas it's already part of other
> languages standard library
> * Removed unicode support
> * Valuable committee time is wasted in discussing such facilities
> while there is STILL no reasonable Unicode support (we are talking
> about text, simple text!).
> * Basic things like networking and unicode helpers are not present
> in standard library.
> * I guess first class utf8/unicode support not improving as fast as
> I'd like it to. In 2024 it's still not very easy to write
> Unicode-aware apps which seamlessly deal with encodings,
> conversions between encodings - none of this is a first-class
> citizen of the language.
> * Internationalization and unicode support.
> * Better Unicode support in STL.
> * Full Unicode support.
> * Unicode friendly stl support for localization.
> * Proper Unicode support. In MS Windows development, virtually all
> user input is UTF-16LE in the form of wchar_t and variants. I
> convert that to UTF-8 via wrapper functions that use third-party
> Unicode libraries (uni-algo in my case) that (can) use
> std::string. *Things that should be simple but aren't in Unicode,
> like case conversion and case-insensitive comparison, should be
> provided for.* This would reduce the pain point of third-party
> libraries.
> * Unicode very important.
> * I would change the way characters and strings are represented. The
> Rust model is so much better. In practice, that means the
> character type is not integral, there are no null terminators, and
> everything is UTF-8 by default.
> * STL: missing basic components (filesystem / network / UTF-8
> encodings), not specified implementation of e.g. std::string (e.g.
> Implicit Sharing).
> * std::text_encoding.
> * utf8 support across platforms.
> * char8_t and breaking change of u8"" string literals I've been
> using a relaxed variant of the "UTF8 everywhere" manifesto in my
> Windows app with zero problems for over a decade, so std::string
> rules the roost for UTF-8 with me. C++20 char8_t and breaking u8""
> behavior gets in the way. Need to use non-portable techniques of
> naked UTF-8 string literals via MSVC /utf-8 option.
> * make utf8 the one and only type of string in the entire universe!
> * remove wide strings since they are not wide enough on some
> platforms and just use std::strings as utf8.
> * reconsider cases where std:: can do things, but it's a horrible
> mess (like ASCII->UTF-8) to be more development and readability
> focussed. deprecate std::Xstream << operator overloading - it's
> horribly unreasonable for young devs to learn about operator
> overloading in their hello world apps, and there's 1000 more
> things wrong with those streams...
> * Utf8 std::string.
> * Ditching char8_t.
> * Whole char8_t fiasco (introduction of this type is a mistake).
>
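
One of the write-ins above points out that std::from_chars and std::to_chars accept only char. As a rough illustration of the workaround this implies today, the sketch below parses a decimal integer from UTF-16 text by narrowing the code units to char first. The helper name parse_int_utf16 is hypothetical (it is not a standard or proposed API), and the narrowing step assumes an ASCII-compatible ordinary literal encoding.

```cpp
#include <charconv>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>

// Hypothetical helper, not a standard or proposed API: parse a decimal int
// from UTF-16 text by narrowing the code units to char before calling
// std::from_chars. Assumes an ASCII-compatible ordinary literal encoding.
std::optional<int> parse_int_utf16(std::u16string_view text) {
    std::string narrow;
    narrow.reserve(text.size());
    for (char16_t cu : text) {
        if (cu > 0x7F) {
            return std::nullopt;  // non-ASCII code unit: reject the input
        }
        narrow.push_back(static_cast<char>(cu));
    }

    int value = 0;
    const char* first = narrow.data();
    const char* last = first + narrow.size();
    const auto result = std::from_chars(first, last, value);
    if (result.ec != std::errc{} || result.ptr != last) {
        return std::nullopt;  // no digits, out of range, or trailing text
    }
    return value;
}
```

The extra copy and the ASCII check are exactly the kind of boilerplate that direct charN_t overloads would make unnecessary.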
> Most of these comments aren't particularly actionable; what exactly
> does providing "better" or "full" Unicode support entail? Others, like
> those related to UTF-8-ing all the things, aren't feasible. My
> interpretation of the above is that we can make concrete improvements
> by doing the following:
>
> 1. Add support for encoding conversions.
> 2. Add support for charN_t in std::from_chars() and std::to_chars().
> 3. Add support for Unicode-aware case conversions and
> case-insensitive comparisons.
>
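
To make item 1 concrete, here is a minimal hand-written sketch of one conversion direction, UTF-32 to UTF-8. The function name utf32_to_utf8 and the choice to substitute U+FFFD for surrogates and out-of-range values are assumptions made for illustration; nothing here reflects a proposed standard interface.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper, not a standard or proposed API: encode UTF-32 code
// points as UTF-8 code units stored in a std::string.
std::string utf32_to_utf8(std::u32string_view in) {
    std::string out;
    for (char32_t cp : in) {
        // Assumption: invalid scalar values are replaced with U+FFFD.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            // One code unit: 0xxxxxxx
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            // Two code units: 110xxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            // Three code units: 1110xxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            // Four code units: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

Even this one direction needs explicit error policy and range checks, which is part of why the survey respondents ask for a standard facility.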
> Much of the following is copy/paste from the agenda sent for the
> 2024-01-10 SG16 meeting
> <https://lists.isocpp.org/sg16/2024/01/4080.php> where I had planned
> for us to discuss P2626R0 but we then didn't due to time constraints.
>
> P2626R0 <https://wg21.link/p2626r0> was last discussed during the
> 2022-08-24 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2022.md#august-24th-2022>.
> A few requests made during and since that meeting have yet to be
> addressed in a new revision:
>
> * Victor requested that the paper be updated to explicitly state
> early in the paper what properties of the types must match for the
> operations to be well-formed.
> * Jens asked if the paper includes examples that are reflective of
> how this facility would be used in something like real world code.
> (I'm interpreting this as a request for such examples; the
> examples in the "Tony table" section of the paper are minimal.)
> * Tom requested that the paper include an example of changes that
> might be made to ICU to use the proposed facilities. E.g., how
> U_ALIASING_BARRIER
> <https://github.com/unicode-org/icu/blob/0ef4da943c1cfc694e84fcb85cee5c78bae89d71/icu4c/source/common/unicode/char16ptr.h#L30-L36>
> and its uses would be changed.
>
> See this SG16 email thread with subject "An alternative interface for
> P2626R0 ..." <https://lists.isocpp.org/sg16/2022/09/3389.php> from
> September, 2022 for some alternative considerations.
>
> There are two primary design questions that I would like to see us
> make progress on.
>
> 1. How is the duration of access by one type versus another managed,
> and how should it be managed?
> 2. Should the ability to cast between underlying types be decoupled
> from the UTF concerns?
>
> Consider the following example that illustrates a potentially
> desirable use case: providing a char8_t-based wrapper around an
> existing function that processes UTF-8 text in char-based storage. The
> use of cast_as_utf_unchecked(), according to the proposed wording,
> ends the lifetime of the range of code units in the text array and
> returns a pointer to a new set of objects constructed in their place
> (with object representation preserved). Following that cast, access to
> the array and its elements must be performed via the returned pointer
> and access via the original object becomes UB.
>
>     // Existing function that processes UTF-8 text in char-based storage.
>     void process_as_utf8(char *p, size_t N);
>     // char8_t-based wrapper around the function above.
>     inline void process_as_utf8(char8_t *p, size_t N) {
>       process_as_utf8(cast_as_utf_unchecked(p, N), N);
>     }
>     void f() {
>       char8_t text[] = u8"Zoom";
>       process_as_utf8(text, sizeof(text));
>       CHECK(text[0] == u8'B'); // UB.
>     }
>
> The paper does not propose an explicit "undo" operation, so it is
> unclear (at least to me) how valid access through the original object
> declaration can be restored. Perhaps the intent is that programmers do
> something like the following to undo a previous cast operation?
>
>     inline void process_as_utf8(char8_t *p, size_t N) {
>       char *p_as_char = cast_as_utf_unchecked(p, N);
>       process_as_utf8(p_as_char, N);
>       // Cast back so that access through the original type is valid again.
>       (void)cast_utf_to<char>(p_as_char, N);
>     }
>
> What (I think) is missing is any connection to the original
> declaration of text; I am uncertain that the transparently replaceable
> rules ([basic.life]p8 <http://eel.is/c++draft/basic.life#8>) suffice
> to cover this situation. I am concerned about how TBAA is preserved
> and at what point modifications made via one type are reflected for
> aliasing purposes by the other type; consider the case of the char
> overload of process_as_utf8() mutating the string with p[0] = 'B' as
> the CHECK() operation expects. We may have to seek guidance from CWG
> for these concerns.
>
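
For comparison, a copy-based wrapper sidesteps the lifetime and TBAA questions entirely, at the cost of two copies: no object is ever accessed through the other type, so mutations are visible at a well-defined point (the copy back). The sketch below is an assumption-laden illustration, not the paper's approach. It uses char16_t and std::uint16_t rather than char8_t and char so that it builds as C++17, and both legacy_upper_ascii and upper_ascii are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical legacy API: uppercases ASCII letters in place in UTF-16 text
// that is stored as std::uint16_t code units.
void legacy_upper_ascii(std::uint16_t* p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (p[i] >= u'a' && p[i] <= u'z') {
            p[i] = static_cast<std::uint16_t>(p[i] - 0x20);
        }
    }
}

// Hypothetical char16_t-based wrapper. It copies the code units into
// std::uint16_t storage, calls the legacy function, and copies them back,
// so neither buffer is accessed through the other buffer's element type.
void upper_ascii(char16_t* p, std::size_t n) {
    static_assert(sizeof(char16_t) == sizeof(std::uint16_t),
                  "code unit types must have the same size");
    if (n == 0) {
        return;
    }
    std::vector<std::uint16_t> tmp(n);
    std::memcpy(tmp.data(), p, n * sizeof(char16_t));
    legacy_upper_ascii(tmp.data(), n);
    std::memcpy(p, tmp.data(), n * sizeof(char16_t));
}
```

The same shape applies to char8_t and char under C++20. The point of a cast facility is to get this effect without the temporary buffer, which is where the question of when the cast's effects begin and end becomes unavoidable.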
> Use of these utilities in real world use cases will, I think, require
> that the duration of their effects be precisely specified. Since these
> utilities are intended to be used in constant evaluation,
> implementations will be required to diagnose UB in cases like the
> above (during constant evaluation). As is, examples similar to the one
> above can demonstrate surprising results since, I think, there is no
> defined point at which mutations are committed.
> https://godbolt.org/z/MGbfWKWb7 (unfortunately, that fork of Clang is
> broken at the moment).
>
> There are use cases for a facility like the one that is proposed to
> enable access to an object via an underlying type relationship. For
> example, to load/store an object of enumeration type via an underlying
> integer type. Decoupling the cast capabilities from the UTF concerns
> would enable additional use cases. Given the existence of functions
> that have a wide contract with respect to well-formed UTF input, is it
> desirable for the cast facility to be concerned with encoding matters
> at all? Does providing two cast operations (cast_as_utf, cast_utf_to)
> help to prevent programming mistakes?
>
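
As a point of reference for the enumeration use case, the sketch below shows what can be done today without any new facility: value conversion through static_cast for single objects, and std::memcpy for arrays. The Color enumeration and the helper names store_enum, load_enum, and copy_colors are hypothetical. Neither technique gives in-place access to an enumeration object through its underlying type, which is the gap a decoupled cast facility would fill.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical enumeration used only for illustration.
enum class Color : std::uint8_t { Red = 1, Green = 2, Blue = 3 };

// Store an enumerator into storage of its underlying integer type
// by value conversion.
template <class E>
void store_enum(E value, std::underlying_type_t<E>& slot) {
    slot = static_cast<std::underlying_type_t<E>>(value);
}

// Load an enumerator back from storage of its underlying integer type
// by value conversion.
template <class E>
E load_enum(const std::underlying_type_t<E>& slot) {
    return static_cast<E>(slot);
}

// Copy an array of enumerators into underlying-type storage bytewise;
// neither array is accessed through the other's element type.
void copy_colors(const Color* src, std::uint8_t* dst, std::size_t n) {
    static_assert(sizeof(Color) == sizeof(std::uint8_t),
                  "enumeration and underlying type must have the same size");
    if (n != 0) {
        std::memcpy(dst, src, n * sizeof(Color));
    }
}
```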
> Tom.
>
>

Received on 2024-05-22 16:54:37