ISOCPP sg16 List: Re: [isocpp-sg16] Agenda for the 2024-05-22 SG16 meeting

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 21 May 2024 13:36:38 -0400

On 5/20/24 4:37 AM, Corentin Jabot wrote:
>
>
>
> On Sat, May 18, 2024 at 7:12 PM Tom Honermann via SG16
> <sg16_at_[hidden]> wrote:
>
> SG16 will hold a meeting on Wednesday, May 22nd, at 19:30 UTC
> (timezone conversion
> <https://www.timeanddate.com/worldclock/converter.html?iso=20240522T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>).
>
> The agenda follows.
>
> * Fraser to report on the May 3rd Text Terminal WG meeting.
> * Review results of the 2024 C++ Developer Survey.
> * P2626R0: charN_t incremental adoption: Casting pointers of UTF
> character types <https://wg21.link/p2626r0>.
>
> The results of the 2024 C++ Developer Survey were recently posted
> (summary results
> <https://wiki.edg.com/pub/Wg21tokyo2024/Documents/CppDevSurvey-2024-summary.pdf>,
> detailed results
> <https://wiki.edg.com/pub/Wg21tokyo2024/Documents/CppDevSurvey-2024-writeins.pdf>).
> Question 6, "Which of these do you find frustrating about C++
> development?", included a new response category this year,
> "Unicode, internationalization, and localization". Of the 17
> categories, this one ranked 12th. The responses broke down as follows:
>
> * _Major pain point_: 16.56%, 205 respondents.
> * _Minor pain point_: 29.32%, 363 respondents.
> * _Not a significant issue_: 54.12%, 670 respondents.
>
> Approximately 46% of respondents claimed this category as a pain
> point. Not that we weren't already aware, but we clearly have work
> to do.
>
> I audited the write in responses that mentioned SG16 related
> terminology (Unicode, character, encoding, UTF, char/N/_t, text).
> The relevant comments follow; the portions in bold are comments
> with realistic and clearly actionable complaints, requests, or
> suggestions.
>
> * Unicode seems to be progressing nicely.
> * Unicode support. It should be standard in C++11, let alone
> <current year>.
> * Lack of Unicode - *no (clean and efficient) std functionality
> for converting from UTF-8 (char[8_t])/UTF-16 (char16_t)/UTF-32
> (char32_t) to any of the other types* *Lack of other character
> support: - like [to/from]_chars only supporting char, not
> char16_t or char32_t* even though it is based on
> implementations that do support all char types (at least the
> from_chars).
> * Some of the features I expected have not come out (network,
> Unicode support and so on) whereas it's already part of other
> languages standard library
> * Removed unicode support
> * Valuable committee time is wasted in discussing such
> facilities while there is STILL no reasonable Unicode support
> (we are talking about text, simple text!).
> * Basic things like networking and unicode helpers are not
> present in standard library.
> * I guess first class utf8/unicode support not improving as fast
> as I'd like it to. In 2024 it's still not very easy to write
> Unicode-aware apps which seamlessly deal with encodings,
> conversions between encodings - none of this is a first-class
> citizen of the language.
> * Internationalization and unicode support.
> * Better Unicode support in STL.
> * Full Unicode support.
> * Unicode friendly stl support for localization.
> * Proper Unicode support. In MS Windows development, virtually
> all user input is UTF-16LE in the form of wchar_t and
> variants. I convert that to UTF-8 via wrapper functions that
> use third-party Unicode libraries (uni-algo in my case) that
> (can) use std::string. *Things that should be simple but
> aren't in Unicode, like case conversion and case-insensitive
> comparison, should be provided for.* This would reduce the
> pain point of third-party libraries.
> * Unicode very important.
> * I would change the way characters and strings are represented.
> The Rust model is so much better. In practice, that means the
> character type is not integral, there are no null terminators,
> and everything is UTF-8 by default.
> * STL: missing basic components (filesystem / network / UTF-8
> encodings), not specified implementation of e.g. std::string
> (e.g. Implicit Sharing).
> * std::text_encoding.
> * utf8 support across platforms.
> * char8_t and breaking change of u8"" string literals I've been
> using a relaxed variant of the "UTF8 everywhere" manifesto in
> my Windows app with zero problems for over a decade, so
> std::string rules the roost for UTF-8 with me. C++20 char8_t
> and breaking u8"" behavior gets in the way. Need to use
> non-portable techniques of naked UTF-8 string literals via
> MSVC /utf-8 option.
> * make utf8 the one and only type of string in the entire universe!
> * remove wide strings since they are not wide enough on some
> platforms and just use std::strings as utf8.
> * reconsider cases where std:: can do things, but it's a
> horrible mess (like ASCII->UTF-8) to be more development and
> readability focussed. deprecate std::Xstream << operator
> overloading - it's horribly unreasonable for young devs to
> learn about operator overloading in their hello world apps,
> and there's 1000 more things wrong with those streams...
> * Utf8 std::string.
> * Ditching char8_t.
> * Whole char8_t fiasco (introduction of this type is a mistake).
>
> Most of these comments aren't particularly actionable; what
> exactly does providing "better" or "full" Unicode support entail?
> Others, like those related to UTF-8-ing all the things, aren't
> feasible. My interpretation of the above is that we can make
> concrete improvements by doing the following:
>
> 1. Add support for encoding conversions.
> 2. Add support for char/N/_t in std::from_chars() and
> std::to_chars().
> 3. Add support for Unicode-aware case conversions and
> case-insensitive comparisons.
>
> Much of the following is copy/paste from the agenda sent for the
> 2024-01-10 SG16 meeting
> <https://lists.isocpp.org/sg16/2024/01/4080.php> where I had
> planned for us to discuss P2626R0 but we then didn't due to time
> constraints.
>
> P2626R0 <https://wg21.link/p2626r0> was last discussed during the
> 2022-08-24 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2022.md#august-24th-2022>.
> A few requests made during that meeting and since have yet to be
> addressed in a new revision:
>
> * Victor requested that the paper be updated to explicitly state
> early in the paper what properties of the types must match for
> the operations to be well-formed.
> * Jens asked if the paper includes examples that are reflective
> of how this facility would be used in something like real
> world code.
> (I'm interpreting this as a request for such examples; the
> examples in the "Tony table" section of the paper are minimal)
> * Tom requested that the paper include an example of changes
> that might be made to ICU to use the proposed facilities.
> E.g., how U_ALIASING_BARRIER
> <https://github.com/unicode-org/icu/blob/0ef4da943c1cfc694e84fcb85cee5c78bae89d71/icu4c/source/common/unicode/char16ptr.h#L30-L36>
> and its uses would be changed.
>
> See this SG16 email thread with subject "An alternative interface
> for P2626R0 ..." <https://lists.isocpp.org/sg16/2022/09/3389.php>
> from September, 2022 for some alternative considerations.
>
> There are two primary design questions that I would like to see us
> make progress on.
>
> 1. How is the duration of access by one type vs. another managed,
> or how should it be managed?
> 2. Should the ability to cast between underlying types be
> decoupled from the UTF concerns?
>
> Consider the following example that illustrates a potentially
> desirable use case; to provide a char8_t-based wrapper around an
> existing function that processes UTF-8 text in char-based storage.
> The use of cast_as_utf_unchecked(), according to the proposed
> wording, ends the lifetime of the range of code units in the text
> array and returns a pointer to a new set of objects constructed in
> their place (with object representation preserved). Following that
> cast, access to the array and its elements must be performed via
> the returned pointer and access via the original object becomes UB.
>
> void process_as_utf8(char *p, size_t N);
> inline void process_as_utf8(char8_t *p, size_t N) {
>     process_as_utf8(cast_as_utf_unchecked(p, N), N);
> }
> void f() {
>     char8_t text[] = u8"Zoom";
>     process_as_utf8(text, sizeof(text));
>     CHECK(text[0] == u8'B'); // UB.
> }
>
> The paper does not propose an explicit "undo" operation, so it is
> unclear (at least to me) how valid access through the original
> object declaration can be restored. Perhaps the intent is that
> programmers do something like the following to undo a previous
> cast operation?
>
> inline void process_as_utf8(char8_t *p, size_t N) {
>     char *p_as_char = cast_as_utf_unchecked(p, N);
>     process_as_utf8(p_as_char, N);
>     (void)cast_utf_to<char>(p_as_char, N);
> }
>
> What (I think) is missing is any connection to the original
> declaration of text; I am uncertain that the transparently
> replaceable rules ([basic.life]p8
> <http://eel.is/c++draft/basic.life#8>) suffice to cover this
> situation. I am concerned about how TBAA is preserved and at what
> point modifications made via one type are reflected for aliasing
> purposes by the other type; consider the case of the char overload
> of process_as_utf8() mutating the string with p[0] = 'B' as the
> CHECK() operation expects. We may have to seek guidance from CWG
> for these concerns.
>
> Use of these utilities in real world use cases will, I think,
> require that the duration of their effects be precisely specified.
> Since these utilities are intended to be used in constant
> evaluation, implementations will be required to diagnose UB in
> cases like the above (during constant evaluation). As is, examples
> similar to the one above can demonstrate surprising results since,
> I think, there is no defined point at which mutations are
> commuted. https://godbolt.org/z/MGbfWKWb7 (unfortunately, that
> fork of Clang is broken at the moment).
>
>
> I haven't looked at your code, but I fixed the crash (a borked merge).

Awesome, thank you!

You might prefer to look at this tweak of the code linked above:
https://godbolt.org/z/9Tejj9TPs. This includes an attempt to "undo" the
cast, but demonstrates that doing so doesn't suffice to avoid the
assertion failure.

> There are use cases for a facility like the one that is proposed
> to enable access to an object via an underlying type relationship.
> For example, to load/store an object of enumeration type via an
> underlying integer type. Decoupling the cast capabilities from the
> UTF concerns would enable additional use cases. Given the
> existence of functions that have a wide contract with respect to
> well-formed UTF input, is it desirable for the cast facility to be
> concerned with encoding matters at all?
>
>
> Before we all repeat our arguments, can I suggest that we
> first figure out with Core the object model semantics and the
> constraints that we can work within (and AFAIK these constraints don't
> allow a lot more safety than an API along the lines of what I am
> proposing)?

Yes, absolutely.

Tom.

>
> > Does providing two cast operations (cast_as_utf, cast_utf_to) help
> > to prevent programming mistakes?
>
> It is certainly the goal. We should be careful not to punch too hard
> through the strong types we tried to create, otherwise it's a bit
> self-defeating.
> Note that I am proposing cast_as_utf_unchecked in the hope of having a
> cast_as_utf with preconditions later (with the benefit of a scary name).
>
> Two assumptions I keep making are: 1/ users are confused about
> encodings in C++ and by UTF-8; 2/ correctness is a secondary concern
> to "getting things done" in a lot of situations.
> This explosive combination calls for some safeguards even if the
> safeguard is effectively a sticker on the blade of a chainsaw.
>
> Tom.
>

Received on 2024-05-21 17:36:44