ISOCPP sg16 List: Re: [isocpp-sg16] Agenda for the 2024-05-22 SG16 meeting

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Mon, 20 May 2024 10:37:16 +0200

On Sat, May 18, 2024 at 7:12 PM Tom Honermann via SG16 <
sg16_at_[hidden]> wrote:

> SG16 will hold a meeting on Wednesday, May 22nd, at 19:30 UTC (timezone
> conversion
> <https://www.timeanddate.com/worldclock/converter.html?iso=20240522T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>
> ).
>
> The agenda follows.
>
> - Fraser to report on the May 3rd Text Terminal WG meeting.
> - Review results of the 2024 C++ Developer Survey.
> - P2626R0: charN_t incremental adoption: Casting pointers of UTF
> character types <https://wg21.link/p2626r0>.
>
> The results of the 2024 C++ Developer Survey were recently posted (summary
> results
> <https://wiki.edg.com/pub/Wg21tokyo2024/Documents/CppDevSurvey-2024-summary.pdf>,
> detailed results
> <https://wiki.edg.com/pub/Wg21tokyo2024/Documents/CppDevSurvey-2024-writeins.pdf>).
> Question 6, "Which of these do you find frustrating about C++
> development?", included a new response category this year, "Unicode,
> internationalization, and localization". Of the 17 categories, this one
> ranked 12th. The responses broke down as follows:
>
> - *Major pain point*
> 16.56%, 205 respondents.
> - *Minor pain point*
> 29.32%, 363 respondents.
> - *Not a significant issue*
> 54.12%, 670 respondents.
>
> Approximately 46% of respondents claimed this category as a pain point.
> Not that we weren't already aware, but we clearly have work to do.
>
> I audited the write-in responses that mentioned SG16-related terminology
> (Unicode, character, encoding, UTF, char*N*_t, text). The relevant
> comments follow; the portions in bold are comments with realistic and
> clearly actionable complaints, requests, or suggestions.
>
> - Unicode seems to be progressing nicely.
> - Unicode support. It should be standard in C++11, let alone <current
> year>.
> - Lack of Unicode - *no (clean and efficient) std functionality for
> converting from UTF-8 (char[8_t])/UTF-16 (char16_t)/UTF-32 (char32_t) to
> any of the other types* *Lack of other character support: - like
> [to/from]_chars only supporting char, not char16_t or char32_t* even
> though it is based on implementations that do support all char types (at
> least the from_chars).
> - Some of the features I expected have not come out (network, Unicode
> support and so on) whereas it's already part of other languages standard
> library
> - Removed unicode support
> - Valuable committee time is wasted in discussing such facilities
> while there is STILL no reasonable Unicode support (we are talking about
> text, simple text!).
> - Basic things like networking and unicode helpers are not present in
> standard library.
> - I guess first class utf8/unicode support not improving as fast as
> I'd like it to. In 2024 it's still not very easy to write Unicode-aware
> apps which seamlessly deal with encodings, conversions between encodings -
> none of this is a first-class citizen of the language.
> - Internationalization and unicode support.
> - Better Unicode support in STL.
> - Full Unicode support.
> - Unicode friendly stl support for localization.
> - Proper Unicode support. In MS Windows development, virtually all
> user input is UTF-16LE in the form of wchar_t and variants. I convert that
> to UTF-8 via wrapper functions that use third-party Unicode libraries
> (uni-algo in my case) that (can) use std::string. *Things that should
> be simple but aren't in Unicode, like case conversion and case-insensitive
> comparison, should be provided for.* This would reduce the pain point
> of third-party libraries.
> - Unicode very important.
> - I would change the way characters and strings are represented. The
> Rust model is so much better. In practice, that means the character type is
> not integral, there are no null terminators, and everything is UTF-8 by
> default.
> - STL: missing basic components (filesystem / network / UTF-8
> encodings), not specified implementation of e.g. std::string (e.g. Implicit
> Sharing).
> - std::text_encoding.
> - utf8 support across platforms.
> - char8_t and breaking change of u8"" string literals I've been using
> a relaxed variant of the "UTF8 everywhere" manifesto in my Windows app with
> zero problems for over a decade, so std::string rules the roost for UTF-8
> with me. C++20 char8_t and breaking u8"" behavior gets in the way. Need to
> use non-portable techniques of naked UTF-8 string literals via MSVC /utf-8
> option.
> - make utf8 the one and only type of string in the entire universe!
> - remove wide strings since they are not wide enough on some
> platforms and just use std::strings as utf8.
> - reconsider cases where std:: can do things, but it's a horrible mess
> (like ASCII->UTF-8) to be more development and readability focussed.
> deprecate std::Xstream << operator overloading - it's horribly unreasonable
> for young devs to learn about operator overloading in their hello world
> apps, and there's 1000 more things wrong with those streams...
> - Utf8 std::string.
> - Ditching char8_t.
> - Whole char8_t fiasco (introduction of this type is a mistake).
>
> Most of these comments aren't particularly actionable; what exactly does
> providing "better" or "full" Unicode support entail? Others, like those
> related to UTF-8-ing all the things, aren't feasible. My interpretation of
> the above is that we can make concrete improvements by doing the following:
>
> 1. Add support for encoding conversions.
> 2. Add support for char*N*_t in std::from_chars() and std::to_chars().
> 3. Add support for Unicode-aware case conversions and case-insensitive
> comparisons.
>
> Much of the following is copy/paste from the agenda sent for the
> 2024-01-10 SG16 meeting <https://lists.isocpp.org/sg16/2024/01/4080.php>
> where I had planned for us to discuss P2626R0 but we then didn't due to
> time constraints.
>
> P2626R0 <https://wg21.link/p2626r0> was last discussed during the 2022-08-24
> SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2022.md#august-24th-2022>.
> A few requests made during that meeting and since then have yet to be
> addressed in a new revision:
>
> - Victor requested that the paper be updated to explicitly state early
> in the paper what properties of the types must match for the operations to
> be well-formed.
> - Jens asked if the paper includes examples that are reflective of how
> this facility would be used in something like real world code.
> (I'm interpreting this as a request for such examples; the examples in
> the "Tony table" section of the paper are minimal)
> - Tom requested that the paper include an example of changes that
> might be made to ICU to use the proposed facilities. E.g., how
> U_ALIASING_BARRIER
> <https://github.com/unicode-org/icu/blob/0ef4da943c1cfc694e84fcb85cee5c78bae89d71/icu4c/source/common/unicode/char16ptr.h#L30-L36>
> and its uses would be changed.
>
> See this SG16 email thread with subject "An alternative interface for
> P2626R0 ..." <https://lists.isocpp.org/sg16/2022/09/3389.php> from
> September, 2022 for some alternative considerations.
>
> There are two primary design questions that I would like to see us make
> progress on.
>
> 1. How is the duration of access by one type vs. another managed, and
> how should it be?
> 2. Should the ability to cast between underlying types be decoupled
> from the UTF concerns?
>
> Consider the following example that illustrates a potentially desirable
> use case: providing a char8_t-based wrapper around an existing function
> that processes UTF-8 text in char-based storage. The use of
> cast_as_utf_unchecked(), according to the proposed wording, ends the
> lifetime of the range of code units in the text array and returns a
> pointer to a new set of objects constructed in their place (with object
> representation preserved). Following that cast, access to the array and its
> elements must be performed via the returned pointer and access via the
> original object becomes UB.
>
> void process_as_utf8(char *p, size_t N);
> inline void process_as_utf8(char8_t *p, size_t N) {
>   process_as_utf8(cast_as_utf_unchecked(p, N), N);
> }
> void f() {
>   char8_t text[] = u8"Zoom";
>   process_as_utf8(text, sizeof(text));
>   CHECK(text[0] == u8'B'); // UB: access via the original object after the cast.
> }
>
> The paper does not propose an explicit "undo" operation, so it is unclear
> (at least to me) how valid access through the original object declaration
> can be restored. Perhaps the intent is that programmers do something like
> the following to undo a previous cast operation?
>
> inline void process_as_utf8(char8_t *p, size_t N) {
>   char *p_as_char = cast_as_utf_unchecked(p, N);
>   process_as_utf8(p_as_char, N);
>   (void)cast_utf_to<char>(p_as_char, N); // Attempt to undo the cast; result discarded.
> }
>
> What (I think) is missing is any connection to the original declaration of
> text; I am uncertain that the transparently replaceable rules (
> [basic.life]p8 <http://eel.is/c++draft/basic.life#8>) suffice to cover
> this situation. I am concerned about how TBAA is preserved and at what
> point modifications made via one type are reflected for aliasing purposes
> by the other type; consider the case of the char overload of
> process_as_utf8() mutating the string with p[0] = 'B' as the CHECK()
> operation expects. We may have to seek guidance from CWG for these concerns.
>
> Use of these utilities in real world use cases will, I think, require that
> the duration of their effects be precisely specified. Since these utilities
> are intended to be used in constant evaluation, implementations will be
> required to diagnose UB in cases like the above (during constant
> evaluation). As is, examples similar to the one above can demonstrate
> surprising results since, I think, there is no defined point at which
> mutations are committed. https://godbolt.org/z/MGbfWKWb7 (unfortunately,
> that fork of Clang is broken at the moment).
>

I haven't looked at your code, but I fixed the crash; it was a borked merge.

> There are use cases for a facility like the one that is proposed to enable
> access to an object via an underlying type relationship. For example, to
> load/store an object of enumeration type via an underlying integer type.
> Decoupling the cast capabilities from the UTF concerns would enable
> additional use cases. Given the existence of functions that have a wide
> contract with respect to well-formed UTF input, is it desirable for the
> cast facility to be concerned with encoding matters at all?
>
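
To picture the non-UTF use case mentioned above: viewing a sequence of enumerators as values of their underlying integer type is, today, most safely done by copying. The sketch below is an illustration only; Color and to_underlying_values are hypothetical names and do not come from any proposal.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

enum class Color : std::uint8_t { Red = 1, Green = 2, Blue = 3 };

// Copy an array of enumerators into an array of the underlying integer type.
template <std::size_t N>
std::array<std::underlying_type_t<Color>, N>
to_underlying_values(const std::array<Color, N>& in) {
    using U = std::underlying_type_t<Color>;
    static_assert(sizeof(Color) == sizeof(U));
    std::array<U, N> out{};
    if constexpr (N > 0) {
        std::memcpy(out.data(), in.data(), N * sizeof(Color));
    }
    return out;
}

void enum_example() {
    const std::array<Color, 3> colors{Color::Red, Color::Green, Color::Blue};
    const auto raw = to_underlying_values(colors);
    assert(raw[0] == 1 && raw[1] == 2 && raw[2] == 3);
}
```

Nothing here depends on encodings, which is the point of the decoupling question: the same underlying-type relationship exists between char8_t and unsigned char.
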

Before we all repeat our arguments, can I suggest that we first figure out
with Core the object model semantics and the constraints that we can work
within? As far as I know, these constraints don't allow much more safety
than an API along the lines of what I am proposing.

> Does providing two cast operations (cast_as_utf, cast_utf_to) help to
> prevent programming mistakes?

It is certainly the goal. We should be careful not to punch too hard
through the strong types we tried to create, otherwise it's a bit
self-defeating.
Note that I am proposing cast_as_utf_unchecked in the hope of having a
cast_as_utf with preconditions later (with the benefit of a scary name).
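
As a rough illustration of what such a precondition might check, here is a minimal well-formedness test for UTF-8 held in char storage. It is a sketch under assumptions: the name is_well_formed_utf8 is hypothetical, and P2626R0 does not specify that a future cast_as_utf would validate its input this way.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical precondition check: true if s is well-formed UTF-8
// (no stray or missing continuation bytes, no overlong forms,
// no surrogates, nothing above U+10FFFF).
bool is_well_formed_utf8(std::string_view s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 < 0x80) { ++i; continue; }  // ASCII
        std::size_t len = 0;
        char32_t cp = 0;
        if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;                 // stray continuation or invalid lead byte
        if (n - i < len) return false;     // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (len == 2 && cp < 0x80) return false;        // overlong
        if (len == 3 && cp < 0x800) return false;       // overlong
        if (len == 4 && cp < 0x10000) return false;     // overlong
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate
        if (cp > 0x10FFFF) return false;                // beyond Unicode range
        i += len;
    }
    return true;
}

void utf8_check_example() {
    assert(is_well_formed_utf8("Zoom"));
    assert(!is_well_formed_utf8("\xC0\x80"));  // overlong encoding of U+0000
}
```

A check like this is linear in the input length, which is one reason to keep an unchecked variant with a deliberately alarming name alongside any checked one.
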

Two assumptions I keep making are: 1/ users are confused about encodings in
C++ and by UTF-8; 2/ correctness is a secondary concern to "getting things
done" in a lot of situations.
This explosive combination calls for some safeguards, even if the safeguard
is effectively a sticker on the blade of a chainsaw.

> Tom.
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2024-05-20 08:37:37