On Sat, May 18, 2024 at 7:12 PM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:
SG16 will hold a meeting on Wednesday, May 22nd, at 19:30 UTC (timezone conversion).
The agenda follows.
- Fraser to report on the May 3rd Text Terminal WG meeting.
- Review results of the 2024 C++ Developer Survey.
- P2626R0: charN_t incremental adoption: Casting pointers of UTF character types.
The results of the 2024 C++ Developer Survey were recently posted (summary results, detailed results). Question 6, "Which of these do you find frustrating about C++ development?", included a new response category this year, "Unicode, internationalization, and localization". Of the 17 categories, this one ranked 12th. The responses broke down as follows:
- Major pain point
16.56%, 205 respondents.
- Minor pain point
29.32%, 363 respondents.
- Not a significant issue
54.12%, 670 respondents.
Approximately 46% of respondents claimed this category as a pain point. Not that we weren't already aware, but we clearly have work to do.
I audited the write-in responses that mentioned SG16-related terminology (Unicode, character, encoding, UTF, charN_t, text). The relevant comments follow; the portions in bold are comments with realistic, clear, and actionable complaints, requests, or suggestions.
- Unicode seems to be progressing nicely.
- Unicode support. It should be standard in C++11, let alone <current year>.
- Lack of Unicode - no (clean and efficient) std functionality for converting from UTF-8 (char[8_t])/UTF-16 (char16_t)/UTF-32 (char32_t) to any of the other types Lack of other character support: - like [to/from]_chars only supporting char, not char16_t or char32_t even though it is based on implementations that do support all char types (at least the from_chars).
- Some of the features I expected have not come out (network, Unicode support and so on) whereas it's already part of other languages standard library
- Removed unicode support
- Valuable committee time is wasted in discussing such facilities while there is STILL no reasonable Unicode support (we are talking about text, simple text!).
- Basic things like networking and unicode helpers are not present in standard library.
- I guess first class utf8/unicode support not improving as fast as I'd like it to. In 2024 it's still not very easy to write Unicode-aware apps which seamlessly deal with encodings, conversions between encodings - none of this is a first-class citizen of the language.
- Internationalization and unicode support.
- Better Unicode support in STL.
- Full Unicode support.
- Unicode friendly stl support for localization.
- Proper Unicode support. In MS Windows development, virtually all user input is UTF-16LE in the form of wchar_t and variants. I convert that to UTF-8 via wrapper functions that use third-party Unicode libraries (uni-algo in my case) that (can) use std::string. Things that should be simple but aren't in Unicode, like case conversion and case-insensitive comparison, should be provided for. This would reduce the pain point of third-party libraries.
- Unicode very important.
- I would change the way characters and strings are represented. The Rust model is so much better. In practice, that means the character type is not integral, there are no null terminators, and everything is UTF-8 by default.
- STL: missing basic components (filesystem / network / UTF-8 encodings), not specified implementation of e.g. std::string (e.g. Implicit Sharing).
- std::text_encoding.
- utf8 support across platforms.
- char8_t and breaking change of u8"" string literals I've been using a relaxed variant of the "UTF8 everywhere" manifesto in my Windows app with zero problems for over a decade, so std::string rules the roost for UTF-8 with me. C++20 char8_t and breaking u8"" behavior gets in the way. Need to use non-portable techniques of naked UTF-8 string literals via MSVC /utf-8 option.
- make utf8 the one and only type of string in the entire universe!
- remove wide strings since they are not wide enoough on some platforms and just use std::strings as utf8.
- reconsider cases where std:: can do things, but it's a horrible mess (like ASCII->UTF-8) to be more development and readability focussed. deprecate std::Xstream << operator overloading - it's horribly unreasonable for young devs to learn about operator overloading in their hello world apps, and there's 1000 more things wrong with those streams...
- Utf8 std::string.
- Ditching char8_t.
- Whole char8_t fiasco (introduction of this type is a mistake).
Most of these comments aren't particularly actionable; what exactly does providing "better" or "full" Unicode support entail? Others, like those related to UTF-8-ing all the things, aren't feasible. My interpretation of the above is that we can make concrete improvements by doing the following:
- Add support for encoding conversions.
- Add support for charN_t in std::from_chars() and std::to_chars().
- Add support for Unicode-aware case conversions and case-insensitive comparisons.
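To make the second item concrete, here is a rough sketch of the kind of workaround programmers need today to parse a number held in char16_t storage. The helper name parse_int_utf16 is hypothetical and not a proposed interface; it assumes that only ASCII code units can appear in input that std::from_chars() accepts, so it narrows the text into a temporary char buffer first.

```cpp
#include <charconv>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>

// Hypothetical helper (not a standard or proposed facility): parse a decimal
// int from UTF-16 text by narrowing to char, since std::from_chars() only
// accepts char ranges.
std::optional<int> parse_int_utf16(std::u16string_view text) {
    std::string narrow;
    narrow.reserve(text.size());
    for (char16_t cu : text) {
        // Anything outside ASCII cannot be part of a number that
        // std::from_chars() would accept, so reject it up front.
        if (cu > 0x7F) return std::nullopt;
        narrow.push_back(static_cast<char>(cu));
    }
    int value = 0;
    const char *first = narrow.data();
    const char *last = first + narrow.size();
    auto [ptr, ec] = std::from_chars(first, last, value);
    // Require that parsing succeeded and consumed the whole input.
    if (ec != std::errc{} || ptr != last) return std::nullopt;
    return value;
}
```

The extra allocation and copy are exactly the overhead that a charN_t overload of std::from_chars() would remove.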
Much of the following is copy/paste from the agenda sent for the 2024-01-10 SG16 meeting where I had planned for us to discuss P2626R0 but we then didn't due to time constraints.
P2626R0 was last discussed during the 2022-08-24 SG16 meeting. A few requests were made during that meeting and since that are yet to be addressed in a new revision:
- Victor requested that the paper be updated to explicitly state early in the paper what properties of the types must match for the operations to be well-formed.
- Jens asked if the paper includes examples that are reflective of how this facility would be used in something like real world code.
(I'm interpreting this as a request for such examples; the examples in the "Tony table" section of the paper are minimal.)
- Tom requested that the paper include an example of changes that might be made to ICU to use the proposed facilities. E.g., how U_ALIASING_BARRIER and its uses would be changed.
See this SG16 email thread with subject "An alternative interface for P2626R0 ..." from September, 2022 for some alternative considerations.
There are two primary design questions that I would like to see us make progress on.
- How is (or should) the duration of access by one type vs another be managed?
- Should the ability to cast between underlying types be decoupled from the UTF concerns?
Consider the following example, which illustrates a potentially desirable use case: providing a char8_t-based wrapper around an existing function that processes UTF-8 text in char-based storage. The use of cast_as_utf_unchecked(), according to the proposed wording, ends the lifetime of the range of code units in the text array and returns a pointer to a new set of objects constructed in their place (with object representation preserved). Following that cast, access to the array and its elements must be performed via the returned pointer; access via the original object becomes UB.
void process_as_utf8(char *p, size_t N);
inline void process_as_utf8(char8_t *p, size_t N) {
    // Forward the length as well; the char overload takes (pointer, size).
    process_as_utf8(cast_as_utf_unchecked(p, N), N);
}
void f() {
    char8_t text[] = u8"Zoom";
    process_as_utf8(text, sizeof(text));
    CHECK(text[0] == u8'B'); // UB.
}
The paper does not propose an explicit "undo" operation, so it is unclear (at least to me) how valid access through the original object declaration can be restored. Perhaps the intent is that programmers do something like the following to undo a previous cast operation?
inline void process_as_utf8(char8_t *p, size_t N) {
    char *p_as_char = cast_as_utf_unchecked(p, N);
    process_as_utf8(p_as_char, N);
    // Attempt to "undo" the earlier cast; the result is intentionally discarded.
    (void)cast_utf_to<char>(p_as_char, N);
}
What (I think) is missing is any connection to the original declaration of text; I am uncertain that the transparently replaceable rules ([basic.life]p8) suffice to cover this situation. I am concerned about how TBAA is preserved and at what point modifications made via one type are reflected for aliasing purposes by the other type; consider the case of the char overload of process_as_utf8() mutating the string with p[0] = 'B' as the CHECK() operation expects. We may have to seek guidance from CWG for these concerns.
Use of these utilities in real world use cases will, I think, require that the duration of their effects be precisely specified. Since these utilities are intended to be used in constant evaluation, implementations will be required to diagnose UB in cases like the above (during constant evaluation). As is, examples similar to the one above can demonstrate surprising results since, I think, there is no defined point at which mutations are committed. https://godbolt.org/z/MGbfWKWb7 (unfortunately, that fork of Clang is broken at the moment).
I haven't looked at your code, but I fixed the crash. borked merge
Awesome, thank you!
You might prefer to look at this tweak of the code linked above:
https://godbolt.org/z/9Tejj9TPs.
This includes an attempt to "undo" the cast, but demonstrates that
doing so doesn't suffice to avoid the assertion failure.
There are use cases for a facility like the one that is proposed to enable access to an object via an underlying type relationship. For example, to load/store an object of enumeration type via an underlying integer type. Decoupling the cast capabilities from the UTF concerns would enable additional use cases. Given the existence of functions that have a wide contract with respect to well-formed UTF input, is it desirable for the cast facility to be concerned with encoding matters at all?
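As a rough illustration of the enumeration case, the sketch below shows the copy-through-a-temporary approach available today for loading an enumeration via an integer-based interface. Color, load_u8, and load_color are hypothetical names chosen for the example; a decoupled cast facility would presumably allow the integer-based function to write into the enumeration object directly.

```cpp
#include <cstdint>
#include <type_traits>

enum class Color : std::uint8_t { Red = 1, Green = 2 };

// Hypothetical existing interface that only knows about the integer type.
void load_u8(std::uint8_t *out) { *out = 2; }

// Today: load into a temporary of the underlying type, then convert by value.
Color load_color() {
    std::underlying_type_t<Color> raw{};
    load_u8(&raw);
    return static_cast<Color>(raw);
}
```

Nothing in this use case involves text or encodings, which is the motivation for asking whether the cast capability should be specified separately from the UTF concerns.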
Before we all repeat our arguments, can I suggest that we first figure out with Core the object model semantics and the constraints that we can work within? (AFAIK, these constraints don't allow much more safety than an API along the lines of what I am proposing.)
Yes, absolutely.
Tom.
> Does providing two cast operations (cast_as_utf, cast_utf_to) help to prevent programming mistakes?
It is certainly the goal. We should be careful not to punch too hard through the strong types we tried to create, otherwise it's a bit self-defeating. Note that I am proposing cast_as_utf_unchecked in the hope of having a cast_as_utf with preconditions later (with the benefit of a scary name).
Two assumptions I keep making are: 1/ Users are confused about encodings in C++ and by UTF-8. 2/ Correctness is a secondary concern to "getting things done" in a lot of situations. This explosive combination calls for some safeguards, even if the safeguard is effectively a sticker on the blade of a chainsaw.
Tom.
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16