SG16 will hold a meeting on Wednesday, May 22nd, at 19:30 UTC (timezone conversion).
The agenda follows.
The results of the 2024 C++ Developer Survey were recently posted
(summary results, detailed results). Question 6, "Which of these do
you find
frustrating about C++ development?", included a new response
category this year, "Unicode, internationalization, and
localization". Of the 17 categories, this one ranked 12th. The
responses broke down as follows:
- Major pain point: 16.56%, 205 respondents.
- Minor pain point: 29.32%, 363 respondents.
- Not a significant issue: 54.12%, 670 respondents.
Approximately 46% of respondents claimed this category as a pain
point. Not that we weren't already aware, but we clearly have work
to do.
I audited the write-in responses that mentioned SG16-related
terminology (Unicode, character, encoding, UTF, charN_t,
text). The relevant comments follow; the portions in bold are
comments with realistic and clearly actionable complaints,
requests, or suggestions.
- Unicode seems to be progressing nicely.
- Unicode support. It should be standard in C++11, let alone
<current year>.
- Lack of Unicode - no (clean and efficient) std
functionality for converting from UTF-8 (char[8_t])/UTF-16
(char16_t)/UTF-32 (char32_t) to any of the other types Lack
of other character support: - like [to/from]_chars only
supporting char, not char16_t or char32_t even though it
is based on implementations that do support all char types (at
least the from_chars).
- Some of the features I expected have not come out (network,
Unicode support and so on) whereas it's already part of other
languages standard library
- Removed unicode support
- Valuable committee time is wasted in discussing such
facilities while there is STILL no reasonable Unicode support
(we are talking about text, simple text!).
- Basic things like networking and unicode helpers are not
present in standard library.
- I guess first class utf8/unicode support not improving as fast
as I'd like it to. In 2024 it's still not very easy to write
Unicode-aware apps which seamlessly deal with encodings,
conversions between encodings - none of this is a first-class
citizen of the language.
- Internationalization and unicode support.
- Better Unicode support in STL.
- Full Unicode support.
- Unicode friendly stl support for localization.
- Proper Unicode support. In MS Windows development, virtually
all user input is UTF-16LE in the form of wchar_t and variants.
I convert that to UTF-8 via wrapper functions that use
third-party Unicode libraries (uni-algo in my case) that (can)
use std::string. Things that should be simple but aren't in
Unicode, like case conversion and case-insensitive comparison,
should be provided for. This would reduce the pain point
of third-party libraries.
- Unicode very important.
- I would change the way characters and strings are represented.
The Rust model is so much better. In practice, that means the
character type is not integral, there are no null terminators,
and everything is UTF-8 by default.
- STL: missing basic components (filesystem / network / UTF-8
encodings), not specified implementation of e.g. std::string
(e.g. Implicit Sharing).
- std::text_encoding.
- utf8 support across platforms.
- char8_t and breaking change of u8"" string literals I've been
using a relaxed variant of the "UTF8 everywhere" manifesto in my
Windows app with zero problems for over a decade, so std::string
rules the roost for UTF-8 with me. C++20 char8_t and breaking
u8"" behavior gets in the way. Need to use non-portable
techniques of naked UTF-8 string literals via MSVC /utf-8
option.
- make utf8 the one and only type of string in the entire
universe!
- remove wide strings since they are not wide enoough on some
platforms and just use std::strings as utf8.
- reconsider cases where std:: can do things, but it's a
horrible mess (like ASCII->UTF-8) to be more development and
readability focussed. deprecate std::Xstream << operator
overloading - it's horribly unreasonable for young devs to learn
about operator overloading in their hello world apps, and
there's 1000 more things wrong with those streams...
- Utf8 std::string.
- Ditching char8_t.
- Whole char8_t fiasco (introduction of this type is a mistake).
Most of these comments aren't particularly actionable; what
exactly does providing "better" or "full" Unicode support entail?
Others, like those related to UTF-8-ing all the things, aren't
feasible. My interpretation of the above is that we can make
concrete improvements by doing the following:
- Add support for encoding conversions.
- Add support for charN_t in std::from_chars() and std::to_chars().
- Add support for Unicode-aware case conversions and
case-insensitive comparisons.
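As an illustration of the second item, the following is a minimal sketch of the kind of workaround needed today to produce char16_t digits: format with the char-only std::to_chars() and then widen each code unit. The helper name to_u16string is hypothetical (it is not a standard or proposed API), and the widening step assumes an ASCII-compatible narrow literal encoding.

```cpp
#include <cassert>
#include <charconv>
#include <string>
#include <system_error>

// Hypothetical helper, not a standard or proposed API: formats an int as
// UTF-16 code units by going through the char-only std::to_chars().
std::u16string to_u16string(int value) {
    char buf[16];  // ample room for a 32-bit int in base 10, sign included
    std::to_chars_result r = std::to_chars(buf, buf + sizeof(buf), value);
    assert(r.ec == std::errc());  // cannot fail with a buffer this large

    // Assumption: the digits and '-' have their ASCII values, so each char
    // can be widened to a char16_t code unit by plain value conversion.
    return std::u16string(buf, r.ptr);
}
```

A char16_t overload of std::to_chars() would presumably remove both the intermediate buffer and the encoding assumption.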
Much of the following is copied from the agenda sent for the
2024-01-10 SG16 meeting, where I had planned for us to discuss
P2626R0; we did not get to it due to time constraints.
P2626R0 was last
discussed during the 2022-08-24
SG16 meeting. A few requests made during and since that meeting
have yet to be addressed in a new revision:
- Victor requested that the paper be updated to explicitly state
early in the paper what properties of the types must match for
the operations to be well-formed.
- Jens asked if the paper includes examples that are reflective
of how this facility would be used in something like real-world
code.
(I'm interpreting this as a request for such examples; the
examples in the "Tony table" section of the paper are minimal.)
- Tom requested that the paper include an example of changes
that might be made to ICU to use the proposed facilities. E.g.,
how U_ALIASING_BARRIER
and its uses would be changed.
See this
SG16 email thread with subject "An alternative interface for
P2626R0 ..." from September, 2022 for some alternative
considerations.
There are two primary design questions that I would like to see
us make progress on.
- How is the duration of access by one type versus another
managed, and how should it be?
- Should the ability to cast between underlying types be
decoupled from the UTF concerns?
Consider the following example that illustrates a potentially
desirable use case: providing a char8_t-based wrapper around an
existing function that processes UTF-8 text in char-based storage.
The use of cast_as_utf_unchecked(), according to the proposed
wording, ends the lifetime of the range of code units in the text
array and returns a pointer to a new set of objects constructed in
their place (with object representation preserved). Following that
cast, access to the array and its elements must be performed via the
returned pointer; access via the original object becomes UB.
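    // Existing function that processes UTF-8 text held in char storage.
    void process_as_utf8(char *p, size_t N);
    // char8_t-based wrapper implemented via the proposed cast.
    inline void process_as_utf8(char8_t *p, size_t N) {
      process_as_utf8(cast_as_utf_unchecked(p, N), N);
    }
    void f() {
      char8_t text[] = u8"Zoom";
      process_as_utf8(text, sizeof(text));
      CHECK(text[0] == u8'B'); // UB.
    }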
The paper does not propose an explicit "undo" operation, so it is
unclear (at least to me) how valid access through the original
object declaration can be restored. Perhaps the intent is that
programmers do something like the following to undo a previous
cast operation?
    inline void process_as_utf8(char8_t *p, size_t N) {
      char *p_as_char = cast_as_utf_unchecked(p, N);
      process_as_utf8(p_as_char, N);
      // Cast again in an attempt to undo the effect of the first cast.
      (void)cast_utf_to<char>(p_as_char, N);
    }
What (I think) is missing is any connection to the original
declaration of text; I am uncertain
that the transparently replaceable rules ([basic.life]p8)
suffice to cover this situation. I am concerned about how TBAA is
preserved and at what point modifications made via one type are
reflected for aliasing purposes by the other type; consider the
case of the char overload of process_as_utf8() mutating the string
with p[0] = 'B' as the CHECK() operation expects. We may have
to seek guidance from CWG for these concerns.
Use of these utilities in real-world use cases will, I think,
require that the duration of their effects be precisely specified.
Since these utilities are intended to be used in constant
evaluation, implementations will be required to diagnose UB in
cases like the above (during constant evaluation). As is, examples
similar to the one above can demonstrate surprising results since,
I think, there is no defined point at which mutations are
committed. See https://godbolt.org/z/MGbfWKWb7
(unfortunately, that fork of Clang is broken at the moment).
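For comparison, the following is a minimal sketch of a copy-in/copy-out wrapper that is well-defined today and makes the commit point explicit, at the cost of two copies. It uses char16_t and std::uint_least16_t rather than char8_t and char so that it compiles as C++11; the function name process_as_utf16 and its 'B'-writing body are hypothetical stand-ins for an existing API, not anything from P2626R0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an existing API that works on integer code units.
void process_as_utf16(std::uint_least16_t *p, std::size_t n) {
    if (n > 0) p[0] = 0x42;  // overwrite the first code unit with 'B'
}

// Wrapper that never accesses the char16_t array through another type.
// Mutations become visible to the caller at exactly one point: the second
// memcpy, which copies the processed code units back.
inline void process_as_utf16(char16_t *p, std::size_t n) {
    static_assert(sizeof(char16_t) == sizeof(std::uint_least16_t),
                  "char16_t shares its size with uint_least16_t");
    assert(p != nullptr || n == 0);
    if (n == 0) return;
    std::vector<std::uint_least16_t> tmp(n);
    std::memcpy(tmp.data(), p, n * sizeof(char16_t));  // copy in
    process_as_utf16(tmp.data(), n);                   // integer overload
    std::memcpy(p, tmp.data(), n * sizeof(char16_t));  // copy out (commit)
}
```

The proposed cast facilities aim to avoid exactly these copies, which is why the point at which changes made through one type become visible through the other needs to be specified.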
There are use cases for a facility like the one that is proposed to
enable access to an object via an underlying type relationship. For
example, to load/store an object of enumeration type via an
underlying integer type. Decoupling the cast capabilities from the
UTF concerns would enable additional use cases. Given the existence
of functions that have a wide contract with respect to well-formed
UTF input, is it desirable for the cast facility to be concerned
with encoding matters at all? Does providing two cast operations (cast_as_utf, cast_utf_to)
help to prevent programming mistakes?
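To make the enumeration use case concrete, the following is a minimal sketch of how an object of enumeration type can be loaded and stored via its underlying integer type today, using std::memcpy rather than a cast. The names Color, load_underlying, and store_underlying are hypothetical and are not part of P2626R0.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

enum class Color : std::uint32_t { red = 1, green = 2, blue = 3 };

// Hypothetical helper: reads an enumeration object as its underlying type.
template <typename E>
typename std::underlying_type<E>::type load_underlying(const E &e) {
    static_assert(std::is_enum<E>::value, "E must be an enumeration type");
    typename std::underlying_type<E>::type v;
    std::memcpy(&v, &e, sizeof(v));  // sizeof(E) equals sizeof(underlying type)
    return v;
}

// Hypothetical helper: writes an enumeration object via its underlying type.
template <typename E>
void store_underlying(E &e, typename std::underlying_type<E>::type v) {
    static_assert(std::is_enum<E>::value, "E must be an enumeration type");
    std::memcpy(&e, &v, sizeof(v));
}
```

A cast facility decoupled from UTF concerns could presumably express the same access in place, without the copy and without any reference to encodings.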
Tom.