Date: Thu, 10 Jul 2025 14:17:02 -0700
Hi Thiago,
I'd swear you replied pointing out that I was using a colloquial (app-level?) sense of the term "normalization" rather than Unicode normalization, which is absolutely correct; my usage lacked the precision that is really needed for a thread discussing standardization of Unicode in C++ :D
I was very much of the opinion that constexpr <anything unicode> is not a great idea, because it reintroduces the problem of “how does C++ reference other standards that release new versions on a different schedule?” and of course “would devs expect the version of Unicode being applied to depend on which version of C++ they have specified?”. (You could imagine a QoI/QoL compiler flag to specify exact Unicode versions and/or “latest”, but that seems like a recipe for confusion.)
My original thinking was that, for the referenced case-flattening use cases, the feature people actually want is not “case-insensitive” X, but rather two different things:
1. A user has provided a string, and we want to search for that string using idiomatic conversions like case flattening, equivalent characters (ss vs ß, \ vs ¥, and similar), bidi/RTL, etc.
2. “Are these filenames the same according to the file system?” That is not necessarily any single concept like ASCII toupper/tolower or Unicode case flattening, but rather some ossified set of rules/choices that were made years ago and can never be changed.
Having bidi/RTL-aware search, as well as code point and character iteration, as part of the STL would seem like a big win even if the exact code points and characters could change over time.
My inclination is that, beyond those fundamentals, any other operations you might perform on a Unicode string should be expressed in terms of the basic set of Unicode operations (in principle a more C++-friendly version of the ICU APIs). But I still did not think these could reasonably be constexpr due to the aforementioned issues.
Originally I thought it was just a bit irksome that the edge cases of Unicode version changes meant we would not permit otherwise basic operations in constexpr functions, but I’ve realized the original thread’s focus on capitalization blinded me to the more practical way a developer might want to write code.
It is perfectly reasonable for a developer to write a function (the actual function here is obviously just a stand-in :D ):
// Imagine we've solved object lifetimes
std::string_view takeNCharacters(std::string_view foo, int N) {
    return foo.substr(0, N);
}
Then realize they want to use it in a consteval environment, so they do:
constexpr std::string_view takeNCharacters(std::string_view foo, int N) {
    return foo.substr(0, N);
}
Now imagine we add Unicode iterators, something like:
struct unicode_grapheme_iterator {
    ...
};
struct unicode_codepoint_iterator {
    ...
};
class string_view {
    ...
    // again, imagine we’ve solved lifetime so the string_view stays alive :D
    unicode_codepoint_iterator codepoints() const;
    // Odds are devs would prefer this to be “characters” or similar :D
    unicode_grapheme_iterator graphemes() const;
    ...
};
Our conscientious developer decides that they want to do the right thing for their use case and tries to do
constexpr std::string_view takeNCharacters(std::string_view foo, int n) {
    // terrible code, forgive me
    auto graphemes = foo.graphemes();
    auto it = graphemes.begin();
    for (int i = 0; i < n && it != graphemes.end(); ++i, ++it) {
        // Out of curiosity, is there a cleaner way to do this silliness?
    }
    // I recall substr takes indices, but humor me
    return foo.substr(graphemes.begin(), it);
}
Only now this can’t be constexpr due to the use of the Unicode interface, even if the consteval contexts don’t actually involve strings containing any complex Unicode (maybe the compile-time strings are always plain ASCII, which seems plausible in many cases). Now the developer has to choose between keeping a single, non-constexpr implementation, or adding an `if consteval` block that just does ASCII enumeration. Neither option is particularly palatable.
In principle, code point enumeration can easily be constexpr, since the encoding does not change; in practice, though, most people who care about Unicode want to enumerate graphemes rather than code points.
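As a sanity check on the “code point enumeration is version-stable” claim: UTF-8 decoding is fixed by the encoding itself, so something like the following toy counter (which assumes well-formed UTF-8) needs no Unicode tables at all:

```cpp
#include <cstddef>
#include <string_view>

// Counts UTF-8 code points by counting non-continuation bytes
// (continuation bytes have the bit pattern 10xxxxxx). Assumes
// well-formed UTF-8; no Unicode version data is needed.
constexpr std::size_t countCodepoints(std::string_view s) {
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}
static_assert(countCodepoints("h\xC3\xA9llo") == 5); // "héllo"
```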
We could “simplify” the exciting emoji case by assuming any sequence of code point + ZWJ + … is a single grapheme. More annoyingly, I think diacritics _would_ require a table/Unicode reference, as in principle the set can grow over time - but I don’t know whether that is a real risk?
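To make that ZWJ simplification concrete, here is a toy counter under exactly that assumption (well-formed UTF-8 only, combining marks ignored, so this is emphatically not real grapheme segmentation):

```cpp
#include <cstddef>
#include <string_view>

// Length of a UTF-8 sequence from its lead byte; assumes well-formed input.
constexpr std::size_t codepointLen(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    return 4;
}

// Is U+200D (ZWJ, bytes E2 80 8D in UTF-8) at byte offset i?
constexpr bool isZWJAt(std::string_view s, std::size_t i) {
    return i + 2 < s.size() &&
           static_cast<unsigned char>(s[i])     == 0xE2 &&
           static_cast<unsigned char>(s[i + 1]) == 0x80 &&
           static_cast<unsigned char>(s[i + 2]) == 0x8D;
}

// Toy "grapheme" count under the simplification from the text: any run of
// code points joined by ZWJ collapses into one grapheme. Combining marks
// are ignored, so this is not real segmentation.
constexpr std::size_t naiveGraphemeCount(std::string_view s) {
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        ++count;  // start of a new "grapheme"
        i += codepointLen(static_cast<unsigned char>(s[i]));
        // swallow any ZWJ-joined continuation code points
        while (isZWJAt(s, i)) {
            i += 3;  // skip the ZWJ itself
            if (i < s.size())
                i += codepointLen(static_cast<unsigned char>(s[i]));
        }
    }
    return count;
}
```

This much really could be constexpr today; it is the table-driven cases (diacritics, full UAX #29 segmentation) that drag Unicode versioning back in.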
Another option would be to make the “unicode” interfaces constexpr only for (sigh) something like ASCII, on the vague assumption/hope that consteval cases are not expected to contain arbitrary Unicode - which still seems awful, and very much a callback to the “everyone uses the English alphabet” core of more or less all string processing currently present in C and C++.
—Oliver
Received on 2025-07-10 21:17:15