ISOCPP std-proposals List: Re: [std-proposals] constexpr tolower, toupper, isalpha

From: JJ Marr <jjmarr_at_[hidden]>
Date: Thu, 10 Jul 2025 17:54:55 -0400

Unicode normalization stability means an already-assigned codepoint won't
change its normalized form.

If we're willing to say it's ill-formed to call Unicode functions with
unassigned Unicode codepoints, does this problem get simpler?

A valid invocation of a normalization function should then never change its
result.
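
To make that concrete, here is a minimal sketch of the idea, not proposed wording. The names is_assigned_in_snapshot and checked_normalize are hypothetical, and the "assigned" table is a toy stand-in (U+0000..U+00FF only), not real Unicode data. A throw reached during constant evaluation is the closest library-level approximation of "ill-formed" that I know of:

```cpp
#include <cassert>
#include <stdexcept>

// Toy stand-in for "assigned in the Unicode version the library was built
// against". U+0000..U+00FF is used only because every code point in that
// range is assigned; a real table would come from the Unicode Character
// Database and is NOT reproduced here.
constexpr bool is_assigned_in_snapshot(char32_t cp) {
    return cp <= 0x00FF;
}

// Hypothetical entry point: rejects code points outside the snapshot before
// doing any Unicode work. Returning the input unchanged stands in for the
// real normalization step.
constexpr char32_t checked_normalize(char32_t cp) {
    if (!is_assigned_in_snapshot(cp)) {
        // Reaching a throw during constant evaluation means the call is not
        // a constant expression, so the compiler rejects it.
        throw std::invalid_argument("code point not assigned in snapshot");
    }
    return cp;
}

// Inside the snapshot: usable at compile time.
static_assert(checked_normalize(U'A') == U'A');

// Outside the toy table: the line below would fail to compile if enabled.
// static_assert(checked_normalize(char32_t{0x0378}) == char32_t{0x0378});
```

At run time the same call throws instead, so a real design would still have to decide what happens there.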

On Thu, Jul 10, 2025, 5:17 p.m. Oliver Hunt via Std-Proposals <
std-proposals_at_[hidden]> wrote:

> Hi Thiago,
>
> I'd swear you replied pointing out that I was using a colloquial (app
> level?) version of the term "normalization" rather than Unicode
> normalization, which is absolutely correct: my use lacked the precision
> that is really needed for a thread discussing standardization of Unicode in
> C++ :D
>
> I was very much of the opinion that constexpr <anything unicode> is not a
> great idea because it reintroduces the problem of “how does C++ reference
> other standards that release new versions on a different schedule” and of
> course “would devs expect the version of unicode being applied to be
> dependent on what version of C++ they have specified?” (You could imagine a
> QoI/QoL compiler flag to specify exact unicode versions and/or “latest”,
> but it seems like it would be a recipe for confusion.)
>
> My original thinking was that, for the case flattening use cases
> referenced earlier, the feature people want is not “case insensitive” X,
> but rather two different things:
>
> 1. A user has provided a string, we want to do a search for that string
> using idiomatic conversions like case flattening, equivalent characters (ss
> vs ß, \ vs ¥, and similar), bidi/rtl, etc
> 2. Are these filenames the same according to the file system? Which is not
> necessarily any single concept like ascii toupper/lower, unicode case
> flattening, but rather some ossified set of rules/choices that were made
> years ago and cannot ever be changed
>
> Having bidi+rtl-aware search as well as code points and character
> iteration as part of the STL would seem like a big win even if the exact
> code points and characters could change over time.
>
> My inclination is that beyond those fundamentals any other operations you
> might perform on a unicode string should be done in terms of the basic set
> of unicode operations (in principle a more C++ friendly version of the ICU
> APIs). But I still did not think these could be reasonably constexpr due to
> the aforementioned issues.
>
> Originally I thought that it was just a bit irksome that the edge cases of
> unicode version changes meant we would not permit otherwise basic
> operations in constexpr functions, but I’ve realized the original thread
> focus on capitalization blinded me to the more practical way a developer
> might want to write code.
>
> It is perfectly reasonable for a developer to write a function (the actual
> function here is just a stand in obviously :D )
>
> // Imagine we've solved object lifetimes
> std::string_view takeNCharacters(std::string_view foo, int N) {
> return foo.substr(0, N);
> }
>
> Then realize they want to use it in a consteval environment so do
>
> constexpr std::string_view takeNCharacters(std::string_view foo, int N) {
> return foo.substr(0, N);
> }
>
> Now imagine we add unicode iterators something like
>
> struct unicode_grapheme_iterator {
> ...
> };
> struct unicode_codepoint_iterator {
> ...
> };
>
> class string_view {
> ...
> // again, imagine we’ve solved lifetime so the string_view stays alive :D
> unicode_codepoint_iterator codepoints() const;
>
> // Odds are devs would prefer this to be “characters” or similar :D
> unicode_grapheme_iterator graphemes() const;
> ...
> };
>
>
> Our conscientious developer decides that they want to do the right thing
> for their use case and tries to do
>
> constexpr std::string_view takeNCharacters(std::string_view foo, int n) {
> // terrible code, forgive me
> auto iter = foo.graphemes();
> auto begin = iter.begin();
> auto ptr = begin;
> for (int i = 0; i < n && ptr != iter.end(); ++i, ++ptr) {
> // Out of curiosity is there a cleaner way to do this silliness?
> }
> // I recall substr takes indices, but humor me
> return foo.substr(begin, ptr);
> }
>
> Only now this can’t be constexpr due to the use of the unicode interface,
> even if the consteval contexts don’t actually require/involve strings that
> have any complex unicode (maybe the compile time strings are always plain
> ascii which seems plausible in many cases). Now the developer has to choose
> between having a single implementation, or having to add an `if consteval`
> block that just does ascii enumeration. Neither option is particularly
> palatable.
>
> In principle code point enumeration can be easily constexpr as it does not
> change, but in practice most cases where people care about unicode they are
> wanting to enumerate graphemes rather than code points.
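
As a rough illustration of the code point half of that claim: the sketch below takes the first n code points of a UTF-8 string entirely in constexpr, using only the fixed encoding rule that continuation bytes look like 10xxxxxx. The name take_n_code_points is hypothetical, the input is assumed to be valid UTF-8, and nothing here addresses graphemes.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// True when the byte starts a new code point in UTF-8, i.e. it is not a
// continuation byte of the form 10xxxxxx.
constexpr bool is_utf8_lead_byte(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) != 0x80;
}

// Returns the prefix of `text` holding its first `n` code points.
// Assumes `text` is valid UTF-8; no validation is attempted.
constexpr std::string_view take_n_code_points(std::string_view text,
                                              std::size_t n) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (is_utf8_lead_byte(text[i])) {
            if (seen == n) {
                return text.substr(0, i);  // stop before code point n + 1
            }
            ++seen;
        }
    }
    return text;  // the string holds n code points or fewer
}

static_assert(take_n_code_points("hello", 2) == "he");
// "\xC3\xA9" is the UTF-8 encoding of U+00E9.
static_assert(take_n_code_points("\xC3\xA9" "tude", 1) == "\xC3\xA9");
```

No Unicode data table is involved, which is why this part can stay constexpr regardless of Unicode version.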
>
> We could “simplify” the exciting emoji case by assuming any sequence of
> code point + ZWJ +… is a single grapheme. More annoyingly I think
> diacritics _would_ require a table/unicode reference as in principle the
> set can grow over time - but I don’t know if that is a real risk?
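
A sketch of what that ZWJ simplification could look like, working on already-decoded code points. The function name count_zwj_clusters is hypothetical, and this is deliberately not the UAX #29 algorithm: it ignores diacritics and every other rule, and only glues together code points joined by U+200D.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// U+200D ZERO WIDTH JOINER.
constexpr char32_t kZwj = 0x200D;

// Counts "clusters" under the simplification that any run of the form
// code point + ZWJ + code point + ... is a single unit. Everything else
// counts as one cluster per code point.
constexpr std::size_t count_zwj_clusters(std::u32string_view cps) {
    std::size_t clusters = 0;
    for (std::size_t i = 0; i < cps.size(); ++i) {
        // A code point continues the previous cluster if it is a ZWJ or
        // directly follows one.
        const bool joins_previous =
            i > 0 && (cps[i] == kZwj || cps[i - 1] == kZwj);
        if (!joins_previous) {
            ++clusters;
        }
    }
    return clusters;
}

static_assert(count_zwj_clusters(U"abc") == 3);
// MAN + ZWJ + WOMAN + ZWJ + GIRL, followed by '!': two clusters.
static_assert(
    count_zwj_clusters(U"\U0001F468\u200D\U0001F469\u200D\U0001F467!") == 2);
```

The only constant it needs is the ZWJ code point itself, so it does not depend on a Unicode version, at the cost of being wrong for anything that needs the real segmentation tables.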
>
> Another option would be to make the “unicode” interfaces only be constexpr
> for (sigh) something like ascii on the vague assumption/hope that consteval
> cases are not expected to contain arbitrary unicode - which still seems
> awful and very much a callback to the “everyone uses the English alphabet”
> core of more or less all string processing currently present in C and C++.
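
One way the ASCII-only option might be spelled, as a sketch only. Both function names are hypothetical, and take_n_graphemes_runtime is a placeholder body standing in for real table-driven segmentation. The function is constexpr, but constant evaluation only succeeds when the input is pure ASCII, because the other branch calls a non-constexpr function:

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical runtime-only routine standing in for table-driven grapheme
// segmentation. The body is a placeholder that returns its input unchanged;
// it is deliberately NOT constexpr.
inline std::string_view take_n_graphemes_runtime(std::string_view text,
                                                 std::size_t /*n*/) {
    return text;
}

constexpr bool is_ascii(std::string_view text) {
    for (char c : text) {
        if (static_cast<unsigned char>(c) > 0x7F) {
            return false;
        }
    }
    return true;
}

constexpr std::string_view take_n_characters(std::string_view text,
                                             std::size_t n) {
    if (is_ascii(text)) {
        // ASCII path: one byte per character. This ignores CR LF, which
        // UAX #29 treats as a single cluster.
        return text.substr(0, n);
    }
    // Non-ASCII input: fine at run time, but reaching this call during
    // constant evaluation makes the expression non-constant.
    return take_n_graphemes_runtime(text, n);
}

static_assert(take_n_characters("constexpr", 5) == "const");
// Would not compile: the input contains non-ASCII bytes.
// static_assert(take_n_characters("\xC3\xA9" "tude", 1) == "\xC3\xA9");
```

Note that this selects the path by content rather than by context, so it differs from an `if consteval` block, but it has the same "English alphabet only at compile time" smell.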
>
> —Oliver

Received on 2025-07-10 21:55:13