std-proposals

Re: [std-proposals] constexpr tolower, toupper, isalpha

From: Oliver Hunt <oliver_at_[hidden]>
Date: Wed, 09 Jul 2025 12:13:41 -0700
I re-read this reply and it’s unreasonably mean/antagonistic so I apologize for that.

The core issue here is that normalization is much more complex than just capitalization. It’s easy to think of capitalization as the quintessential normalization that everything needs if you’re used to languages where it is essentially the only “no op” normalization. This is an issue I’ve repeatedly encountered over the years, so I think I reflexively responded unreasonably.

I think a more reasonable response would have been:

Real-world normalization is non-trivial, and it’s easy for people used to a single form of normalization to forget that other normalization schemes exist that are equally important for other languages. Given that languages like C and C++ have historically made mistakes in assuming Eurocentric (or, more realistically, Anglocentric) approaches to language, it’s important for us to avoid repeating the mistakes of the past.

The reason ICU has such a complex API is that such an API is necessary for dealing with both real-world language and Unicode (whether these are strictly the same is up for debate :D). *If* we were to integrate Unicode directly into C++, it would realistically mean incorporating C++-appropriate interfaces for the core Unicode operations, not the actual tables or specific Unicode releases. This would permit the standard library to forward to its own implementation of the Unicode standard, the OS implementation, etc. (which would allow a binary to be sure its interpretation of Unicode matches the OS).
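To make the "interfaces, not tables" idea concrete, here is a rough sketch of what I mean. Every name in it (`case_folding_backend`, `ascii_only_backend`, `equal_folded`) is made up for illustration and is not existing or proposed standard API, and the ASCII-only backend is just a stand-in for a real implementation that would forward to the OS or to something like ICU.

```cpp
#include <string>
#include <string_view>

// Hypothetical interface: callers see only this, never the Unicode tables.
class case_folding_backend {
public:
    virtual ~case_folding_backend() = default;
    // Returns a case-folded copy of UTF-8 encoded text.
    virtual std::string fold(std::string_view utf8) const = 0;
};

// Stand-in backend that folds only ASCII capitals. A real backend would
// forward to the platform's Unicode implementation instead.
class ascii_only_backend final : public case_folding_backend {
public:
    std::string fold(std::string_view utf8) const override {
        std::string out(utf8);
        for (char& c : out) {
            if (c >= 'A' && c <= 'Z') c = static_cast<char>(c - 'A' + 'a');
        }
        return out;
    }
};

// Comparison written against the interface, so swapping the backend
// (and therefore the Unicode version) does not change the caller.
bool equal_folded(const case_folding_backend& backend,
                  std::string_view a, std::string_view b) {
    return backend.fold(a) == backend.fold(b);
}
```

The point of the indirection is that the answer comes from whichever implementation sits behind the interface, so a program is not tied to whatever tables existed when it was compiled.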

With respect to the examples of dealing with case-insensitive file systems, I do not believe that Unicode normalization is actually the correct behavior. The problem is that different filesystems have different ideas of what an equivalent file name is. Older filesystems take an approach similar to tolower/toupper, which means that the previously mentioned cases of non-ASCII case equivalence would not be correct (especially when considering whether files are meant to be overwritten, etc.). The reality is that the evolution of file systems over time likely means that any attempt to do case-insensitive searches/comparisons on a given filesystem would need to be deferred to the filesystem itself, not the application layer.
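As a small illustration of the tolower/toupper-style approach, here is a minimal sketch of an ASCII-only, constexpr-friendly comparison. The helper names `ascii_fold` and `ascii_iequals` are hypothetical, and the non-ASCII text is assumed to be UTF-8 encoded. It treats `README.TXT` and `readme.txt` as equal, but it has no idea that `STRASSE` and `straße` might be considered equivalent.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: folds only the 26 ASCII capitals and leaves every
// other byte (including UTF-8 lead and continuation bytes) untouched.
constexpr char ascii_fold(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Hypothetical helper: byte-wise comparison after ASCII folding.
constexpr bool ascii_iequals(std::string_view a, std::string_view b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (ascii_fold(a[i]) != ascii_fold(b[i])) return false;
    }
    return true;
}

// Plain ASCII names compare equal regardless of case.
static_assert(ascii_iequals("README.TXT", "readme.txt"));

// "straße" in UTF-8: ß is the two bytes 0xC3 0x9F. The literal is split so
// the following 'e' is not parsed as part of the hex escape.
static_assert(!ascii_iequals("STRASSE", "stra\xC3\x9F" "e"));
```

This much is easy to evaluate at compile time precisely because it ignores everything outside ASCII; whether its answers agree with any particular filesystem is a separate question.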

I think it’s best to think of this as two distinct queries that matter: “does this filename match the query?” (in which case things like the ss vs ß case might matter) and “are these filenames referring to the same file?”. The former is not necessarily solved just by case flattening; the latter requires communicating with the filesystem, not just applying some arbitrary case flattening of your own selection. If I recall correctly, the (higher level) macOS file system APIs allow you to specify predicates to be used when examining file names that abstract the interaction with the underlying filesystem, and I suspect any C++ API that wants to support case-sensitive vs case-insensitive file systems would need to provide that kind of abstraction as well.
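For the second query, C++17 already has one way to ask the filesystem rather than compare strings: `std::filesystem::equivalent`. The sketch below is only an illustration; the helpers `refers_to_same_file` and `demo_same_file` and the scratch directory name are made up for the example. Whether two spellings that differ only in case come back as equivalent depends on the filesystem the paths live on, which is exactly the point.

```cpp
#include <filesystem>
#include <fstream>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: lets the filesystem decide whether two paths name the
// same file, instead of comparing the spellings ourselves.
bool refers_to_same_file(const fs::path& a, const fs::path& b) {
    std::error_code ec;
    const bool same = fs::equivalent(a, b, ec);
    return !ec && same;  // treat errors (e.g. neither path exists) as "no"
}

// Hypothetical demo: creates a scratch file and compares two spellings of it.
bool demo_same_file() {
    const fs::path dir = fs::temp_directory_path() / "same_file_demo";
    fs::create_directories(dir);
    const fs::path plain = dir / "readme.txt";
    {
        std::ofstream out(plain);
        out << "x";
    }
    // A different spelling of the same location; string comparison would say
    // these differ, the filesystem says they are the same file.
    const fs::path dotted = dir / "." / "readme.txt";
    const bool result = refers_to_same_file(plain, dotted);
    fs::remove_all(dir);
    return result;
}
```

Note that this only answers “are these the same file?” for paths that exist. It does nothing for the first query, matching a filename against a search string.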

Again, apologies for the tone in my original reply.

Cheers,
—Oliver

> On Jul 8, 2025, at 10:19 PM, Oliver Hunt <oliver_at_[hidden]> wrote:
>
>
>
>> On Jul 8, 2025, at 5:25 PM, JJ Marr via Std-Proposals <std-proposals_at_[hidden]> wrote:
>>
>>> Yeah, but... Why do you want it?
>>
>> I want the ability to do case-insensitive comparisons for the majority of human languages and letters without having to think too much about it, get approval to bring in an external library, or deal with implementation-defined behaviour.
>
> We get that you care about character-point based capitalization as that’s the only form of normalization that you believe matters for the majority of human languages, despite largely being restricted to a specific subset of European and Mediterranean language families (and even then capitalization is not the only thing you might be expected to normalize: consider things like `ss` vs `ß` which is an equivalence people expect when searching text). I also know that there are normalizations expected for search in other scripts but I only worked extensively on the text entry part of that and my experience of what people expect for text normalization was limited to search field behaviour for European languages.
>
> So far you’ve assumed capitalization is trivial, you’ve assumed it’s a universal form of normalization, and that it’s the only normalization that matters and you’ve rejected basically every response that disagrees with that.
>
> Handling Unicode is not something that is reasonably offloaded onto the compiler, which is what constexpr-ification requires. Languages that support Unicode do that work at runtime due to precisely this complexity.
>
> —Oliver
>
>

Received on 2025-07-09 19:13:54