ISOCPP std-proposals List: Re: [std-proposals] constexpr tolower, toupper, isalpha

From: Tiago Freire <tmiguelf_at_[hidden]>
Date: Wed, 9 Jul 2025 14:30:37 +0000

> Part of the reason there's so many diverging implementations of "case insensitivity" is the lack of standardization. Here, the standard exists (Unicode). It is freely accessible, portable, and handles all of the edge cases we are discussing. We just need to implement it.

Yes, but the point I wanted to make is that one reason to want this feature is to deal with these interfaces, and these interfaces will not change just because this feature is provided.
The argument that this feature should be provided in order to use these interfaces is misguided, as it will not give the correct "case-insensitive" handling for them. It sounds similar enough that people will confuse the two and shoot themselves in the foot.
I.e. not a valid reason.

> I write command-line applications at my day job. Sometimes I want user input/arguments to be case-insensitive because it's less stuff to remember. It's not a word processor or anything complex.

Which is valid, although unusual in the command-line processing space.
And given that you only have to deal with a small set of words that you need to understand, and, as you have stated, a small subset of characters, an ASCII-range implementation is good enough to cover all cases.
And then again, you only have to do that conversion at the moment the user inputs their text... not a compile-time thing. So, no constexpr required.
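For illustration, a minimal sketch of that runtime-only approach, assuming an ASCII-compatible character set and a made-up command set. The names `Command`, `ascii_lower`, `ascii_iequals` and `parse_command` are hypothetical and not part of any proposal:

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical command set for an imaginary CLI tool.
enum class Command { Build, Clean, Test };

// ASCII-only lowercase; every other byte passes through unchanged.
inline char ascii_lower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Case-insensitive equality that is only meaningful for ASCII text.
inline bool ascii_iequals(std::string_view a, std::string_view b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (ascii_lower(a[i]) != ascii_lower(b[i])) return false;
    }
    return true;
}

// Matches a user-supplied argument against the known command names.
// The comparison happens when the argument arrives, i.e. at runtime.
inline std::optional<Command> parse_command(std::string_view arg) {
    struct Entry { std::string_view name; Command cmd; };
    static constexpr Entry table[] = {
        {"build", Command::Build},
        {"clean", Command::Clean},
        {"test", Command::Test},
    };
    for (const Entry& e : table) {
        if (ascii_iequals(arg, e.name)) return e.cmd;
    }
    return std::nullopt;
}
```

Calling `parse_command(argv[1])` would accept "build", "Build" or "BUILD" alike and reject anything not in the table.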

This is why I'm skeptical of the need for such a constexpr facility. Runtime conversion maybe, but constexpr? If there really is a use case for it, I have yet to find one.

________________________________
From: JJ Marr <jjmarr_at_[hidden]>
Sent: Wednesday, July 9, 2025 1:26:02 AM
To: std-proposals_at_[hidden] <std-proposals_at_[hidden]>
Cc: Tiago Freire <tmiguelf_at_[hidden]>
Subject: Re: [std-proposals] constexpr tolower, toupper, isalpha

> Yeah, but... Why do you want it?

I want the ability to do case-insensitive comparisons for the majority of human languages and letters without having to think too much about it, get approval to bring in an external library, or deal with implementation-defined behaviour.

> the reality is they don't implement the full Unicode spec, it's not truly "case-insensitive" it's more of a "whatever code ranges that system considers to be the same"

Part of the reason there's so many diverging implementations of "case insensitivity" is the lack of standardization. Here, the standard exists (Unicode). It is freely accessible, portable, and handles all of the edge cases we are discussing. We just need to implement it.

> You are likely dealing with user input and transforming text

I write command-line applications at my day job. Sometimes I want user input/arguments to be case-insensitive because it's less stuff to remember. It's not a word processor or anything complex.

Right now, I would roll my own toLowercase algorithm and add `32` to capital ASCII letters because my university taught that as acceptable.
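A sketch of what such a hand-rolled version tends to look like, assuming an ASCII execution character set. In ASCII the lowercase letters sit 32 (0x20) above their capitals, so lowering a capital adds 32. The name `ascii_to_lower` is made up for this example:

```cpp
// ASCII-only lowercase. 'A'..'Z' are 0x41..0x5A and 'a'..'z' are 0x61..0x7A,
// so the two ranges differ by exactly 32. Assumes an ASCII execution charset.
constexpr char ascii_to_lower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c + 32) : c;
}

static_assert(ascii_to_lower('A') == 'a');
static_assert(ascii_to_lower('Z') == 'z');
static_assert(ascii_to_lower('a') == 'a');  // already lowercase
static_assert(ascii_to_lower('[') == '[');  // 0x5B, just past 'Z', untouched
static_assert(ascii_to_lower('7') == '7');  // non-letters are untouched
```

Anything outside `A`..`Z` is returned unchanged, which is exactly why this approach says nothing useful about non-ASCII letters.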

If there was a C++ standardized simple case folding algorithm, I would use that instead.

This would give me safer comparisons for everything except ß and dotless I.

On Tue, Jul 8, 2025, 7:13 p.m. Tiago Freire via Std-Proposals <std-proposals_at_[hidden]> wrote:
Yeah, but... Why do you want it?
Most cases of localization in applications can be achieved with prepared phrase books, no case-conversion required.

Tempting as it may be to use it to deal with "case-insensitive" systems (be it domain names, databases, or file systems), the reality is they don't implement the full Unicode spec, it's not truly "case-insensitive" it's more of a "whatever code ranges that system considers to be the same" (which sometimes can only be tested at runtime, as with NTFS), and that solution is not going to be the same from system to system. So, using the standard to provide a "to_lower" to do that job instead of an "is_same_path" is misguided.
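For file systems specifically, one way to sketch the "is_same_path" idea is to ask the file system itself rather than lowercasing both strings. The helper name below is hypothetical; `std::filesystem::equivalent` is the standard C++17 call, and it only answers for paths that actually exist:

```cpp
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper named after the "is_same_path" idea above.
// Returns true only when both paths resolve to the same file system entity.
// Any error (for example a path that does not exist) is reported as false.
inline bool is_same_path(const fs::path& a, const fs::path& b) {
    std::error_code ec;
    const bool same = fs::equivalent(a, b, ec);
    return !ec && same;
}
```

Whether `is_same_path("Notes.txt", "notes.txt")` is true then depends on the volume the file lives on, which a portable `to_lower` cannot know.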

The real application I see for this would require:

1. dealing with words that aren't prepared (and are thus not likely to be known at compile time, making constexpr useless at best).
2. having the need to do the conversion.

You are likely dealing with user input and transforming text, so your application is probably a word processor like LibreOffice. Or you are probably dealing with finicky government databases that need to do case conversion of names for some reason. Or possibly you want to implement a search in a database.
All of those would generally have their own solutions that deal with the problem specifically for that application, and there will probably be a need for a localization expert on the team.

It feels like there should be a large use case for this, but when you look into it, it's really hard to find examples where you actually need it and where it is the right tool.

________________________________
From: Std-Proposals <std-proposals-bounces_at_[hidden]> on behalf of Jason McKesson via Std-Proposals <std-proposals_at_[hidden]>
Sent: Tuesday, July 8, 2025 7:20:41 PM
To: std-proposals_at_[hidden] <std-proposals_at_[hidden]>
Cc: Jason McKesson <jmckesson_at_[hidden]>
Subject: Re: [std-proposals] constexpr tolower, toupper, isalpha

On Tue, Jul 8, 2025 at 1:55 PM Thiago Macieira via Std-Proposals
<std-proposals_at_[hidden]> wrote:
>
> On Tuesday, 8 July 2025 09:49:33 Pacific Daylight Time JJ Marr via Std-
> Proposals wrote:
> > CaseFolding.txt is an 87 KB text file, most of which is comments. It's about
> > 1654 lines, so assuming two UTF-32 characters a line that's a little under
> > 13 KiB.
>
> They also appear in ranges with predictable changes, like adding or
> subtracting 0x20. That means the codegen can be significantly better than one
> 13 kB table.
>
> > Of course, this includes complex case mappings of one codepoint to multiple
> > codepoints. If we drop those, we can make the table a bit smaller.
>
> You can't drop them. Case-mapping is a string operation.

But *simple* case folding is not. The term "simple case folding" is a
specific Unicode-defined subset of general case folding that is
locale-independent and only provides 1:1 mapping of codepoints.
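To make the 1:1 property concrete, here is a small sketch of a range-based simple fold over code points, in the spirit of the "ranges with predictable changes" point quoted above. The table is a tiny hand-picked excerpt for illustration only, not the full CaseFolding.txt data, and the names `FoldRange`, `kFoldRanges` and `simple_fold` are made up:

```cpp
#include <cstdint>

// One contiguous run of code points that all fold by the same offset.
struct FoldRange {
    char32_t first;
    char32_t last;
    std::int32_t delta;
};

// Illustrative excerpt only; a real table would be generated from
// the Unicode CaseFolding.txt file.
constexpr FoldRange kFoldRanges[] = {
    {0x0041, 0x005A, 0x20},     // LATIN CAPITAL LETTER A..Z
    {0x00C0, 0x00D6, 0x20},     // Latin-1 capitals before MULTIPLICATION SIGN
    {0x00D8, 0x00DE, 0x20},     // Latin-1 capitals after MULTIPLICATION SIGN
    {0x0391, 0x03A1, 0x20},     // GREEK CAPITAL LETTER ALPHA..RHO
    {0x03A3, 0x03AB, 0x20},     // GREEK CAPITAL LETTER SIGMA..UPSILON WITH DIALYTIKA
    {0x0400, 0x040F, 0x50},     // CYRILLIC CAPITAL LETTER IE WITH GRAVE..DZHE
    {0x0410, 0x042F, 0x20},     // CYRILLIC CAPITAL LETTER A..YA
    {0x212A, 0x212A, -0x20BF},  // KELVIN SIGN folds to U+006B
};

// Maps one code point to exactly one code point; unknown input is unchanged.
constexpr char32_t simple_fold(char32_t cp) {
    for (const FoldRange& r : kFoldRanges) {
        if (cp >= r.first && cp <= r.last) {
            return static_cast<char32_t>(static_cast<std::int32_t>(cp) + r.delta);
        }
    }
    return cp;
}

static_assert(simple_fold(0x0041) == 0x0061);  // A -> a
static_assert(simple_fold(0x00C9) == 0x00E9);  // E WITH ACUTE
static_assert(simple_fold(0x0416) == 0x0436);  // CYRILLIC ZHE
static_assert(simple_fold(0x212A) == 0x006B);  // KELVIN SIGN -> k
static_assert(simple_fold(0x00D7) == 0x00D7);  // MULTIPLICATION SIGN unchanged
static_assert(simple_fold(0x00DF) == 0x00DF);  // SHARP S has no simple folding
```

The last assertion shows where simple folding stops: U+00DF only has a full (multi-code-point) folding, so a 1:1 table leaves it alone.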

Though not 1:1 mapping of any particular *encoding* of codepoints. A
UTF-8 string after simple case folding may not be the same encoded
length as one before.
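A quick way to see the length change, using two mappings from CaseFolding.txt as examples (the helper name `utf8_length` is made up for this sketch):

```cpp
#include <cstddef>

// Number of bytes UTF-8 needs to encode one Unicode scalar value.
constexpr std::size_t utf8_length(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// KELVIN SIGN (U+212A) folds to LATIN SMALL LETTER K (U+006B):
// the encoding shrinks from 3 bytes to 1.
static_assert(utf8_length(0x212A) == 3);
static_assert(utf8_length(0x006B) == 1);

// LATIN CAPITAL LETTER A WITH STROKE (U+023A) folds to U+2C65:
// the encoding grows from 2 bytes to 3.
static_assert(utf8_length(0x023A) == 2);
static_assert(utf8_length(0x2C65) == 3);
```

So an in-place fold of a UTF-8 buffer is not safe in general; the output may need a buffer of a different size.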
--
Std-Proposals mailing list
Std-Proposals_at_[hidden]
https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals

Received on 2025-07-09 14:30:41