Re: [std-proposals] constexpr tolower, toupper, isalpha

From: JJ Marr <jjmarr_at_[hidden]>
Date: Tue, 8 Jul 2025 20:25:48 -0400
> Yeah, but... Why do you want it?

I want the ability to do case-insensitive comparisons for the majority of
human languages and letters without having to think too much about it, get
approval to bring in an external library, or deal with
implementation-defined behaviour.

> the reality is they don't implement the full Unicode spec, it's not truly
"case-insensitive" it's more of a "whatever code ranges that system
considers to be the same"

Part of the reason there are so many diverging implementations of "case
insensitivity" is the lack of standardization. Here, the standard exists
(Unicode). It is freely accessible, portable, and handles all of the edge
cases we are discussing. We just need to implement it.

> You are likely dealing with user input and transforming text

I write command-line applications at my day job. Sometimes I want user
input/arguments to be case-insensitive because it's less stuff to remember.
It's not a word processor or anything complex.

Right now, I would roll my own toLowercase algorithm and add `32` to
capital ASCII letters because my university taught that as acceptable.
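
A minimal sketch of that hand-rolled approach, for illustration only. The
names `ascii_to_lower` and `ascii_iequals` are made up here, and the code
deliberately handles nothing outside ASCII 'A'..'Z':

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: maps ASCII 'A'..'Z' to 'a'..'z' by adding 32
// ('a' - 'A'); every other value passes through unchanged.
constexpr char ascii_to_lower(char c) noexcept {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c + ('a' - 'A')) : c;
}

// Hypothetical helper: ASCII-only case-insensitive equality, e.g. for
// matching command-line arguments.
constexpr bool ascii_iequals(std::string_view a, std::string_view b) noexcept {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (ascii_to_lower(a[i]) != ascii_to_lower(b[i])) return false;
    }
    return true;
}

static_assert(ascii_iequals("--Verbose", "--VERBOSE"));
```

Anything outside ASCII compares byte-for-byte, which is the gap a
standardized folding function would close.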

If there was a C++ standardized simple case folding algorithm, I would use
that instead.

This would give me safer comparisons for everything except ß and dotless I.
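
To make that concrete, here is a rough sketch of what a constexpr simple
case fold could look like. This is not a proposed API: the names
`simple_fold_sketch` and `simple_fold_equal` are hypothetical, and the
mappings are a small hand-picked subset of CaseFolding.txt (status C and S
entries only), not the full table.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical sketch: simple (1:1) case folding for a few ranges only.
constexpr char32_t simple_fold_sketch(char32_t c) noexcept {
    // Ranges where the folded form is the code point plus 0x20.
    if ((c >= U'A' && c <= U'Z') ||                     // Basic Latin
        (c >= 0x00C0 && c <= 0x00DE && c != 0x00D7) ||  // Latin-1, skipping U+00D7
        (c >= 0x0391 && c <= 0x03AB && c != 0x03A2) ||  // Greek, U+03A2 is unassigned
        (c >= 0x0410 && c <= 0x042F)) {                 // Cyrillic
        return c + 0x20;
    }
    // A few one-off mappings that do not follow a range pattern.
    switch (c) {
        case 0x017F: return 0x0073;  // long s -> s
        case 0x03C2: return 0x03C3;  // final sigma -> sigma
        case 0x1E9E: return 0x00DF;  // capital sharp s -> sharp s
        case 0x212A: return 0x006B;  // Kelvin sign -> k
        default:     return c;       // includes U+00DF and U+0131, left as-is
    }
}

// Simple folding is 1:1 on code points, so the UTF-32 lengths must match.
constexpr bool simple_fold_equal(std::u32string_view a,
                                 std::u32string_view b) noexcept {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (simple_fold_sketch(a[i]) != simple_fold_sketch(b[i])) return false;
    }
    return true;
}

// Greek "ΣΟΦΙΑ" against "σοφια", written with escapes.
static_assert(simple_fold_equal(U"\u03A3\u039F\u03A6\u0399\u0391",
                                U"\u03C3\u03BF\u03C6\u03B9\u03B1"));
```

Under this sketch "STRASSE" and "straße" still compare unequal, and dotless
ı does not fold to i, which are the two exceptions I mentioned.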

On Tue, Jul 8, 2025, 7:13 p.m. Tiago Freire via Std-Proposals <
std-proposals_at_[hidden]> wrote:

> Yeah, but... Why do you want it?
> Most cases of localization in applications can be achieved with prepared
> phrase books, no case-conversion required.
>
> Tempting as it may be to use it to deal with "case-insensitive" systems
> (be it domain names, databases, or file systems),
> the reality is they don't implement the full Unicode spec, it's not truly
> "case-insensitive" it's more of a "whatever code ranges that system
> considers to be the same" (and sometimes that can only be tested at
> runtime, as with NTFS), and that solution is not going to be the same
> across systems. So, using the standard to provide a "to_lower" to do that
> job instead of an "is_same_path" is misguided.
>
> The real application I see for this would require:
>
> 1. dealing with words that aren't prepared (and are thus not likely to
> be known at compile time, making constexpr useless at best).
> 2. having the need to do the conversion.
>
> You are likely dealing with user input and transforming text, your
> application is probably a word processor like LibreOffice. Or you are
> probably dealing with finicky government databases that need to do case
> conversion of names for some reason. Or possibly you want to implement a
> search in a database.
> All of those would generally have their own solutions and deal with the
> problem specifically for that application, and there will probably be a
> need for a localization expert in the team.
>
> It feels like there should be a large use case for this, but when you look
> into it, it's really hard to find examples where you actually need it and
> it is the right tool.
>
>
>
>
> ------------------------------
> *From:* Std-Proposals <std-proposals-bounces_at_[hidden]> on behalf
> of Jason McKesson via Std-Proposals <std-proposals_at_[hidden]>
> *Sent:* Tuesday, July 8, 2025 7:20:41 PM
> *To:* std-proposals_at_[hidden] <std-proposals_at_[hidden]>
> *Cc:* Jason McKesson <jmckesson_at_[hidden]>
> *Subject:* Re: [std-proposals] constexpr tolower, toupper, isalpha
>
> On Tue, Jul 8, 2025 at 1:55 PM Thiago Macieira via Std-Proposals
> <std-proposals_at_[hidden]> wrote:
> >
> > On Tuesday, 8 July 2025 09:49:33 Pacific Daylight Time JJ Marr via
> > Std-Proposals wrote:
> > > CaseFolding.txt is an 87 KB text file, most of which is comments. It's
> > > about 1654 lines, so assuming two UTF-32 characters a line that's a
> > > little under 13 KiB.
> >
> > They also appear in ranges with predictable changes, like adding or
> > subtracting 0x20. That means the codegen can be significantly better
> > than one 13 kB table.
> >
> > > Of course, this includes complex case mappings of one codepoint to
> > > multiple codepoints. If we drop those, we can make the table a bit
> > > smaller.
> >
> > You can't drop them. Case-mapping is a string operation.
>
> But *simple* case folding is not. The term "simple case folding" is a
> specific Unicode-defined subset of general case folding that is
> locale-independent and only provides 1:1 mapping of codepoints.
>
> Though not 1:1 mapping of any particular *encoding* of codepoints. A
> UTF-8 string after simple case folding may not be the same encoded
> length as one before.
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals

Received on 2025-07-09 00:26:01