std-proposals

Re: [std-proposals] constexpr tolower, toupper, isalpha

From: Tiago Freire <tmiguelf_at_[hidden]>
Date: Sun, 6 Jul 2025 21:52:39 +0000
I have to agree with the general sentiment against doing it.
It's kind of complicated, and truly the best way to deal with capitalization is to "not deal with capitalization".

Applications where dealing with capitalization actually matters are extremely rare.

The case where it matters is localization, and not a whole lot of applications actually support it (at least beyond the realm of prepared statements).
And those that seriously do often have large teams, with a couple of them dealing specifically with this sort of problem, and they have solutions tailored to very specific problems rather than one solution that applies to absolutely everything.


________________________________
From: Std-Proposals <std-proposals-bounces_at_[hidden]> on behalf of JJ Marr via Std-Proposals <std-proposals_at_[hidden]>
Sent: Sunday, July 6, 2025 9:20:27 PM
To: std-proposals_at_[hidden]ocpp.org <std-proposals_at_[hidden]>
Cc: JJ Marr <jjmarr_at_[hidden]>
Subject: Re: [std-proposals] constexpr tolower, toupper, isalpha

> f) find some official standard how to convert between upper and lowercase letters/codepoints and follow it

The Unicode standard in 5.18.4[1] defines a "case folding" operation, a locale-independent (except optionally for Turkish) way to ignore case when comparing characters. We could easily define a:
```cpp
// Return type assumed to match the parameter type.
constexpr char8_t toSimpleCasefold(char8_t c);
```
overload defined to be whatever the Unicode standard specifies for the "simple case folding" operation. The current case normalization is defined in [2], and the Unicode standard defines a "strong normalization" policy, so already-assigned characters will never change normalization.[3] There is also a "complex case folding" operation which can cause strings to grow in length (e.g. ß -> ss) and some context-specific case mappings. This would be more challenging to make `constexpr`.
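
A minimal sketch of what such a function could look like, under some loud assumptions: it works on `char32_t` code points rather than `char8_t` (a single UTF-8 code unit cannot hold a non-ASCII code point), the names `FoldEntry`, `kFoldExcerpt` and `to_simple_casefold` are hypothetical, and the table is a four-entry excerpt of CaseFolding.txt rather than the full data. A real implementation would generate the table from the data file.

```cpp
#include <array>

// Hypothetical names; the table is only a small excerpt of CaseFolding.txt
// (status C and S entries), not the full Unicode data.
struct FoldEntry {
    char32_t from;
    char32_t to;
};

inline constexpr std::array<FoldEntry, 4> kFoldExcerpt{{
    {U'\u00DC', U'\u00FC'},  // U with diaeresis -> u with diaeresis
    {U'\u03A3', U'\u03C3'},  // capital sigma -> small sigma
    {U'\u03C2', U'\u03C3'},  // final small sigma -> small sigma
    {U'\u1E9E', U'\u00DF'},  // capital sharp s -> small sharp s (simple mapping)
}};

constexpr char32_t to_simple_casefold(char32_t c) noexcept {
    // ASCII capitals fold to their lowercase counterparts.
    if (c >= U'A' && c <= U'Z') {
        return static_cast<char32_t>(c + (U'a' - U'A'));
    }
    // Linear search is fine for a four-entry excerpt.
    for (const FoldEntry& e : kFoldExcerpt) {
        if (e.from == c) {
            return e.to;
        }
    }
    return c;  // everything else is left unchanged in this sketch
}

// Both forms of sigma fold to the same code point, so they compare equal.
static_assert(to_simple_casefold(U'\u03A3') == to_simple_casefold(U'\u03C2'));
// Simple folding never changes length: small sharp s stays as it is.
static_assert(to_simple_casefold(U'\u00DF') == U'\u00DF');
```

Because the mapping is one code point in, one code point out, nothing here needs allocation, which is what makes the simple variant an easy fit for `constexpr`.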

The Unicode standard also prescribes `toUppercase`, `toLowercase`, and `toTitlecase` functions. These are allowed to change based on locale differences (many are defined in the CLDR), but Unicode provides normative locale-independent case mappings in the Unicode character database.[4][5][6] An implementation of the normative simple case mappings could easily be `constexpr` as well. The special case mappings [5] would be more difficult for the same reason as complex case folding: they cause strings to change size.
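
To illustrate why the size change is the awkward part, here is a small sketch that only computes the length of the uppercased result. The function names are hypothetical, and the three expansions are an excerpt of the unconditional entries in SpecialCasing.txt, not the full list.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: how many code points one code point becomes when
// uppercased with the special (multi-character) mappings applied.
constexpr std::size_t full_upper_expansion(char32_t c) noexcept {
    switch (c) {
        case U'\u00DF': return 2;  // small sharp s -> "SS"
        case U'\uFB01': return 2;  // fi ligature   -> "FI"
        case U'\uFB03': return 3;  // ffi ligature  -> "FFI"
        default:        return 1;  // all other code points stay 1:1 here
    }
}

// Length, in code points, of the uppercased form of a UTF-32 string.
constexpr std::size_t full_upper_length(std::u32string_view s) noexcept {
    std::size_t n = 0;
    for (char32_t c : s) {
        n += full_upper_expansion(c);
    }
    return n;
}

// Six code points in, seven out.
static_assert(full_upper_length(U"stra\u00DFe") == 7);
```

The computation itself is trivially `constexpr`; the difficulty is that the output buffer size depends on the input contents, so an interface of the form "one character in, one character out" cannot express it.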

Unicode themselves maintain a C implementation of various operations on Unicode.[7] There's also a modern public domain C++ library[8] which provides `constexpr` versions of toLowercase, toUppercase, toTitlecase, etc for Unicode.[9]

I think it's a very good idea to standardize a modern C++ Unicode library that provides operations from the Unicode standard and would be happy to work on the problem, if others think it would be a value-add.

[1] https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-5/#G21790

[2] https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt

[3] https://www.unicode.org/policies/stability_policy.html

[4] https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-5/#G21180

[5] https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt

[6] https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt

[7] https://github.com/unicode-org/icu/tree/main/icu4c

[8] https://github.com/uni-algo/uni-algo

[9] https://github.com/uni-algo/uni-algo/blob/main/include/uni_algo/case.h

On Sun, Jul 6, 2025 at 10:59 AM Sebastian Wittmeier via Std-Proposals <std-proposals_at_[hidden]> wrote:

That could

a) be a reason not to provide such functions

but also a reason to provide them, because they are a basic building block and difficult to do for users; and we already have some Unicode and text functions in the standard library

b) be a reason to provide such functions on a subset (ASCII)

c) be a reason to provide such functions with basic rules for the mentioned examples to get always a n:1 relationship

d) be a reason to provide a highly configurable function

e) be a reason to provide the user with the tools to create their own functions with less effort

f) find some official standard how to convert between upper and lowercase letters/codepoints and follow it
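
As a rough sketch of what option b) might look like: the namespace `ascii` and the function names below are hypothetical, and the code assumes an ASCII-compatible execution character set in which `'A'..'Z'` and `'a'..'z'` are contiguous (not true on EBCDIC systems).

```cpp
// Hypothetical names; locale-independent and restricted to ASCII.
// Assumes the letters are contiguous in the execution character set.
namespace ascii {

constexpr bool is_upper(char c) noexcept { return c >= 'A' && c <= 'Z'; }
constexpr bool is_lower(char c) noexcept { return c >= 'a' && c <= 'z'; }
constexpr bool is_alpha(char c) noexcept { return is_upper(c) || is_lower(c); }

constexpr char to_lower(char c) noexcept {
    return is_upper(c) ? static_cast<char>(c - 'A' + 'a') : c;
}

constexpr char to_upper(char c) noexcept {
    return is_lower(c) ? static_cast<char>(c - 'a' + 'A') : c;
}

}  // namespace ascii

static_assert(ascii::to_upper('q') == 'Q');
static_assert(ascii::to_lower('Q') == 'q');
// Bytes outside ASCII pass through untouched, whatever the locale is.
static_assert(ascii::to_upper('\xFC') == '\xFC');
```

Unlike `std::toupper`, these take a plain `char` and never consult the global locale, so the compile-time and run-time results cannot disagree.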



-----Original Message-----
From: David Brown via Std-Proposals <std-proposals_at_[hidden]>
Sent: Sun 06.07.2025 15:13
Subject: Re: [std-proposals] constexpr tolower, toupper, isalpha
To: std-proposals_at_[hidden]cpp.org
CC: David Brown <david.brown_at_[hidden]>


On 06/07/2025 14:38, Frederick Virchanza Gotham via Std-Proposals wrote:
> On Thu, Jul 3, 2025 at 10:01 AM Jonathan Wakely wrote:
>>
>> Meaning that this would fail:
>>
>> setlocale(LC_ALL, "de_DE.iso8859-1");
>> char uuml = 0xFC; // lowercase u with umlaut
>> char c = std::toupper(uuml);
>> constexpr char cc = std::toupper(uuml);
>> assert( c == cc );
>
>
> With regard to Unicode:
> * The Standard mentions Unicode and allows for Unicode escape
> sequences (e.g. "\u00f1")
> * The first 128 characters in Unicode (up to 0x7F) are ASCII
> * The remaining characters up to 0xFF are ISO-8859-1 (aka Latin-1)
>
> Therefore it makes sense that the Standard would provide inline
> constexpr functions like:
>
> namespace std {
> namespace unicode {
> inline constexpr char32_t tolower(char32_t) { . . . }
> }
> }

You might think that - it seems reasonable at first sight, for people
used to nothing but a limited subset of Latin character uses. The
reality of capitalisation is very much more complicated, however.
Issues include :

The Latin letter "i" capitalises to "I" in most languages - but in
Turkish (and other Turkic languages), it capitalises to "İ" while "I"
is the capital form of "ı".

In German, "ß" is sometimes capitalised to "ẞ", sometimes to
the digraph (two letters) "SS".

Many languages have letter combinations that are sometimes capitalised
together, sometimes not. The Dutch name for the country "Iceland" is
"IJsland", as the digraph "ij" is treated as a single letter for
capitalisation purposes.

Converting from capitals to lower case is typically even more
complicated - the lowercase of the Greek capital "Σ" can be "ς" or "σ"
depending on its position in the word. The handling of the iota
subscript in case changes is a typographer's nightmare. Even in some
names in plain ASCII, there are complications - try converting
"MacBride" to capitals and back again.
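
The sigma case can be sketched in a few lines, which also shows why a `char32_t tolower(char32_t)` signature is not enough: the function needs to see the neighbouring characters. The names below are hypothetical, the letter test is a crude range check over the basic Greek block, and the "final" test is a simplification of the Final_Sigma condition in the Unicode standard (which also skips case-ignorable characters).

```cpp
#include <array>
#include <cstddef>

// Crude stand-in for a real "is a cased letter" test: basic Greek block only.
constexpr bool is_greek_letter(char32_t c) noexcept {
    return c >= U'\u0391' && c <= U'\u03C9';
}

// Lowercases basic Greek capitals; capital sigma becomes the final form
// when it follows a letter and is not followed by one (simplified rule).
template <std::size_t N>
constexpr std::array<char32_t, N> to_lower_greek(
    const std::array<char32_t, N>& in) noexcept {
    std::array<char32_t, N> out{};
    for (std::size_t i = 0; i < N; ++i) {
        const char32_t c = in[i];
        if (c == U'\u03A3') {
            const bool after_letter = i > 0 && is_greek_letter(in[i - 1]);
            const bool before_letter = i + 1 < N && is_greek_letter(in[i + 1]);
            out[i] = (after_letter && !before_letter) ? U'\u03C2' : U'\u03C3';
        } else if (c >= U'\u0391' && c <= U'\u03A9') {
            out[i] = static_cast<char32_t>(c + 0x20);  // capital -> small
        } else {
            out[i] = c;
        }
    }
    return out;
}

// Capital sigma, omicron, capital sigma: the same input letter lowercases
// to two different code points depending on where it stands.
inline constexpr std::array<char32_t, 3> kSigmaDemo{
    U'\u03A3', U'\u039F', U'\u03A3'};
inline constexpr auto kSigmaDemoLower = to_lower_greek(kSigmaDemo);
static_assert(kSigmaDemoLower[0] == U'\u03C3');  // non-final small sigma
static_assert(kSigmaDemoLower[2] == U'\u03C2');  // final small sigma
```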

It would be very nice if it were possible to make universal
capitalisation functions like the ones you suggest, but it is not possible.
--
Std-Proposals mailing list
Std-Proposals_at_[hidden]
https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals

Received on 2025-07-06 21:52:46