std-proposals

Re: [std-proposals] constexpr tolower, toupper, isalpha

From: JJ Marr <jjmarr_at_[hidden]>
Date: Sun, 6 Jul 2025 18:41:29 -0400
> I have to agree with the general sentiment for not doing it.
> It's kind of complicated, and truly the best way to deal with
> capitalization is to "not deal with capitalization".

We wouldn't have to do it. We can just implement the reasonably safe
defaults in the Unicode standard. If you want your characters to not have
capitalization, you can call `std::toCasefold` before doing comparisons,
which implements simple case folding from the Unicode standard. Mission
accomplished for most uses of `std::toLower`.

Implementation is trivial because Unicode provides a `CaseFolding.txt` file
which maps every code point that has a case folding to its "case folded"
form (all other code points fold to themselves). Comparing two code points
after "case folding" is the way Unicode recommends doing case-insensitive
comparisons, because it avoids cultural ambiguity about what is considered
to be "lowercase" or "uppercase".
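
As a rough sketch of what that could look like: the names `FoldEntry`,
`kFoldExcerpt`, `to_simple_casefold` and `equal_ignoring_case` below are
hypothetical, and the table is only a small excerpt of the C and S entries in
`CaseFolding.txt`, so treat this as an illustration rather than a proposed
interface.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical names; nothing here is an existing or proposed std:: API.
struct FoldEntry {
    char32_t from;
    char32_t to;
};

// A few C (common) and S (simple) entries taken from CaseFolding.txt.
// A real table would be generated from the whole file.
constexpr FoldEntry kFoldExcerpt[] = {
    {0x00C4, 0x00E4},  // Ä -> ä
    {0x03A3, 0x03C3},  // Σ -> σ
    {0x03C2, 0x03C3},  // ς -> σ (final sigma folds to the same code point)
    {0x1E9E, 0x00DF},  // ẞ -> ß (status S, so the result stays one code point)
    {0x212A, 0x006B},  // KELVIN SIGN -> k
};

// Simple case folding: always one code point in, one code point out.
constexpr char32_t to_simple_casefold(char32_t c) {
    if (c >= U'A' && c <= U'Z') {
        return static_cast<char32_t>(c + (U'a' - U'A'));  // ASCII fast path
    }
    for (const FoldEntry& e : kFoldExcerpt) {
        if (e.from == c) return e.to;
    }
    return c;  // code points without an entry fold to themselves
}

// Case-insensitive comparison by folding both sides code point by code point.
constexpr bool equal_ignoring_case(std::u32string_view a,
                                   std::u32string_view b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (to_simple_casefold(a[i]) != to_simple_casefold(b[i])) return false;
    }
    return true;
}

// Everything above is usable at compile time.
static_assert(to_simple_casefold(U'A') == U'a');
static_assert(equal_ignoring_case(U"\u03A3\u03C2", U"\u03C3\u03C3"));
```

The linear scan is only there to keep the sketch short; a generated table
would presumably be sorted and searched, or split into ranges.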

> Codepoint-based `toupper` and `tolower` are simply not possible in
> Unicode no matter how you slice it. You have to provide ranges of
> characters, and these functions have to write to a range of
> characters.

"Simple case folding" is for the scenario when we must avoid a "1-to-many"
mapping in a given application.
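
To make the difference concrete, here is a minimal sketch that models only
the `CaseFolding.txt` entries for U+00DF and U+1E9E. The function names
`simple_fold` and `full_fold` are hypothetical. The simple version can work
on a single code point, while the 1-to-many version has to write to a range.

```cpp
#include <string>
#include <string_view>

// Hypothetical helpers; only the entries for U+00DF (ß) and U+1E9E (ẞ)
// are modelled here.

// 1-to-1: "1E9E; S; 00DF". U+00DF has no simple mapping and stays as is.
constexpr char32_t simple_fold(char32_t c) {
    return c == 0x1E9E ? char32_t{0x00DF} : c;
}

// 1-to-many: "00DF; F; 0073 0073" and "1E9E; F; 0073 0073".
// The output can be longer than the input, so it needs a growable range.
std::u32string full_fold(std::u32string_view s) {
    std::u32string out;
    for (char32_t c : s) {
        if (c == 0x00DF || c == 0x1E9E) {
            out += U"ss";
        } else {
            out += c;
        }
    }
    return out;
}
```

With these, `simple_fold` keeps "straße" at six code points, while
`full_fold` turns it into the seven code points of "strasse".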

On Sun, Jul 6, 2025 at 5:52 PM Tiago Freire <tmiguelf_at_[hidden]> wrote:

> I have to agree with the general sentiment for not doing it.
> It's kind of complicated, and truly the best way to deal with
> capitalization is to "not deal with capitalization".
>
> Applications where dealing with capitalization is actually important are
> extremely rare.
>
> The main case is localization, and not a whole lot of applications actually
> support it (at least beyond the realm of prepared statements).
> And those that seriously do often have large teams, with a couple of them
> dealing specifically with this sort of problem, and they have solutions
> tailored to very specific problems, rather than one solution that applies
> to absolutely everything.
>
>
> ------------------------------
> *From:* Std-Proposals <std-proposals-bounces_at_[hidden]> on behalf
> of JJ Marr via Std-Proposals <std-proposals_at_[hidden]>
> *Sent:* Sunday, July 6, 2025 9:20:27 PM
> *To:* std-proposals_at_[hidden] <std-proposals_at_[hidden]>
> *Cc:* JJ Marr <jjmarr_at_[hidden]>
> *Subject:* Re: [std-proposals] constexpr tolower, toupper, isalpha
>
> > f) find some official standard how to convert between upper and
> > lowercase letters/codepoints and follow it
>
> The Unicode standard in 5.18.4[1] defines a "case folding" operation which
> is a locale-independent (except optionally for Turkish) way to ignore cases
> meant for doing character comparisons. We could easily define a:
> ```cpp
> // Declaration only; the return type is assumed to match the parameter.
> constexpr char8_t toSimpleCasefold(char8_t c);
> ```
> overload defined to be whatever the Unicode standard specifies for the
> "simple case folding" operation. The current case normalization is defined
> in [2], and the Unicode standard defines a "strong normalization" policy,
> so already-assigned characters will never change normalization.[3] There is
> also a "complex case folding" operation which can cause strings to grow in
> length (e.g. ß -> ss) and some context-specific case mappings. This would
> be more challenging to make `constexpr`.
>
> The Unicode standard also prescribes `toUppercase`, `toLowercase`, and
> `toTitlecase` functions. These are allowed to change based on local
> differences (many are defined in the CLDR), but Unicode provides normative
> locale-independent case mappings in the Unicode character
> database.[4][5][6] An implementation of the normative simple case mappings
> could easily be `constexpr` as well. The special case mappings [5] would be
> more difficult for the same reasons as complex case folding in that it
> causes strings to change size.
>
> Unicode themselves maintain a C implementation of various operations on
> Unicode.[7] There's also a modern public domain C++ library[8] which
> provides `constexpr` versions of toLowercase, toUppercase, toTitlecase, etc
> for Unicode.[9]
>
> I think it's a very good idea to standardize a modern C++ Unicode library
> that provides operations from the Unicode standard and would be happy to
> work on the problem, if others think it would be a value-add.
>
> [1]
> https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-5/#G21790
>
> [2] https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt
>
> [3] https://www.unicode.org/policies/stability_policy.html
>
> [4]
> https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-5/#G21180
>
> [5] https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt
>
> [6] https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
>
> [7] https://github.com/unicode-org/icu/tree/main/icu4c
>
> [8] https://github.com/uni-algo/uni-algo
>
> [9] https://github.com/uni-algo/uni-algo/blob/main/include/uni_algo/case.h
>
> On Sun, Jul 6, 2025 at 10:59 AM Sebastian Wittmeier via Std-Proposals <
> std-proposals_at_[hidden]> wrote:
>
> That could
>
> a) be a reason not to provide such functions, but also a reason to provide
> them, because they are a basic building block and difficult to do for
> users; and we already have some Unicode and text functions in the standard
> library
>
> b) be a reason to provide such functions on a subset (ASCII)
>
> c) be a reason to provide such functions with basic rules for the
> mentioned examples to get always a n:1 relationship
>
> d) be a reason to provide a highly configurable function
>
> e) be a reason to provide the user with the tools to create their own
> functions with less effort
>
> f) find some official standard how to convert between upper and lowercase
> letters/codepoints and follow it
>
> -----Original Message-----
> *From:* David Brown via Std-Proposals <std-proposals_at_[hidden]>
> *Sent:* Sun 06.07.2025 15:13
> *Subject:* Re: [std-proposals] constexpr tolower, toupper, isalpha
> *To:* std-proposals_at_[hidden];
> *CC:* David Brown <david.brown_at_[hidden]>;
>
>
> On 06/07/2025 14:38, Frederick Virchanza Gotham via Std-Proposals wrote:
> > On Thu, Jul 3, 2025 at 10:01 AM Jonathan Wakely wrote:
> >>
> >> Meaning that this would fail:
> >>
> >> setlocale(LC_ALL, "de_DE.iso8859-1");
> >> char uuml = 0xFC; // lowercase u with umlaut
> >> char c = std::toupper(uuml);
> >> constexpr char cc = std::toupper(uuml);
> >> assert( c == cc );
> >
> >
> > With regard to Unicode:
> > * The Standard mentions Unicode and allows for Unicode escape
> > sequences (e.g. "\u00f1")
> > * The first 128 characters in Unicode (up to 0x7F) are ASCII
> > * The remaining characters up to 0xFF are ISO-8859-1 (aka Latin-1)
> >
> > Therefore it makes sense that the Standard would provide inline
> > constexpr functions like:
> >
> > namespace std {
> > namespace unicode {
> > inline constexpr char32_t tolower(char32_t) { . . . }
> > }
> > }
>
> You might think that - it seems reasonable at first sight, for people
> used to nothing but a limited subset of Latin character uses. The
> reality of capitalisation is very much more complicated, however.
> Issues include :
>
> The Latin letter "i" capitalises to "I" in most languages - but in
> Turkish languages, it capitalises to "İ" while "I" is the capital form
> of "ı".
>
> In German languages, "ß" is sometimes capitalised to "ẞ", sometimes to
> the digraph (two letters) "SS".
>
> Many languages have letter combinations that are sometimes capitalised
> together, sometimes not. The Dutch name for the country "Iceland" is
> "IJsland", as the digraph "ij" is treated as a single letter for
> capitalisation purposes.
>
> Converting from capitals to lower case is typically even more
> complicated - the lowercase of the Greek capital "Σ" can be "ς" or "σ"
> depending on its position in the word. The handling of the iota
> subscript in case changes is a typographer's nightmare. Even in some
> names in plain ASCII, there are complications - try converting
> "MacBride" to capitals and back again.
>
> It would be very nice if it were possible to make universal
> capitalisation functions like the ones you suggest, but it is not possible.
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals

Received on 2025-07-06 22:41:46