std-proposals

Re: [std-proposals] constexpr tolower, toupper, isalpha

From: Jason McKesson <jmckesson_at_[hidden]>
Date: Sun, 6 Jul 2025 10:30:05 -0400
On Sun, Jul 6, 2025 at 9:13 AM David Brown via Std-Proposals
<std-proposals_at_[hidden]> wrote:
> On 06/07/2025 14:38, Frederick Virchanza Gotham via Std-Proposals wrote:
> > On Thu, Jul 3, 2025 at 10:01 AM Jonathan Wakely wrote:
> >>
> >> Meaning that this would fail:
> >>
> >> setlocale(LC_ALL, "de_DE.iso8859-1");
> >> char uuml = 0xFC; // lowercase u with umlaut
> >> char c = std::toupper(uuml);
> >> constexpr char cc = std::toupper(uuml);
> >> assert( c == cc );
> >
> >
> > With regard to Unicode:
> > * The Standard mentions Unicode and allows for Unicode escape
> > sequences (e.g. "\u00f1")
> > * The first 128 characters in Unicode (up to 0x7F) are ASCII
> > * The remaining characters up to 0xFF are ISO-8859-1 (aka Latin-1)
> >
> > Therefore it makes sense that the Standard would provide inline
> > constexpr functions like:
> >
> > namespace std {
> > namespace unicode {
> > inline constexpr char32_t tolower(char32_t) { . . . }
> > }
> > }
>
> You might think that - it seems reasonable at first sight, for people
> used to nothing but a limited subset of Latin character uses. The
> reality of capitalisation is very much more complicated, however.
> Issues include:
>
> The Latin letter "i" capitalises to "I" in most languages - but in
> Turkish languages, it capitalises to "İ" while "I" is the capital form
> of "ı".
>
> In German languages, "ß" is sometimes capitalised to "ẞ", sometimes to
> the digraph (two letters) "SS".
>
> Many languages have letter combinations that are sometimes capitalised
> together, sometimes not. The Dutch name for the country "Iceland" is
> "IJsland", as the digraph "ij" is treated as a single letter for
> capitalisation purposes.
>
> Converting from capitals to lower case is typically even more
> complicated - the lowercase of the Greek capital "Σ" can be "ς" or "σ"
> depending on its position in the word. The handling of the iota
> subscript in case changes is a typographer's nightmare. Even in some
> names in plain ASCII, there are complications - try converting
> "MacBride" to capitals and back again.
>
> It would be very nice if it were possible to make universal
> capitalisation functions like the ones you suggest, but it is not.

There's also the fact that such rules are not codepoint-to-codepoint
translations. One codepoint does not necessarily capitalize or
uncapitalize to one codepoint. Sometimes it's a one:many translation.
Sometimes, it's many:one. Sometimes, it's many:many.

Codepoint-based `toupper` and `tolower` are simply not possible in
Unicode no matter how you slice it. You have to provide ranges of
characters, and these functions have to write to a range of
characters.
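
As a rough illustration of the range-in, range-out shape, here is a
minimal sketch. The name `upper_sketch` is hypothetical and is not a
proposed std API. It assumes the input is already a sequence of
char32_t code points, and it handles only ASCII a-z plus the single
one:many case "ß" -> "SS". A real implementation would need the full
Unicode case-mapping data and language-specific tailoring.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper, for illustration only (not a std API).
// Takes a range of code points and returns a new range, because one
// input code point may map to several output code points.
std::u32string upper_sketch(std::u32string_view in)
{
    std::u32string out;
    out.reserve(in.size());
    for (char32_t c : in) {
        if (c >= U'a' && c <= U'z') {
            // one:one mapping for ASCII lowercase letters
            out.push_back(static_cast<char32_t>(c - (U'a' - U'A')));
        } else if (c == U'\u00DF') {
            // one:many mapping: U+00DF ("ß") becomes the two letters "SS"
            out.append(U"SS");
        } else {
            // everything else is copied unchanged in this sketch
            out.push_back(c);
        }
    }
    return out;
}
```

The output can be longer than the input ("straße" is six code points,
"STRASSE" is seven), which a `char32_t tolower(char32_t)` signature
has no way to express.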

Received on 2025-07-06 14:30:19