
std-proposals


Re: [std-proposals] constexpr tolower, toupper, isalpha

From: Sebastian Wittmeier <wittmeier_at_[hidden]>
Date: Sun, 6 Jul 2025 16:50:32 +0200
That could

  a) be a reason not to provide such functions, but also a reason to provide them, because they are a basic building block and difficult for users to get right; and we already have some Unicode and text functions in the standard library
  b) be a reason to provide such functions on a subset (ASCII)
  c) be a reason to provide such functions with basic rules for the mentioned examples, so that there is always an n:1 relationship
  d) be a reason to provide a highly configurable function
  e) be a reason to provide the user with the tools to create their own functions with less effort
  f) be a reason to find some official standard for converting between uppercase and lowercase letters/code points, and follow it

-----Original Message-----
From: David Brown via Std-Proposals <std-proposals_at_[hidden]>
Sent: Sun 06.07.2025 15:13
Subject: Re: [std-proposals] constexpr tolower, toupper, isalpha
To: std-proposals_at_[hidden]
CC: David Brown <david.brown_at_[hidden]>

On 06/07/2025 14:38, Frederick Virchanza Gotham via Std-Proposals wrote:
> On Thu, Jul 3, 2025 at 10:01 AM Jonathan Wakely wrote:
>>
>> Meaning that this would fail:
>>
>> setlocale(LC_ALL, "de_DE.iso8859-1");
>> char uuml = 0xFC; // lowercase u with umlaut
>> char c = std::toupper(uuml);
>> constexpr char cc = std::toupper(uuml);
>> assert( c == cc );
>
>
> With regard to Unicode:
> * The Standard mentions Unicode and allows for Unicode escape
> sequences (e.g. "\u00f1")
> * The first 128 characters in Unicode (up to 0x7F) are ASCII
> * The remaining characters up to 0xFF are ISO-8859-1 (aka Latin-1)
>
> Therefore it makes sense that the Standard would provide inline
> constexpr functions like:
>
>      namespace std {
>          namespace unicode {
>              inline constexpr char32_t tolower(char32_t) {    . . .    }
>          }
>      }

You might think that - it seems reasonable at first sight, for people used to nothing but a limited subset of Latin character uses.
The reality of capitalisation is very much more complicated, however. Issues include:

* The Latin letter "i" capitalises to "I" in most languages - but in Turkish languages, it capitalises to "İ" while "I" is the capital form of "ı".
* In German languages, "ß" is sometimes capitalised to "ẞ", sometimes to the digraph (two letters) "SS".
* Many languages have letter combinations that are sometimes capitalised together, sometimes not. The Dutch name for the country "Iceland" is "IJsland", as the digraph "ij" is treated as a single letter for capitalisation purposes.
* Converting from capitals to lower case is typically even more complicated - the lowercase of the Greek capital "Σ" can be "ς" or "σ" depending on its position in the word.
* The handling of the iota subscript in case changes is a typographer's nightmare.
* Even in some names in plain ASCII, there are complications - try converting "MacBride" to capitals and back again.

It would be very nice if it were possible to make universal capitalisation functions like the ones you suggest, but it is not possible.

--
Std-Proposals mailing list
Std-Proposals_at_[hidden]
https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
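
For illustration only, option b) from the reply at the top (an ASCII-only subset) could look roughly like the sketch below. The namespace `ascii` and the names `is_alpha`, `to_lower` and `to_upper` are hypothetical, not anything proposed in this thread. The sketch assumes C++17 or later and an ASCII-compatible execution character set in which 'a'..'z' and 'A'..'Z' are contiguous. Because it never consults the C locale, it gives the same answer at compile time and at run time, at the price of leaving every non-ASCII value untouched.

```cpp
// Hypothetical ASCII-only helpers; names and namespace are illustrative.
// Assumes an ASCII-compatible execution character set (contiguous letters).
namespace ascii {

// True only for the 52 ASCII letters; the C locale is never consulted.
constexpr bool is_alpha(char c) noexcept {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

// Maps 'A'..'Z' to 'a'..'z'; every other value is returned unchanged.
constexpr char to_lower(char c) noexcept {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Maps 'a'..'z' to 'A'..'Z'; every other value is returned unchanged.
constexpr char to_upper(char c) noexcept {
    return (c >= 'a' && c <= 'z') ? static_cast<char>(c - 'a' + 'A') : c;
}

} // namespace ascii

// Usable in constant expressions.
static_assert(ascii::to_upper('a') == 'A');
static_assert(ascii::to_lower('Z') == 'z');
static_assert(ascii::to_upper('1') == '1');
static_assert(!ascii::is_alpha('@'));

// 0xFC is u-umlaut in ISO 8859-1. It is outside ASCII, so it is left alone,
// whatever locale happens to be set at run time.
static_assert(ascii::to_upper(static_cast<char>(0xFC)) == static_cast<char>(0xFC));
```

Note that a per-character signature like this cannot express any of the cases listed above: "ß" to "SS" changes the length of the string, and the choice between "ς" and "σ" depends on neighbouring characters.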

Received on 2025-07-06 14:59:41