std-proposals

Re: [std-proposals] constexpr tolower, toupper, isalpha

From: Julien Villemure-Fréchette <julien.villemure_at_[hidden]>
Date: Mon, 07 Jul 2025 13:45:20 -0400
+1

Even within Unicode, the "toupper/tolower" operations on characters are not universal: they are locale dependent. Although the classification of a character as uppercase or lowercase is universal (i.e., the character "X" is uppercase and "x" is lowercase), the transformation of a character to lowercase or to uppercase is locale dependent. As pointed out in the quoted message, it may also produce more than a single character, or may transform a sequence of characters differently from transforming each character independently.
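
To make the point concrete, here is a minimal sketch, not a proposal and nowhere near a full Unicode case mapping. The names `CaseLocale` and `to_upper_sketch` are hypothetical and exist only for this illustration; real code would rely on a library that implements the Unicode special-casing rules. It covers just three of the cases mentioned in this thread, and it shows why a signature of the form `char32_t toupper(char32_t)` cannot express them: the result depends on a locale and may be longer than the input.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical locale tag, for illustration only.
enum class CaseLocale { Root, Turkish };

// Hypothetical helper: uppercases a handful of characters to show that the
// result must be a string and must depend on a locale.
// Assumption: only ASCII a-z, U+0131, U+00DF and the Turkish 'i' are handled;
// every other code point is copied through unchanged.
inline std::u32string to_upper_sketch(std::u32string_view in, CaseLocale loc) {
    std::u32string out;
    for (char32_t ch : in) {
        if (ch == U'i' && loc == CaseLocale::Turkish) {
            out += U'\u0130';  // 'i' -> capital I with dot above in Turkish
        } else if (ch == U'\u0131') {
            out += U'I';       // dotless i -> plain capital I
        } else if (ch == U'\u00DF') {
            out += U"SS";      // sharp s expands to two characters
        } else if (ch >= U'a' && ch <= U'z') {
            out += static_cast<char32_t>(ch - U'a' + U'A');  // plain ASCII
        } else {
            out += ch;         // anything else is left untouched
        }
    }
    return out;
}
```

With this sketch, `to_upper_sketch(U"i", CaseLocale::Root)` and `to_upper_sketch(U"i", CaseLocale::Turkish)` give different results, and uppercasing the six-character string `U"stra\u00DFe"` yields the seven-character `U"STRASSE"`. Neither behaviour fits a function that maps one `char32_t` to one `char32_t` without a locale argument.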
 

On July 6, 2025 9:13:28 a.m. EDT, David Brown via Std-Proposals <std-proposals_at_[hidden]> wrote:
>
>
>On 06/07/2025 14:38, Frederick Virchanza Gotham via Std-Proposals wrote:
>> On Thu, Jul 3, 2025 at 10:01 AM Jonathan Wakely wrote:
>>>
>>> Meaning that this would fail:
>>>
>>> setlocale(LC_ALL, "de_DE.iso8859-1");
>>> char uuml = 0xFC; // lowercase u with umlaut
>>> char c = std::toupper(uuml);
>>> constexpr char cc = std::toupper(uuml);
>>> assert( c == cc );
>>
>>
>> With regard to Unicode:
>> * The Standard mentions Unicode and allows for Unicode escape
>> sequences (e.g. "\u00f1")
>> * The first 128 characters in Unicode (up to 0x7F) are ASCII
>> * The remaining characters up to 0xFF are ISO-8859-1 (aka Latin-1)
>>
>> Therefore it makes sense that the Standard would provide inline
>> constexpr functions like:
>>
>> namespace std {
>> namespace unicode {
>> inline constexpr char32_t tolower(char32_t) { . . . }
>> }
>> }
>
>You might think that. It seems reasonable at first sight to people used to nothing but a limited subset of the uses of Latin characters. The reality of capitalisation is very much more complicated, however. Issues include:
>
>The Latin letter "i" capitalises to "I" in most languages, but in Turkish it capitalises to "İ", while "I" is the capital form of "ı".
>
>In German, "ß" is sometimes capitalised to "ẞ", sometimes to the digraph (two letters) "SS".
>
>Many languages have letter combinations that are sometimes capitalised together, sometimes not. The Dutch name for the country "Iceland" is "IJsland", as the digraph "ij" is treated as a single letter for capitalisation purposes.
>
>Converting from capitals to lower case is typically even more complicated - the lowercase of the Greek capital "Σ" can be "ς" or "σ" depending on its position in the word. The handling of the iota subscript in case changes is a typographer's nightmare. Even in some names in plain ASCII, there are complications - try converting "MacBride" to capitals and back again.
>
>It would be very nice if it were possible to make universal capitalisation functions like the ones you suggest, but it is not possible.
>--
>Std-Proposals mailing list
>Std-Proposals_at_[hidden]
>https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals

Received on 2025-07-07 17:45:29