Date: Sun, 6 Jul 2025 15:13:28 +0200
On 06/07/2025 14:38, Frederick Virchanza Gotham via Std-Proposals wrote:
> On Thu, Jul 3, 2025 at 10:01 AM Jonathan Wakely wrote:
>>
>> Meaning that this would fail:
>>
>> setlocale(LC_ALL, "de_DE.iso8859-1");
>> char uuml = 0xFC; // lowercase u with umlaut
>> char c = std::toupper(uuml);
>> constexpr char cc = std::toupper(uuml);
>> assert( c == cc );
>
>
> With regard to Unicode:
> * The Standard mentions Unicode and allows for Unicode escape
> sequences (e.g. "\u00f1")
> * The first 128 characters in Unicode (up to 0x7F) are ASCII
> * The remaining characters up to 0xFF are ISO-8859-1 (aka Latin-1)
>
> Therefore it makes sense that the Standard would provide inline
> constexpr functions like:
>
> namespace std {
> namespace unicode {
> inline constexpr char32_t tolower(char32_t) { . . . }
> }
> }
You might think so - it seems reasonable at first sight to anyone used
only to a limited subset of Latin-script usage. The reality of
capitalisation is very much more complicated, however. Issues include:

The Latin letter "i" capitalises to "I" in most languages - but in
Turkish and related Turkic languages, it capitalises to "İ", while "I"
is the capital form of "ı".

In German, "ß" is sometimes capitalised to "ẞ" and sometimes to the
two-letter sequence "SS".

Many languages have letter combinations that are sometimes capitalised
together, sometimes not. The Dutch name for the country "Iceland" is
"IJsland", as the digraph "ij" is treated as a single letter for
capitalisation purposes.

Converting from capitals to lower case is typically even more
complicated - the lowercase of the Greek capital "Σ" can be "ς" or "σ"
depending on its position in the word. The handling of the Greek iota
subscript in case changes is a typographer's nightmare. Even some
names in plain ASCII have complications - try converting "MacBride" to
capitals and back again.

It would be very nice if it were possible to write universal
capitalisation functions like the ones you suggest, but it is not.
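
To make three of the points above concrete, here is a minimal sketch in
C++17. Every name in it (simple_toupper, full_toupper, greek_tolower,
Lang, is_letter, self_check) is hypothetical and invented for this
illustration; none is an existing or proposed std facility, and the
tables cover only the handful of characters mentioned above. It shows
that a char32_t -> char32_t function has nowhere to put "SS", that the
Turkish rule needs a language parameter, and that lower-casing "Σ"
needs to look at neighbouring characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// All names here are hypothetical; none is an existing or proposed std API.

// One-to-one mapping with the signature shape from the quoted suggestion.
// Only ASCII and u-umlaut are handled, purely for illustration.
constexpr char32_t simple_toupper(char32_t c) {
    if (c >= U'a' && c <= U'z') return static_cast<char32_t>(c - (U'a' - U'A'));
    if (c == U'\u00FC') return U'\u00DC';  // u-umlaut -> U-umlaut
    return c;  // U+00DF (sharp s) comes back unchanged: "SS" cannot fit here
}
static_assert(simple_toupper(U'\u00FC') == U'\u00DC');
static_assert(simple_toupper(U'\u00DF') == U'\u00DF');

enum class Lang { Default, Turkish };

// String-to-string upper-casing: the result may be longer than the input,
// and the language has to be supplied by the caller.
inline std::u32string full_toupper(std::u32string_view s, Lang lang = Lang::Default) {
    std::u32string out;
    for (char32_t c : s) {
        if (c == U'\u00DF') {
            out += U"SS";      // one code point becomes two
        } else if (c == U'i' && lang == Lang::Turkish) {
            out += U'\u0130';  // capital I with dot above
        } else if (c == U'\u0131') {
            out += U'I';       // dotless i -> I
        } else {
            out += simple_toupper(c);
        }
    }
    return out;
}

// Crude letter test covering ASCII and basic Greek only (an assumption;
// real code would consult the Unicode character database).
constexpr bool is_letter(char32_t c) {
    return (c >= U'A' && c <= U'Z') || (c >= U'a' && c <= U'z') ||
           (c >= U'\u0391' && c <= U'\u03C9');
}

// Lower-casing for basic Greek capitals. Capital sigma becomes final sigma
// (U+03C2) only when it follows a letter and is not followed by one.
inline std::u32string greek_tolower(std::u32string_view s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char32_t c = s[i];
        if (c == U'\u03A3') {
            const bool after_letter = i > 0 && is_letter(s[i - 1]);
            const bool before_letter = i + 1 < s.size() && is_letter(s[i + 1]);
            out += (after_letter && !before_letter) ? U'\u03C2' : U'\u03C3';
        } else if (c >= U'\u0391' && c <= U'\u03A9') {
            out += static_cast<char32_t>(c + 0x20);
        } else {
            out += c;
        }
    }
    return out;
}

// Usage: the same input gives different results depending on language and
// on position within the word.
inline void self_check() {
    assert(full_toupper(U"stra\u00DFe") == U"STRASSE");
    assert(full_toupper(U"i") == U"I");
    assert(full_toupper(U"i", Lang::Turkish) == U"\u0130");
    assert(greek_tolower(U"\u039F\u0394\u039F\u03A3") == U"\u03BF\u03B4\u03BF\u03C2");
}
```

Note that only simple_toupper can sensibly be constexpr with that
signature, and it is exactly the one that gives up on "ß". The sketch
does not attempt "IJsland" or "MacBride" at all; those depend on
knowing the language and the word, not just the characters.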
Received on 2025-07-06 13:13:35