Date: Sun, 6 Jul 2025 16:20:04 -0400
> f) find some official standard how to convert between upper and lowercase
> letters/codepoints and follow it
Section 5.18.4 of the Unicode standard[1] defines a "case folding" operation,
a locale-independent (except optionally for Turkic languages) way to ignore
case when comparing characters. We could easily define a:
```cpp
// Return type added; the body is left unspecified here.
constexpr char8_t toSimpleCasefold(char8_t c);
```
overload defined to be whatever the Unicode standard specifies for the
"simple case folding" operation. The current case folding data is defined
in [2], and the Unicode stability policy includes case folding stability
guarantees, so the folding of already-assigned characters is meant to stay
fixed across versions.[3] There is also a "full case folding" operation (the
"complex" counterpart), which can cause strings to grow in length (e.g.
ß -> ss), and there are some context-specific case mappings. These would be
more challenging to make `constexpr`.
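To make the simple case concrete, here is a rough sketch of what such a function could look like. Everything in it is my own illustration, not an existing or proposed API: the `FoldEntry` type and the `kSimpleFoldExcerpt` table are hypothetical, the table is a tiny hand-copied excerpt of status C and S entries from CaseFolding.txt rather than the full generated data, and I use `char32_t` (a code point) instead of `char8_t` (a code unit) so that non-ASCII characters can be represented.

```cpp
#include <array>
#include <cstddef>

// Hypothetical entry type: one "from -> to" line of CaseFolding.txt.
struct FoldEntry {
    char32_t from;
    char32_t to;
};

// Illustrative excerpt only, sorted by `from`. A real table would be
// generated from the status C and S lines of the full data file.
constexpr std::array<FoldEntry, 5> kSimpleFoldExcerpt{{
    {U'\u00C0', U'\u00E0'},  // A with grave    -> a with grave (C)
    {U'\u00DC', U'\u00FC'},  // U with diaeresis -> u with diaeresis (C)
    {U'\u03A3', U'\u03C3'},  // capital sigma   -> small sigma (C)
    {U'\u03C2', U'\u03C3'},  // final sigma     -> small sigma (C)
    {U'\u1E9E', U'\u00DF'},  // capital sharp s -> small sharp s (S)
}};

constexpr char32_t toSimpleCasefold(char32_t c) {
    // ASCII fast path: A-Z fold to a-z.
    if (c >= U'A' && c <= U'Z') {
        return static_cast<char32_t>(c + (U'a' - U'A'));
    }
    // Binary search (lower bound) over the sorted excerpt.
    std::size_t lo = 0;
    std::size_t hi = kSimpleFoldExcerpt.size();
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (kSimpleFoldExcerpt[mid].from < c) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    if (lo < kSimpleFoldExcerpt.size() && kSimpleFoldExcerpt[lo].from == c) {
        return kSimpleFoldExcerpt[lo].to;
    }
    // Characters without a simple folding map to themselves.
    return c;
}

static_assert(toSimpleCasefold(U'Q') == U'q', "ASCII folds to lowercase");
static_assert(toSimpleCasefold(U'\u03A3') == U'\u03C3', "sigma folds");
static_assert(toSimpleCasefold(U'\u00DF') == U'\u00DF',
              "small sharp s has no simple folding");
```

The result is always exactly one code point, which is why this variant is straightforward to evaluate at compile time.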
The Unicode standard also prescribes `toUppercase`, `toLowercase`, and
`toTitlecase` functions. These are allowed to vary by locale (many
locale-specific tailorings are defined in the CLDR), but Unicode provides
normative locale-independent case mappings in the Unicode Character
Database.[4][5][6] An implementation of the normative simple case mappings
could easily be `constexpr` as well. The special case mappings [5] would be
more difficult for the same reason as full case folding: they can change the
length of a string.
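One possible way around the size problem, sketched below under my own assumptions, is to return a small fixed-capacity result per code point, since the full foldings listed in CaseFolding.txt are at most three code points long. The `FoldedSequence` type and both function names are hypothetical, and the `switch` covers only a handful of status F entries for illustration.

```cpp
#include <cstddef>

// Hypothetical fixed-capacity result: up to three code points plus a count.
struct FoldedSequence {
    char32_t units[3];
    std::size_t size;
};

constexpr FoldedSequence toFullCasefold(char32_t c) {
    switch (c) {
        case U'\u00DF': return {{U's', U's', 0}, 2};       // small sharp s -> ss
        case U'\u1E9E': return {{U's', U's', 0}, 2};       // capital sharp s -> ss
        case U'\uFB03': return {{U'f', U'f', U'i'}, 3};    // ffi ligature -> ffi
        case U'\u0130': return {{U'i', U'\u0307', 0}, 2};  // I with dot above -> i + combining dot
        default: break;
    }
    // ASCII fast path: A-Z fold to a-z.
    if (c >= U'A' && c <= U'Z') {
        return {{static_cast<char32_t>(c + (U'a' - U'A')), 0, 0}, 1};
    }
    // Everything else in this sketch folds to itself.
    return {{c, 0, 0}, 1};
}

// Number of code points the folded form of s[0..n) would occupy.
constexpr std::size_t foldedLength(const char32_t* s, std::size_t n) {
    std::size_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += toFullCasefold(s[i]).size;
    }
    return total;
}

// "Stra<sharp s>e" has 6 code points; its full folding "strasse" has 7.
static_assert(foldedLength(U"Stra\u00DFe", 6) == 7, "string grows by one");
```

A caller can compute the folded length first and then fill a buffer of that size, so the growth is manageable in `constexpr` code, but it does complicate the interface compared with the one-to-one simple mappings.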
The Unicode Consortium itself maintains a C/C++ implementation of various
Unicode operations (ICU4C).[7] There's also a modern public domain C++
library[8] which provides `constexpr` versions of toLowercase, toUppercase,
toTitlecase, etc. for Unicode.[9]
I think it's a very good idea to standardize a modern C++ Unicode library
that provides operations from the Unicode standard and would be happy to
work on the problem, if others think it would be a value-add.
[1] https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-5/#G21790
[2] https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt
[3] https://www.unicode.org/policies/stability_policy.html
[4] https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-5/#G21180
[5] https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt
[6] https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
[7] https://github.com/unicode-org/icu/tree/main/icu4c
[8] https://github.com/uni-algo/uni-algo
[9] https://github.com/uni-algo/uni-algo/blob/main/include/uni_algo/case.h
On Sun, Jul 6, 2025 at 10:59 AM Sebastian Wittmeier via Std-Proposals <
std-proposals_at_[hidden]> wrote:
> That could
>
> a) be a reason not to provide such functions
>
> but also a reason to provide them, because they are a basic building block
> and difficult to do for users; and we already have some Unicode and text
> functions in the standard library
>
> b) be a reason to provide such functions on a subset (ASCII)
>
> c) be a reason to provide such functions with basic rules for the
> mentioned examples to get always a n:1 relationship
>
> d) be a reason to provide a highly configurable function
>
> e) be a reason to provide the user with the tools to create their own
> functions with less effort
>
> f) find some official standard how to convert between upper and lowercase
> letters/codepoints and follow it
>
> -----Original Message-----
> *From:* David Brown via Std-Proposals <std-proposals_at_[hidden]>
> *Sent:* Sun 06.07.2025 15:13
> *Subject:* Re: [std-proposals] constexpr tolower, toupper, isalpha
> *To:* std-proposals_at_[hidden];
> *CC:* David Brown <david.brown_at_[hidden]>;
>
>
> On 06/07/2025 14:38, Frederick Virchanza Gotham via Std-Proposals wrote:
> > On Thu, Jul 3, 2025 at 10:01 AM Jonathan Wakely wrote:
> >>
> >> Meaning that this would fail:
> >>
> >> setlocale(LC_ALL, "de_DE.iso8859-1");
> >> char uuml = 0xFC; // lowercase u with umlaut
> >> char c = std::toupper(uuml);
> >> constexpr char cc = std::toupper(uuml);
> >> assert( c == cc );
> >
> >
> > With regard to Unicode:
> > * The Standard mentions Unicode and allows for Unicode escape
> > sequences (e.g. "\u00f1")
> > * The first 128 characters in Unicode (up to 0x7F) are ASCII
> > * The remaining characters up to 0xFF are ISO-8859-1 (aka Latin-1)
> >
> > Therefore it makes sense that the Standard would provide inline
> > constexpr functions like:
> >
> > namespace std {
> > namespace unicode {
> > inline constexpr char32_t tolower(char32_t) { . . . }
> > }
> > }
>
> You might think that - it seems reasonable at first sight, for people
> used to nothing but a limited subset of Latin character uses. The
> reality of capitalisation is very much more complicated, however.
> Issues include :
>
> The Latin letter "i" capitalises to "I" in most languages - but in
> Turkish languages, it capitalises to "İ" while "I" is the capital form
> of "ı".
>
> In German languages, "ß" is sometimes capitalised to "ẞ", sometimes to
> the digraph (two letters) "SS".
>
> Many languages have letter combinations that are sometimes capitalised
> together, sometimes not. The Dutch name for the country "Iceland" is
> "IJsland", as the digraph "ij" is treated as a single letter for
> capitalisation purposes.
>
> Converting from capitals to lower case is typically even more
> complicated - the lowercase of the Greek capital "Σ" can be "ς" or "σ"
> depending on its position in the word. The handling of the iota
> subscript in case changes is a typographer's nightmare. Even in some
> names in plain ASCII, there are complications - try converting
> "MacBride" to capitals and back again.
>
> It would be very nice if it were possible to make such universal
> capitalisation functions like you suggest, but it is not possible.
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
Received on 2025-07-06 20:20:21