Date: Tue, 8 Jul 2025 12:49:33 -0400
The authoritative tables for case mapping are found at
https://www.unicode.org/Public/16.0.0/ucd/, specifically CaseFolding.txt
for case folding. It's all premade.

CaseFolding.txt is an 87 KB text file, most of which is comments. It's about
1654 lines, so assuming two UTF-32 characters (8 bytes) a line, that's a
little under 13 KiB.

Of course, this includes complex case mappings of one codepoint to multiple
codepoints (the entries with status F in that file). If we drop those, we
can make the table a bit smaller.
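
To make the size estimate concrete, here is a rough sketch of what a
simple-folding table could look like in constexpr form. It is only an
illustration, not a proposal: the names FoldEntry, kFoldTable and simple_fold
are made up for this mail, and the six entries are a hand-copied excerpt of
one-to-one mappings (status C in CaseFolding.txt), not the full table. The
sketch stores sorted pairs and uses a binary search rather than one big
array indexed by code point.

```cpp
#include <array>
#include <cstddef>

// Hypothetical entry type: one source code point and its simple fold.
struct FoldEntry {
    char32_t from;
    char32_t to;
};

// Hand-copied excerpt of one-to-one mappings, sorted by `from`.
// A real table would be generated from CaseFolding.txt.
constexpr std::array<FoldEntry, 6> kFoldTable{{
    {0x00C5, 0x00E5},  // Å -> å
    {0x00C6, 0x00E6},  // Æ -> æ
    {0x00C9, 0x00E9},  // É -> é
    {0x00D8, 0x00F8},  // Ø -> ø
    {0x0118, 0x0119},  // Ę -> ę
    {0x0141, 0x0142},  // Ł -> ł
}};

// The size estimate from above: 1654 lines, two 4-byte code points each.
static_assert(1654 * 2 * 4 < 13 * 1024, "a little under 13 KiB");

// Simple (one code point to one code point) case fold.
// Code points that are not in the table are returned unchanged.
constexpr char32_t simple_fold(char32_t c) {
    if (c < 0x80) {  // ASCII fast path, no table lookup needed
        return (c >= U'A' && c <= U'Z') ? static_cast<char32_t>(c + 0x20) : c;
    }
    // Binary search: O(log n) steps, which keeps constexpr evaluation cheap.
    std::size_t lo = 0;
    std::size_t hi = kFoldTable.size();
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (kFoldTable[mid].from < c) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    if (lo < kFoldTable.size() && kFoldTable[lo].from == c) {
        return kFoldTable[lo].to;
    }
    return c;
}

static_assert(simple_fold(U'A') == U'a', "ASCII fast path");
static_assert(simple_fold(0x00D8) == 0x00F8, "table lookup");
```

With the full set of simple mappings the search would take about 11 steps
per character, so it should stay well within default constexpr step limits.
A directly indexed array would be faster per lookup but much larger, since
most code points have no mapping at all.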
On Tue, Jul 8, 2025, 12:37 p.m. David Brown via Std-Proposals <
std-proposals_at_[hidden]> wrote:
>
>
> On 08/07/2025 17:23, Thiago Macieira via Std-Proposals wrote:
> > On Tuesday, 8 July 2025 01:24:37 Pacific Daylight Time David Brown via
> > Std-Proposals wrote:
> >> OK, that makes it less of an issue. But people would still have to
> >> update the C++ toolchain for new Unicode characters, rather than just
> >> pulling the latest versions of the Unicode functions from the github
> >> repository.
> >
> > If there's such a thing as a database you can download from GitHub or
> > elsewhere, then the functionality doesn't need to be in the Standard nor
> > should it. It should be a simple library that one *can* update at will.
> >
>
> There is <https://github.com/unicode-org>, but I have not looked at what
> is on it. I would guess that the tables needed for doing case folding
> are there somewhere, as well as tables for character classification.
>
> > The challenge is having those large tables in constexpr form and
> > ensuring that they run in a reasonable number of steps so people don't
> > have to add command-line switches to their builds.
>
> Sure. As I understand it (and this is getting significantly beyond the
> details I am sure of), simple Unicode case-folding can be done character
> by character. Basically, you'll need a big array of type char32_t,
> indexed by the character codes. Handling such a thing sounds like an
> ideal use for C++26 #embed. But you would need the tables in an
> appropriate form first, or it would quickly become inefficient at
> compile time. (If you tell me this is impractical or impossible, I'll
> take your word for it - this is getting quite speculative on my part.)
>
> >
> >> I think the key users here are those who are okay with half-baked
> >> solutions but need more than ASCII - those targeting just one language
> >> or script, but not plain ASCII English. It is quite easy to imagine
> >> that the programming world is divided into ASCII-only US English-only
> >> and full multi-lingual multi-script international code. The reality is
> >> that for a lot of programming around the world, code is single-language
> >> single-script single character encoding, but that language is not
> >> English and that character encoding is UTF-8 and not just plain ASCII.
> >>
> >> I live in Norway. So for much of what I write, I want support for the
> >> Norwegian characters Æ, Ø and Å, and their lower-case æ, ø and å. I
> >> also want support for characters in English that are used occasionally -
> >> I want to write "naïve" and "café" as they are supposed to be written in
> >> English (yes, I am British, old-fashioned, and pretend to be prejudiced
> >> against US English), along with the odd foreign word like "señor". But
> >> (for most of my work) I am not at all bothered about Chinese, or Ancient
> >> Mayan, or the capitalisation of Turkish "i". ASCII "to_lower" is not
> >> sufficient for me, but the simple Unicode case-fold function would be
> >> perfect.
> >
> > I disagree. I think that's exactly the separation: ASCII-only and
> > everything else. The former is used for fixed protocols that are
> > case-insensitive and make sense to implement in constexpr too. For
> > everything else, you should get the whole Unicode shebang, not a subset.
> > Your use-case is ill-defined: I also lived in Norway and like you, did
> > need æ, ø and å, as well as é for my middle name; but I also had a Polish
> > colleague whose name had "ę". I'd be pretty disappointed if an
> > application from the Norwegian government allowed/handled my middle name
> > (José) but not his first name (Jędrzej). At worst, the limitation should
> > be on the Latin script instead of Latin 1, but that's still very
> > open-ended.
> >
>
> I am not suggesting that I would want to write code that only supported
> ASCII and Norwegian letters - that would be going back to the "good old
> days" of 8-bit code pages. I am saying that I do not need all of
> Unicode - but I /do/ want basic UTF-8 support. I might be interested in
> case-insensitive comparisons of words, and I'd like them to work for the
> languages which are realistic for the code I write and where it is used.
> But I don't want to spend significant development and testing time on
> the details of languages that are highly unlikely to be used. I am
> primarily thinking about the languages supported by the code here - not
> languages that particular users might speak. If I am writing a program
> for use here in Norway, I would expect to have Norwegian as the program
> language. I might support English as an option too. Perhaps it will be
> successful and be used enough in other Scandinavian countries that it is
> worth making Danish and Swedish texts. There may come a point where it
> is worth doing full international text support, but for most programs,
> that point never comes.
>
> So yes, I would expect my programs to handle Polish letters in a name
> perfectly well. But I would not want to spend time and effort (in
> development or at run-time) worrying about how the specific Polish
> letters might change in capitalisation or case-folding in different
> circumstances - a single all-or-nothing reasonably accurate general
> Unicode function would be fine for a program that does not claim to
> support Polish.
>
> > There are a few other things that do more-than-ASCII case-insensitive
> > protocols. One that comes to mind is IRIs - internationalised URLs -
> > which do full Unicode but at a specific version of it so that multiple
> > implementations agree on what is equivalent or not. Another is filenames
> > on case-insensitive filesystems like Microsoft's and Apple's - for those,
> > I'd suggest you don't attempt to copy their possibly-buggy behaviour and
> > instead *ask* the system, which immediately means you couldn't do this at
> > constexpr time.
> >
>
> Agreed.
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>
Received on 2025-07-08 16:49:46