On Tue, Jul 8, 2025 at 1:55 PM Thiago Macieira via Std-Proposals
<std-proposals@lists.isocpp.org> wrote:
>
> On Tuesday, 8 July 2025 09:49:33 Pacific Daylight Time JJ Marr via Std-
> Proposals wrote:
> > CaseFolding.txt is a 87 KB text file, most of which is comments. It's about
> > 1654 lines, so assuming two UTF-32 characters a line that's a little under
> > 13 KiB.
>
> They also appear in ranges with predictable changes, like adding or
> subtracting 0x20. That means the codegen can be significantly better than one
> 13 kB table.
>
> > Of course, this includes complex case mappings of one codepoint to multiple
> > codepoints. If we drop those, we can make the table a bit smaller.
>
> You can't drop them. Case-mapping is a string operation.

But *simple* case folding is not. "Simple case folding" is a specific
Unicode-defined subset of general case folding that is locale-independent
and maps each code point to exactly one code point.

That 1:1 mapping holds for code points, though, not for any particular
*encoding* of them. A UTF-8 string may have a different encoded length
after simple case folding than it had before.
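
To make the encoded-length point concrete, here is a minimal sketch. The
names utf8_length and simple_fold_subset are made up for illustration and
are not part of any proposal. The fold function covers only ASCII A-Z plus
two entries I picked by hand from the status-C lines of CaseFolding.txt; a
real implementation would be generated from the full data file.

```cpp
#include <cstddef>

// Number of bytes needed to encode one Unicode scalar value in UTF-8.
constexpr std::size_t utf8_length(char32_t cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Hypothetical, deliberately tiny subset of simple case folding.
// Anything not listed here is returned unchanged, which is NOT correct
// for the full Unicode repertoire.
constexpr char32_t simple_fold_subset(char32_t cp) {
    // A contiguous range with a predictable delta (add 0x20).
    if (cp >= U'A' && cp <= U'Z') return cp + 0x20;
    switch (cp) {
        case 0x212A: return 0x006B;  // KELVIN SIGN -> LATIN SMALL LETTER K
        case 0x023A: return 0x2C65;  // LATIN CAPITAL LETTER A WITH STROKE
                                     //   -> LATIN SMALL LETTER A WITH STROKE
        default:     return cp;
    }
}

// One code point in, one code point out, but the UTF-8 length can shrink...
static_assert(utf8_length(0x212A) == 3, "U+212A is 3 bytes in UTF-8");
static_assert(utf8_length(simple_fold_subset(0x212A)) == 1, "folds to 1 byte");

// ...or grow.
static_assert(utf8_length(0x023A) == 2, "U+023A is 2 bytes in UTF-8");
static_assert(utf8_length(simple_fold_subset(0x023A)) == 3, "folds to 3 bytes");
```

So an API that folds UTF-8 in place, or that assumes the output buffer is
the same size as the input, does not work even when restricted to simple
case folding.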
--
Std-Proposals mailing list
Std-Proposals@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals