ISOCPP std-proposals List: Re: [std-proposals] Formatting code points to character names

From: Jan Schultke <janschultke_at_[hidden]>
Date: Fri, 23 May 2025 10:23:57 +0200

>
> It would add about 2MB to the on-disk and in-memory footprint of every C++
> application, for something most programs will never use.
>
> The data file is publicly available, if your application needs to
> translate U+NNNN to names then it can figure out how to do that as a
> post-processing step. I don't think everybody needs this functionality.

Firstly, I think 2MB is extremely pessimistic. There are existing
implementations that take a small fraction of that, such as
https://godbolt.org/z/4arrY6hjv. The existing implementations are still
fairly brute-force in that they don't exploit much Unicode-specific knowledge.
For example, code points within the same block often have almost identical
names, so it would seem much better to divide and conquer than to treat
all names as one big string to compress. I suspect you can get it below 100
KiB, or even below 50 KiB, but the burden of proof is obviously on me.
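
As a rough illustration of the divide-and-conquer idea, here is a minimal
sketch. It is not the linked implementation and not a proposed design: the
names NameRange, SuffixKind, and name_of are hypothetical, and the table
covers only four ranges. Each range stores its shared prefix once and
derives the rest of the name from the offset within the range.

```cpp
#include <array>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

namespace sketch {

// Suffix words stored once, no matter how many ranges reuse them.
constexpr std::array<std::string_view, 10> digit_words{
    "ZERO", "ONE", "TWO", "THREE", "FOUR",
    "FIVE", "SIX", "SEVEN", "EIGHT", "NINE"};

// How the part after the shared prefix is derived (hypothetical scheme).
enum class SuffixKind { letter, digit_word, hex_code_point };

struct NameRange {
    char32_t first;
    char32_t last;
    std::string_view prefix; // shared by every code point in the range
    SuffixKind kind;
};

// Tiny illustrative table; a real one would cover all named code points.
// The CJK range is the main block as assigned in recent Unicode versions.
constexpr std::array<NameRange, 4> ranges{{
    {0x0030, 0x0039, "DIGIT ", SuffixKind::digit_word},
    {0x0041, 0x005A, "LATIN CAPITAL LETTER ", SuffixKind::letter},
    {0x0061, 0x007A, "LATIN SMALL LETTER ", SuffixKind::letter},
    {0x4E00, 0x9FFF, "CJK UNIFIED IDEOGRAPH-", SuffixKind::hex_code_point},
}};

// Returns the character name, or nullopt if the sketch does not cover cp.
inline std::optional<std::string> name_of(char32_t cp) {
    for (const NameRange& r : ranges) {
        if (cp < r.first || cp > r.last) continue;
        std::string name(r.prefix);
        const auto offset = static_cast<std::size_t>(cp - r.first);
        switch (r.kind) {
        case SuffixKind::letter:
            // Assumes 'A'..'Z' are contiguous in the execution character set.
            name += static_cast<char>('A' + offset);
            break;
        case SuffixKind::digit_word:
            name += digit_words[offset];
            break;
        case SuffixKind::hex_code_point: {
            // Uppercase hex, padded to at least four digits.
            std::string hex;
            for (char32_t v = cp; v != 0; v >>= 4)
                hex.insert(hex.begin(), "0123456789ABCDEF"[v & 0xF]);
            while (hex.size() < 4) hex.insert(hex.begin(), '0');
            name += hex;
            break;
        }
        }
        return name;
    }
    return std::nullopt;
}

} // namespace sketch
```

With this table, sketch::name_of(0x4E00) yields "CJK UNIFIED IDEOGRAPH-4E00",
so roughly 21,000 names cost one table row. Whether the irregular names
compress to the sizes I suspect is exactly what still needs to be measured.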

The argument of "not everybody needs this functionality" is also not
particularly strong. Not everybody needs mathematical special functions,
<simd>, executors, filesystems, <linalg>, multi-threading, parallel
algorithms, std::hive and various other rarely-used containers, and many
more features. Admittedly, not all of these contribute to binary or memory
size; I'm just pointing out that features can be useful and appropriate for
standardization even if only a fraction of developers use them.

This idea probably lives or dies with how small you can make that
footprint. If it's only around 50 KiB, it can be easily justified, and the
linker can discard unused symbols anyway in the case of static
linking. I do agree that something like 2MB would be excessive for a
feature like this.

Received on 2025-05-23 08:24:13