Date: Tue, 08 Jul 2025 08:23:40 -0700
On Tuesday, 8 July 2025 01:24:37 Pacific Daylight Time David Brown via Std-
Proposals wrote:
> OK, that makes it less of an issue. But people would still have to
> update the C++ toolchain for new Unicode characters, rather than just
> pulling the latest versions of the Unicode functions from the github
> repository.
If there's such a thing as a database you can download from GitHub or
elsewhere, then the functionality doesn't need to be in the Standard nor
should it. It should be a simple library that one *can* update at will.
The challenge is having those large tables in constexpr form and ensuring that
they run in a reasonable number of steps so people don't have to add
command-line switches to their builds.
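To illustrate the table problem, here is a minimal sketch (C++17) of a constexpr simple case-fold lookup. The names FoldEntry, kFoldTable and simple_fold are hypothetical, and the five entries are only a hand-picked excerpt; a real library would presumably generate the full table from the Unicode CaseFolding.txt data. A binary search over a sorted table keeps the number of constant-evaluation steps per lookup logarithmic in the table size.

```cpp
#include <array>
#include <cstddef>

// One simple case-folding mapping: code point -> folded code point.
struct FoldEntry {
    char32_t from;
    char32_t to;
};

// Hypothetical excerpt, sorted by `from`. Illustration only: a real table
// would be generated from the Unicode data files and be far larger.
inline constexpr std::array<FoldEntry, 5> kFoldTable{{
    {0x00C5, 0x00E5},  // Å -> å
    {0x00C6, 0x00E6},  // Æ -> æ
    {0x00C9, 0x00E9},  // É -> é
    {0x00D8, 0x00F8},  // Ø -> ø
    {0x0118, 0x0119},  // Ę -> ę
}};

constexpr char32_t simple_fold(char32_t c) {
    // ASCII fast path: no table access needed.
    if (c < 0x80) {
        return (c >= U'A' && c <= U'Z') ? static_cast<char32_t>(c + 0x20) : c;
    }
    // Lower-bound binary search over the sorted table.
    std::size_t lo = 0;
    std::size_t hi = kFoldTable.size();
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (kFoldTable[mid].from < c) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    if (lo < kFoldTable.size() && kFoldTable[lo].from == c) {
        return kFoldTable[lo].to;
    }
    return c;  // Not in this excerpt: returned unchanged.
}

// Evaluated entirely at compile time.
static_assert(simple_fold(U'A') == U'a');
static_assert(simple_fold(0x00D8) == 0x00F8);
```

Since the table lives in an ordinary header, it can be regenerated for a new Unicode version without touching the toolchain.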
> I think the key users here are those who are okay with half-baked
> solutions but need more than ASCII - those targeting just one language
> or script, but not plain ASCII English. It is quite easy to imagine
> that the programming world is divided into ASCII-only US English-only
> and full multi-lingual multi-script international code. The reality is
> that for a lot programming around the world, code is single-language
> single-script single character encoding, but that language is not
> English and that character encoding is UTF-8 and not just plain ASCII.
>
> I live in Norway. So for much of what I write, I want support for the
> Norwegian characters Æ, Ø and Å, and their lower-case æ, ø and å. I
> also want support for characters in English that are used occasionally -
> I want to write "naïve" and "café" as they are supposed to be written in
> English (yes, I am British, old-fashioned, and pretend to be prejudice
> against US English), along with the odd foreign word like "señor". But
> (for most of my work) I am not at all bothered about Chinese, or Ancient
> Mayan, or the capitalisation of Turkish "i". ASCII "to_lower" is not
> sufficient for me, but the simple Unicode case-fold function would be
> perfect.
I disagree. I think that's exactly the separation: ASCII-only and everything
else. The former is used for fixed protocols that are case-insensitive and make
sense to implement in constexpr too. For everything else, you should get the
whole Unicode shebang, not a subset. Your use-case is ill-defined: I also lived
in Norway and like you, did need æ, ø and å, as well as é for my middle name;
but I also had a Polish colleague whose name had "ę". I'd be pretty
disappointed if an application from the Norwegian government
allowed/handled my middle name (José) but not his first name (Jędrzej). At
worst, the limitation should be on the Latin script instead of Latin 1, but
that's still very open-ended.
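For the ASCII-only side of that split, a minimal sketch (C++17) of a constexpr case-insensitive comparison of the kind a fixed protocol would need. The names ascii_to_lower and ascii_iequals are hypothetical and not taken from any proposal. Bytes outside A-Z are compared verbatim, so UTF-8 sequences only match when they are byte-for-byte identical.

```cpp
#include <cstddef>
#include <string_view>

// Lower-cases A-Z only; every other byte is returned unchanged.
constexpr char ascii_to_lower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Case-insensitive comparison for ASCII letters only.
constexpr bool ascii_iequals(std::string_view a, std::string_view b) {
    if (a.size() != b.size()) {
        return false;
    }
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (ascii_to_lower(a[i]) != ascii_to_lower(b[i])) {
            return false;
        }
    }
    return true;
}

// Typical protocol token: matches regardless of ASCII case.
static_assert(ascii_iequals("Content-Type", "content-type"));
// UTF-8 "café" vs "CAFÉ": deliberately not equal, since é/É are not ASCII.
static_assert(!ascii_iequals("caf\xC3\xA9", "CAF\xC3\x89"));
```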
There are a few other things that do more-than-ASCII case-insensitive
protocols. One that comes to mind is IRIs - internationalised URLs - which do
full Unicode but at a specific version of it so that multiple implementations
agree on what is equivalent or not. Another is filenames on case-insensitive
filesystems like Microsoft's and Apple's - for those, I'd suggest you don't
attempt to copy their possibly-buggy behaviour and instead *ask* the system,
which immediately means you couldn't do this at constexpr time.
-- Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org Principal Engineer - Intel Platform & System Engineering
Received on 2025-07-08 15:23:42