Date: Tue, 8 Jul 2025 18:37:10 +0200
On 08/07/2025 17:23, Thiago Macieira via Std-Proposals wrote:
> On Tuesday, 8 July 2025 01:24:37 Pacific Daylight Time David Brown via Std-
> Proposals wrote:
>> OK, that makes it less of an issue. But people would still have to
>> update the C++ toolchain for new Unicode characters, rather than just
>> pulling the latest versions of the Unicode functions from the github
>> repository.
>
> If there's such a thing as a database you can download from GitHub or
> elsewhere, then the functionality doesn't need to be in the Standard nor
> should it. It should be a simple library that one *can* update at will.
>
There is <https://github.com/unicode-org>, but I have not looked at what
is on it. I would guess that the tables needed for doing case folding
are there somewhere, as well as tables for character classification.
> The challenge is having those large tables in constexpr form and ensuring
> that they run in a reasonable number of steps so people don't have to add
> command-line switches to their builds.
Sure. As I understand it (and this is getting significantly beyond the
details I am sure of), simple Unicode case-folding can be done character
by character. Basically, you'll need a big array of type char32_t,
indexed by the character codes. Handling such a thing sounds like an
ideal use for C++26 #embed. But you would need the tables in an
appropriate form first, or it would quickly become inefficient at
compile time. (If you tell me this is impractical or impossible, I'll
take your word for it - this is getting quite speculative on my part.)
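To make that concrete, here is the sort of thing I am imagining - purely a
speculative sketch, and it assumes a pre-generated binary file
"simple_fold.bin" holding one little-endian 32-bit folded code point per
input code point (that file and its layout are my own invention, not
something taken from the Unicode repository):

    #include <cstddef>

    // Speculative sketch: the table file is assumed to contain one
    // little-endian 32-bit folded code point per input code point.
    constexpr unsigned char fold_bytes[] = {
    #embed "simple_fold.bin"
    };

    // Per-code-point simple case fold: look the character up in the
    // embedded table, reassembling the four bytes of each entry.
    constexpr char32_t simple_fold(char32_t c)
    {
        constexpr std::size_t entries = sizeof(fold_bytes) / 4;
        if (c >= entries)
            return c;                  // outside the table - folds to itself
        const std::size_t i = std::size_t(c) * 4;
        return char32_t(fold_bytes[i])
             | char32_t(fold_bytes[i + 1]) << 8
             | char32_t(fold_bytes[i + 2]) << 16
             | char32_t(fold_bytes[i + 3]) << 24;
    }

    // static_assert(simple_fold(U'Å') == U'å');  // should hold if the table is right

(A flat array indexed all the way up to U+10FFFF would be over 4 MB, so the
"appropriate form" would presumably be something two-level or otherwise
compressed - but that is exactly the sort of detail I am hand-waving over.)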
>
>> I think the key users here are those who are okay with half-baked
>> solutions but need more than ASCII - those targeting just one language
>> or script, but not plain ASCII English. It is quite easy to imagine
>> that the programming world is divided into ASCII-only, US-English-only
>> code and full multi-lingual, multi-script international code. The
>> reality is that for a lot of programming around the world, code is
>> single-language, single-script and single character encoding, but that
>> language is not English and that character encoding is UTF-8 and not
>> just plain ASCII.
>>
>> I live in Norway. So for much of what I write, I want support for the
>> Norwegian characters Æ, Ø and Å, and their lower-case æ, ø and å. I
>> also want support for characters in English that are used occasionally -
>> I want to write "naïve" and "café" as they are supposed to be written in
>> English (yes, I am British, old-fashioned, and pretend to be prejudiced
>> against US English), along with the odd foreign word like "señor". But
>> (for most of my work) I am not at all bothered about Chinese, or Ancient
>> Mayan, or the capitalisation of Turkish "i". ASCII "to_lower" is not
>> sufficient for me, but the simple Unicode case-fold function would be
>> perfect.
>
> I disagree. I think that's exactly the separation: ASCII-only and everything
> else. The former is used for fixed protocols that are case-insensitive and make
> sense to implement in constexpr too. For everything else, you should get the
> whole Unicode shebang, not a subset. Your use-case is ill-defined: I also lived
> in Norway and, like you, needed æ, ø and å, as well as é for my middle name;
> but I also had a Polish colleague whose name had "ę". I'd be pretty
> disappointed if an application from the Norwegian government
> allowed/handled my middle name (José) but not his first name (Jędrzej). At
> worst, the limitation should be on the Latin script instead of Latin 1, but
> that's still very open-ended.
>
I am not suggesting that I would want to write code that only supported
ASCII and Norwegian letters - that would be going back to the "good old
days" of 8-bit code pages. I am saying that I do not need all of
Unicode - but I /do/ want basic UTF-8 support. I might be interested in
case-insensitive comparisons of words, and I'd like them to work for the
languages which are realistic for the code I write and where it is used.
But I don't want to spend significant development and testing time on
the details of languages that are highly unlikely to be used. I am
primarily thinking about the languages supported by the code here - not
languages that particular users might speak. If I am writing a program
for use here in Norway, I would expect Norwegian to be the program's
language. I might support English as an option too. Perhaps it will be
successful and used enough in other Scandinavian countries that it is
worth making Danish and Swedish translations. There may come a point where it
is worth doing full international text support, but for most programs,
that point never comes.
So yes, I would expect my programs to handle Polish letters in a name
perfectly well. But I would not want to spend time and effort (in
development or at run-time) worrying about how the specific Polish
letters might change in capitalisation or case-folding in different
circumstances - a single, reasonably accurate, general-purpose Unicode
function would be fine for a program that does not claim to support
Polish.
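To illustrate what I mean by "fine", something along these lines would be
quite sufficient for me - again only a sketch, building on the hypothetical
per-code-point simple_fold() from earlier, and ignoring UTF-8 decoding and
anything language-specific:

    #include <cstddef>
    #include <string_view>

    // Assumed to exist - the per-code-point simple case fold sketched
    // earlier in this mail.
    constexpr char32_t simple_fold(char32_t c);

    // "Good enough" case-insensitive equality: fold each code point and
    // compare, with no language-specific tailoring at all.  Simple folding
    // is one-to-one, so strings of different lengths can never match.
    constexpr bool equal_fold(std::u32string_view a, std::u32string_view b)
    {
        if (a.size() != b.size())
            return false;
        for (std::size_t i = 0; i < a.size(); ++i)
            if (simple_fold(a[i]) != simple_fold(b[i]))
                return false;
        return true;
    }

That would give the Polish name the same treatment as the Norwegian one
without me having to know anything about either language.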
> There are a few other things that do more-than-ASCII case-insensitive
> protocols. One that comes to mind is IRIs - internationalised URLs - which do
> full Unicode but at a specific version of it so that multiple implementations
> agree on what is equivalent or not. Another is filenames on case-insensitive
> filesystems like Microsoft's and Apple's - for those, I'd suggest you don't
> attempt to copy their possibly-buggy behaviour and instead *ask* the system,
> which immediately means you couldn't do this at constexpr time.
>
Agreed.
Received on 2025-07-08 16:37:17