Date: Tue, 8 Jul 2025 10:24:37 +0200
On 08/07/2025 09:51, Jan Schultke wrote:
>> As I have said in other posts, I think it is clear that you cannot have
>> a locale-independent "toLower" or "toUpper" other than for very
>> restricted cases (basically, plain ASCII).
>
> This seems to have gotten lost in the conversation; I have a proposal
> that proposes exactly that:
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3688r0.html
>
Yes, I saw that.
>> But I don't think it should
>> be infeasible to implement the standard Unicode case-folding functions
>> as constexpr functions in the C++ standard. The key argument (and I
>> think perhaps an overriding argument) against it, as I see it, would be
>> the changeability of the functions - as more characters are added to
>> Unicode, the Unicode folk will update their github with new tables and
>> functions, and C++ programmers should not have to wait for a new C++
>> standard version to be implemented before updating their code.
>
> This isn't really a problem right now because the C++ standard has a
> floating dependency on the Unicode standard, so implementers are free
> to pull in the latest changes regardless of the C++ standard.
OK, that makes it less of an issue. But people would still have to
update the C++ toolchain for new Unicode characters, rather than just
pulling the latest versions of the Unicode functions from the GitHub
repository. However, I suppose it would not affect many people in
practice - not many programs will have to support scripts that are not
yet in Unicode, such as Tengwar or Klingon, and the never-ending march
of emoji does not involve different cases.
>
>> Certainly it should be clear that such a case-folding function is a
>> Unicode case-folding function, /not/ the correct way to compare strings
>> in a given language or locale. Full string comparison is not only
>> locale dependent, but also context dependent - sometimes names will be
>> compared in a slightly different manner from other words, for example.
>> As has been said by others, for big pieces of software that take this
>> seriously, you need dedicated and specialist developers - not a couple
>> of functions in the standard library.
>
> Agreed, it's honestly too complex and domain-specific to be in the C++
> standard. This is more within the scope of the ICU, for those who need
> fully-fledged Unicode case transformations/comparisons.
>
>> On the other hand, there can be cases where you, as the developer,
>> control the use of language - the Unicode simple case-folding system
>> could be good enough and provide a consistent, efficient and constexpr
>> solution that is independent of any locale settings on the host system.
>
> The people who don't care enough about the details can probably get
> away with ASCII-only support, and the people who need full Unicode
> support may not be okay with "half-baked" solutions.
>
> Someone would need to do the research and figure out what is done in
> practice. Any software already using the ICU would see very little
> benefit from a few standard functions that have 1% of its
> functionality. To be fair, people use CSS case transformations on
> websites all the time, and those also just use the locale-independent
> case mappings.
I think the key users here are those who are okay with half-baked
solutions but need more than ASCII - those targeting just one language
or script, but not plain ASCII English. It is quite easy to imagine
that the programming world is divided into ASCII-only US English-only
and full multi-lingual multi-script international code. The reality is
that for a lot of programming around the world, code is single-language,
single-script and single-character-encoding, but that language is not
English and that character encoding is UTF-8 and not just plain ASCII.
I live in Norway. So for much of what I write, I want support for the
Norwegian characters Æ, Ø and Å, and their lower-case æ, ø and å. I
also want support for characters in English that are used occasionally -
I want to write "naïve" and "café" as they are supposed to be written in
English (yes, I am British, old-fashioned, and pretend to be prejudiced
against US English), along with the odd foreign word like "señor". But
(for most of my work) I am not at all bothered about Chinese, or Ancient
Mayan, or the capitalisation of Turkish "i". ASCII "to_lower" is not
sufficient for me, but the simple Unicode case-fold function would be
perfect.
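
To make the idea concrete, here is a minimal sketch of what a
locale-independent, constexpr simple case-fold could look like. It is an
illustration only: the names fold_latin1 and equal_folded are
hypothetical, it is not the interface of P3688 or of any existing
library, and it assumes already-decoded code points and covers only
U+0000..U+00FF. A real version would be generated from the Unicode
CaseFolding.txt data.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: simple case folding restricted to code points
// up to U+00FF (ASCII plus Latin-1 Supplement). Anything outside that
// range is returned unchanged, which is NOT what full Unicode does.
constexpr char32_t fold_latin1(char32_t c) {
    // ASCII A-Z map to a-z.
    if (c >= U'A' && c <= U'Z') return c + 0x20;
    // U+00C0..U+00DE map to U+00E0..U+00FE, except U+00D7 (the
    // multiplication sign), which has no case.
    if (c >= 0x00C0 && c <= 0x00DE && c != 0x00D7) return c + 0x20;
    // MICRO SIGN folds to GREEK SMALL LETTER MU.
    if (c == 0x00B5) return 0x03BC;
    // Everything else (including U+00DF, which has only a full,
    // multi-character folding) is left alone.
    return c;
}

// Hypothetical helper: caseless comparison of two code point sequences
// using the restricted fold above.
constexpr bool equal_folded(std::u32string_view a, std::u32string_view b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (fold_latin1(a[i]) != fold_latin1(b[i])) return false;
    }
    return true;
}

// Evaluated at compile time, independent of any host locale.
static_assert(fold_latin1(U'\u00C6') == U'\u00E6');   // AE ligature
static_assert(fold_latin1(U'\u00D8') == U'\u00F8');   // O with stroke
static_assert(fold_latin1(U'\u00C5') == U'\u00E5');   // A with ring
static_assert(equal_folded(U"\u00D8L", U"\u00F8l"));
```

Note that this compares for equality only; it says nothing about
ordering, and it deliberately ignores the Turkish "i" question.
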
Of course people writing seriously international code will need serious
internationalisation libraries. But I believe (without having done the
research that I agree would be useful) that there is a vast amount of
software written that is neither significantly multi-lingual, nor plain
ASCII English.
David
Received on 2025-07-08 08:24:43