std-proposals

Re: [std-proposals] constexpr tolower, toupper, isalpha

From: Jan Schultke <janschultke_at_[hidden]>
Date: Tue, 8 Jul 2025 11:07:51 +0200
👍

Jan Schultke reacted via Gmail

On Tue, 8 Jul 2025 at 10:24, David Brown <david.brown_at_[hidden]> wrote:

>
> On 08/07/2025 09:51, Jan Schultke wrote:
> >> As I have said in other posts, I think it is clear that you cannot have
> >> a locale-independent "toLower" or "toUpper" other than for very
> >> restricted cases (basically, plain ASCII).
> >
> > This seems to have gotten lost in the conversation; I have a proposal
> > that proposes exactly that:
> > https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3688r0.html
> >
>
> Yes, I saw that.
>
> >> But I don't think it should
> >> be infeasible to implement the standard Unicode case-folding functions
> >> as constexpr functions in the C++ standard. The key argument (and I
> >> think perhaps an overriding argument) against it, as I see it, would be
> >> the changeability of the functions - as more characters are added to
> >> Unicode, the Unicode folk will update their github with new tables and
> >> functions, and C++ programmers should not have to wait for a new C++
> >> standard version to be implemented before updating their code.
> >
> > This isn't really a problem right now because the C++ standard has a
> > floating dependency on the Unicode standard, so implementers are free
> > to pull in the latest changes regardless of the C++ standard.
>
> OK, that makes it less of an issue. But people would still have to
> update the C++ toolchain for new Unicode characters, rather than just
> pulling the latest versions of the Unicode functions from the github
> repository. However, I suppose it would not affect many people in
> practice - not many programs will have to support scripts that are not
> yet in Unicode, such as Tengwar or Klingon, and the never-ending march
> of new emojis does not involve different cases.
>
> >
> >> Certainly it should be clear that such a case-folding function is a
> >> Unicode case-folding function, /not/ the correct way to compare strings
> >> in a given language or locale. Full string comparison is not only
> >> locale dependent, but also context dependent - sometimes names will be
> >> compared in a slightly different manner from other words, for example.
> >> As has been said by others, for big pieces of software that take this
> >> seriously, you need dedicated and specialist developers - not a couple
> >> of functions in the standard library.
> >
> > Agreed, it's honestly too complex and domain-specific to be in the C++
> > standard. This is more within the scope of the ICU, for those who need
> > fully-fledged Unicode case transformations/comparisons.
> >
> >> On the other hand, there can be cases where you, as the developer,
> >> control the use of language - the Unicode simple case-folding system
> >> could be good enough and provide a consistent, efficient and constexpr
> >> solution that is independent of any locale settings on the host system.
> >
> > The people who don't care enough about the details can probably get
> > away with ASCII-only support, and the people who need full Unicode
> > support may not be okay with "half-baked" solutions.
> >
> > Someone would need to do the research and figure out what is done in
> > practice. Any software already using the ICU would see very little
> > benefit from a few standard functions that have 1% of its
> > functionality. To be fair, people use CSS case transformations on
> > websites all the time, and those also just use the locale-independent
> > case mappings.
>
> I think the key users here are those who are okay with half-baked
> solutions but need more than ASCII - those targeting just one language
> or script, but not plain ASCII English. It is quite easy to imagine
> that the programming world is divided into ASCII-only US English-only
> and full multi-lingual multi-script international code. The reality is
> that for a lot of programming around the world, code is single-language,
> single-script and single-character-encoding, but that language is not
> English and that character encoding is UTF-8 and not just plain ASCII.
>
> I live in Norway. So for much of what I write, I want support for the
> Norwegian characters Æ, Ø and Å, and their lower-case æ, ø and å. I
> also want support for characters in English that are used occasionally -
> I want to write "naïve" and "café" as they are supposed to be written in
> English (yes, I am British, old-fashioned, and pretend to be prejudiced
> against US English), along with the odd foreign word like "señor". But
> (for most of my work) I am not at all bothered about Chinese, or Ancient
> Mayan, or the capitalisation of Turkish "i". ASCII "to_lower" is not
> sufficient for me, but the simple Unicode case-fold function would be
> perfect.
>
> Of course people writing seriously international code will need serious
> internationalisation libraries. But I believe (without having done the
> research that I agree would be useful) that there is a vast amount of
> software written that is neither significantly multi-lingual, nor plain
> ASCII English.
>
>
> David
>
>
>
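
To make the distinction in the quoted message concrete, here is a minimal
sketch of the two tiers being discussed: an ASCII-only lowercase mapping, and
one that also covers the Latin-1 Supplement letters (Æ, Ø, Å, é, ï, ñ and
friends). The function names are hypothetical and are not the interface
proposed in P3688. The second function is only a tiny slice of the Unicode
simple case mappings; a real implementation would presumably be generated from
the Unicode data files. Both operate on code points, so UTF-8 text would have
to be decoded first.

```cpp
// Hypothetical illustration only; names are made up for this sketch.

// ASCII-only: maps 'A'..'Z' to 'a'..'z' and leaves every other code point alone.
constexpr char32_t ascii_to_lower(char32_t c) noexcept {
    return (c >= U'A' && c <= U'Z') ? static_cast<char32_t>(c + 0x20) : c;
}

// ASCII plus the Latin-1 Supplement block. In that block the uppercase
// letters U+00C0..U+00DE sit exactly 0x20 below their lowercase forms,
// except U+00D7, which is the multiplication sign and has no case.
constexpr char32_t latin1_to_lower(char32_t c) noexcept {
    if (c >= U'A' && c <= U'Z') {
        return static_cast<char32_t>(c + 0x20);
    }
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7) {
        return static_cast<char32_t>(c + 0x20);
    }
    return c;
}

// Everything is usable in constant expressions, independent of any locale.
static_assert(ascii_to_lower(U'Q') == U'q');
static_assert(ascii_to_lower(U'\u00C6') == U'\u00C6');   // Æ is untouched
static_assert(latin1_to_lower(U'\u00C6') == U'\u00E6');  // Æ -> æ
static_assert(latin1_to_lower(U'\u00D8') == U'\u00F8');  // Ø -> ø
static_assert(latin1_to_lower(U'\u00C5') == U'\u00E5');  // Å -> å
static_assert(latin1_to_lower(U'\u00D7') == U'\u00D7');  // × has no case
```

Neither function handles the Turkish dotted/dotless "i" or any mapping that
changes the number of code points, which is exactly the territory the thread
leaves to ICU.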

Received on 2025-07-08 09:08:05