ISOCPP sg16 List: Re: Undated reference to Unicode Standard and UAX #29

From: Jens Maurer <jens.maurer_at_[hidden]>
Date: Sun, 7 Jan 2024 09:19:10 +0100

On 07/01/2024 03.14, Tom Honermann wrote:
> The code points that can be specified via /universal-character-name <http://eel.is/c++draft/lex.charset#nt:universal-character-name>/ don't change, but additional names may become available for use in /named-universal-character <http://eel.is/c++draft/lex.charset#nt:named-universal-character>/.

That's a technically incorrect statement, because /universal-character-name/
includes /named-universal-character/ per the grammar.

The set of code points that can be specified via hex digits doesn't change
depending on the Unicode version; agreed.
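
A minimal sketch of the two spellings discussed above, assuming a compiler that advertises named escapes through the `__cpp_named_character_escapes` feature-test macro. The namespace and variable names are illustrative only. The hex form is accepted regardless of which Unicode version the compiler knows; the `\N{...}` form depends on the compiler's table of character names.

```cpp
namespace sketch_names {

// Hex-digit form of a universal-character-name: independent of the
// Unicode version the compiler was built against.
constexpr char32_t hex_form = U'\u00E9';

#if defined(__cpp_named_character_escapes)
// named-universal-character form: only accepted if the compiler's name
// table contains this character name.
constexpr char32_t named_form = U'\N{LATIN SMALL LETTER E WITH ACUTE}';
static_assert(hex_form == named_form, "both spellings name U+00E9");
#endif

}  // namespace sketch_names
```

Both spellings are alternatives of the same grammar production, which is why a statement about /universal-character-name/ as a whole also covers the named form.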

> It is a fact that parts of the Unicode Standard will necessarily change as a byproduct of continually adding and improving support for the evolving collection of human languages. While we can choose to evolve C++ in some lockstep form with the Unicode Standard, users will nevertheless be exposed to differences in behavior at some point. It is far from clear to me that implementors and programmers benefit by having those changes happen at discrete points.

For any other feature added to C++, we have expressly bought in to a model where
such evolution (and exposure of differences) happens at discrete points, namely
when a new C++ revision is released every three years.

Why are features added to Unicode any different, conceptually?

> From an implementation perspective, having C++23 mode use one Unicode version and C++26 mode use another version seems problematic, at least for implementations that don't provide distinct standard library implementations for each standard mode (as is the case for all major implementors).

We've heard another implementer claim otherwise.
#ifdefs in standard library implementations that key on the selected
standard mode seem quite common.
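
As a rough sketch of what such a switch could look like inside a library header: the macro name, the namespace, and the version numbers 15 and 16 are placeholders invented for illustration, not values taken from any standard or shipping implementation. The only fact relied on is that `__cplusplus` is `202302L` in C++23 mode.

```cpp
// Hypothetical library-internal configuration; all names are assumptions.
#if __cplusplus > 202302L
#  define SKETCH_UNICODE_MAJOR 16  // placeholder for a post-C++23 mode
#else
#  define SKETCH_UNICODE_MAJOR 15  // placeholder for C++23 and earlier
#endif

namespace sketch_lib {

// The Unicode data revision this (hypothetical) library would use in the
// currently selected standard mode.
constexpr int unicode_major = SKETCH_UNICODE_MAJOR;

// True when the newer tables were selected by the standard mode.
constexpr bool uses_newer_tables() { return unicode_major >= 16; }

}  // namespace sketch_lib
```

Whether a single library binary can serve both modes is a separate question; the sketch only shows the header-level selection.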

> As a user, I would like and expect newer compiler versions to provide support for newer Unicode versions independent of whatever standard mode I happen to compile my code with.

I disagree, from a user perspective.

As a user, I foremost want portability: A program working with compiler X claiming
conformance to C++ZZ should work unchanged on a different compiler Y also claiming
conformance to C++ZZ. That portability argument is the only reason we have WG21
to start with. If compiler X gives me newer Unicode than compiler Y, I may have
used newer named-universal-characters or relied on newer Unicode algorithm behavior
when developing my program, only to see it break when moving to compiler Y,
which hasn't yet gotten around to upgrading to the new Unicode version.

That's bad, and in my view much worse than having the users of compiler X wait
three years until they get the new feature. Again, compiler vendors have options
to offer post-standard features to their audience if they so choose; everybody
opting in to such options is aware that their code might be non-portable.
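
The portability hazard can be modelled with a toy name table, sketched below. Everything here is hypothetical: the table, the `lookup` function, and the idea of passing the compiler's Unicode major version as a parameter exist only to illustrate the argument. The one external fact assumed is that SHAKING FACE (U+1FAE8) was added in Unicode 15.0.

```cpp
namespace sketch_port {

struct NameEntry {
    const char* name;     // Unicode character name
    char32_t code_point;  // code point it designates
    int since_major;      // Unicode major version assumed to introduce it
};

// Toy table; a real compiler carries the full Unicode name list.
constexpr NameEntry kNames[] = {
    {"LATIN SMALL LETTER E WITH ACUTE", U'\u00E9', 1},
    {"SHAKING FACE", U'\U0001FAE8', 15},
};

// Compares two NUL-terminated strings for equality.
constexpr bool same(const char* a, const char* b) {
    while (*a != '\0' && *a == *b) { ++a; ++b; }
    return *a == *b;
}

// Returns the code point for `name`, or 0 if a compiler whose tables stop
// at `compiler_unicode_major` would not know the name.
constexpr char32_t lookup(const char* name, int compiler_unicode_major) {
    for (const NameEntry& e : kNames) {
        if (e.since_major <= compiler_unicode_major && same(name, e.name)) {
            return e.code_point;
        }
    }
    return 0;
}

}  // namespace sketch_port
```

In this model, compiler X (tables at major version 15) resolves the name, while compiler Y (tables at 14) rejects the same source text even though both claim the same C++ mode.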

> ABI concerns are just as relevant for minor compiler upgrades as they are for major upgrades these days. Going forward, we should strive to ensure that Unicode features that don't have a strong stability policy are adequately hidden behind an ABI boundary. I don't recall having discussed use of the grapheme breaking algorithm in std::format from an ABI perspective.

That applies regardless of release cadence of changed Unicode features,
but is more of a pain point with mid-term Unicode updates. C++ standard
versions are susceptible to ABI breaks anyway, as much as we sometimes
strive to avoid them.

> I think it makes sense to specify a minimum Unicode version for each C++ standard and I would not be opposed to adding such specification. However, it is possible that the choice of Unicode version might not always remain a choice that implementors make. As we add additional Unicode features to the C++ standard, implementors might find it desirable to rely on system provided Unicode services (e.g., by an OS provided build of ICU), at least for some features. I think we might be best off having the choice of Unicode version be implementation-defined and use of a recent version a QoI matter.

That option feels at odds with how normative references work in the formal ISO world.
Please read the intro text in [intro.refs]; I'm not seeing liberty to have
a normative r

> The real question is whether Unicode behavior will differ for -std=c++23 mode for gcc 14.1 vs gcc 19.1. I sure hope that it would!

And I sure hope it doesn't, given the discussion we've had so far.
(This sentiment is quite strong at this point.)

Jens

Received on 2024-01-07 08:19:16