On 1/7/24 3:19 AM, Jens Maurer wrote:


On 07/01/2024 03.14, Tom Honermann wrote:

The code points that can be specified via /universal-character-name <http://eel.is/c++draft/lex.charset#nt:universal-character-name>/ don't change, but additional names may become available for use in /named-universal-character <http://eel.is/c++draft/lex.charset#nt:named-universal-character>/.

That's a technically incorrect statement, because /universal-character-name/
includes /named-universal-character/ per the grammar.

The set of code points that can be specified via hex digits doesn't change
depending on the Unicode version; agreed.

Indeed, this is why I stated that the set of code points that can be specified via universal-character-name doesn't change.
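
To make the distinction concrete, here is a minimal sketch of the two spellings. The variable names are mine and purely illustrative. The named form is guarded by the `__cpp_named_character_escapes` feature-test macro because it is only available where the C++23 feature is implemented.

```cpp
#include <cassert>

// Hex form: the code point is spelled numerically, so the set of
// expressible code points does not depend on the Unicode version.
constexpr char32_t e_acute_hex = U'\u00E9';

#if defined(__cpp_named_character_escapes)
// Named form (C++23): accepted only if the implementation's Unicode
// character name data contains this name.
constexpr char32_t e_acute_named = U'\N{LATIN SMALL LETTER E WITH ACUTE}';
static_assert(e_acute_hex == e_acute_named, "both spell U+00E9");
#endif
```

A name introduced in a later Unicode version would be rejected by an implementation whose name data predates it, while the hex spelling of the same code point would still be accepted.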

It is a fact that parts of the Unicode Standard will necessarily change as a byproduct of continually adding and improving support for the evolving collection of human languages. While we can choose to evolve C++ in some lockstep form with the Unicode Standard, users will nevertheless be exposed to differences in behavior at some point. It is far from clear to me that implementors and programmers benefit by having those changes happen at discrete points.

For any other feature added to C++, we have expressly bought in to a model where
such evolution (and exposure of differences) happens at discrete points, namely
when a new C++ revision is released every three years.

The C++ standard includes in its bibliography an undated reference to the IANA Time Zone Database with a linked reference in [time.zone.general]p1. I grant that is a non-normative reference and the use of it differs somewhat from the situation we face with referencing the Unicode Standard, but it is an example of specified behavior that is intended to change at points that are not aligned with the release of C++ standard revisions.


Why are features added to Unicode any different, conceptually?

It is desirable that programs written and compiled for a particular C++ standard revision be able to correctly consume text produced in accordance with newer Unicode standards subject to limitations imposed by the interfaces that we specify. Requiring that programmers migrate their code to newer C++ standards in order to take advantage of corrections in newer Unicode standards would impose an unnecessary hindrance.

From an implementation perspective, having C++23 mode use one Unicode version and C++26 mode use another version seems problematic, at least for implementations that don't provide distinct standard library implementations for each standard mode (and no major implementor does).

We've heard another implementer claim otherwise.

I don't think I've seen such a claim.

#ifdef's in standard library implementations triggering on the desired
standard mode seem quite common.

They certainly are common, but they are also not without cost. Some standard library implementations assume or require the availability of language features from newer standard revisions in older standard modes so as to avoid unwanted #ifdef directives.
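
As a rough illustration of the kind of conditional under discussion, here is a sketch that selects Unicode data by standard mode. The namespace, the names, and the generation numbers are hypothetical placeholders and are not taken from any real standard library.

```cpp
#include <cassert>

namespace example {  // hypothetical names, for illustration only

// Placeholder identifiers for two sets of Unicode property tables.
// The numbers are arbitrary labels, not real Unicode versions.
constexpr int older_tables = 1;
constexpr int newer_tables = 2;

// Keying the choice on the standard mode means that both data sets
// have to be shipped, tested, and kept consistent.
#if __cplusplus > 202302L
constexpr int active_tables = newer_tables;
#else
constexpr int active_tables = older_tables;
#endif

}  // namespace example
```

The cost is that every such conditional multiplies the configurations that have to be tested.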

As a user, I would like and expect newer compiler versions to provide support for newer Unicode versions independent of whatever standard mode I happen to compile my code with.

I disagree, from a user perspective.

As a user, I foremost want portability: A program working with compiler X claiming
conformance to C++ZZ should work unchanged on a different compiler Y also claiming
conformance to C++ZZ.  That portability argument is the only reason we have WG21
to start with.  If compiler X gives me newer Unicode than compiler Y, I may have
used newer named-universal-characters or relied on newer Unicode algorithm behavior
when developing my program, just to see it break down when moving to compiler Y
that hasn't gotten around to upgrading to the new Unicode version yet.

I think these concerns are adequately addressed by specifying a minimum Unicode version. Note that implementations are always free to accept additional character names as a conforming extension (a diagnostic for use of such names can be issued).


That's bad, and in my view much worse than having the users of compiler X wait
three years until they get the new feature.  Again, compiler vendors have options
to offer post-standard features to their audience if they so choose; everybody
opting in to such options is aware that their code might be non-portable.

I think the attention placed on backward compatibility by the Unicode Consortium suffices here; I think their efforts are at least on par with WG21.

I view the change in behavior that spawned this email thread as more of a bug fix than a new feature.

ABI concerns are just as relevant for minor compiler upgrades as they are for major upgrades these days. Going forward, we should strive to ensure that Unicode features that don't have a strong stability policy are adequately hidden behind an ABI boundary. I don't recall having discussed use of the grapheme breaking algorithm in std::format from an ABI perspective.
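
A minimal sketch of what "hidden behind an ABI boundary" could look like follows. The function name and namespace are hypothetical, and the body is a stand-in that counts UTF-8 code points rather than grapheme clusters. The point is only that callers see a declaration, not the Unicode tables.

```cpp
#include <cassert>
#include <cstddef>

namespace example {  // hypothetical; not a real library interface

// "Header" part: only this declaration is visible to callers, so no
// Unicode data is inlined into user code.
std::size_t estimated_width(const char* utf8, std::size_t len);

}  // namespace example

// "Library" part: this would live in a separately compiled library that
// can move to newer Unicode data without recompiling callers.
// Stand-in logic: counts UTF-8 lead bytes (code points), NOT grapheme
// clusters; a real implementation would consult Unicode property tables.
std::size_t example::estimated_width(const char* utf8, std::size_t len) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < len; ++i) {
        // Continuation bytes have the bit pattern 10xxxxxx; skip them.
        if ((static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

If the tables were instead exposed through inline functions or templates in a header, each translation unit could end up baking in whichever Unicode version it was compiled against.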

That applies regardless of release cadence of changed Unicode features,
but is more of a pain point with mid-term Unicode updates.  C++ standard
versions are susceptible to ABI breaks anyway, as much as we sometimes
strive to avoid them.

Agreed.

I think it makes sense to specify a minimum Unicode version for each C++ standard and I would not be opposed to adding such specification. However, it is possible that the choice of Unicode version might not always remain a choice that implementors make. As we add additional Unicode features to the C++ standard, implementors might find it desirable to rely on system provided Unicode services (e.g., by an OS provided build of ICU), at least for some features. I think we might be best off having the choice of Unicode version be implementation-defined and use of a recent version a QoI matter.

That option feels at odds with how normative references work in the formal ISO world.
Please read the intro text in [intro.refs]; I'm not seeing liberty to have
a normative r

The real question is whether Unicode behavior will differ for -std=c++23 mode for gcc 14.1 vs gcc 19.1. I sure hope that it would!

And I sure hope it doesn't, given the discussion we've had so far.
(This sentiment is quite strong at this point.)

Thank you for filing the CWG issue. There are clearly nuances and perspectives that warrant additional discussion. I'm going to add this topic as one of the agenda items for this week's SG16 meeting.

Tom.


Jens