sg16

Re: Undated reference to Unicode Standard and UAX #29

From: Tom Honermann <tom_at_[hidden]>
Date: Sun, 7 Jan 2024 15:27:59 -0500
On 1/7/24 3:19 AM, Jens Maurer wrote:
>
> On 07/01/2024 03.14, Tom Honermann wrote:
>> The code points that can be specified via /universal-character-name/ <http://eel.is/c++draft/lex.charset#nt:universal-character-name> don't change, but additional names may become available for use in /named-universal-character/ <http://eel.is/c++draft/lex.charset#nt:named-universal-character>.
> That's a technically incorrect statement, because /universal-character-name/
> includes /named-universal-character/ per the grammar.
>
> The set of code points that can be specified via hex digits doesn't change
> depending on the Unicode version; agreed.
Indeed, this is why I stated that the set of code points that can be
specified via /universal-character-name/ doesn't change.
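
To illustrate the distinction, here is a minimal sketch that spells U+00E9 by hex digits and by name. The identifiers (by_hex4, by_hex8, by_name, spellings_agree) are made up for this example, and the named form is guarded by the __cpp_named_character_escapes feature-test macro because not every compiler or mode accepts it. The hex spellings do not depend on the Unicode version; whether a given name is accepted depends on which names the implementation knows.

```cpp
// Two hex spellings of U+00E9; these do not depend on the Unicode version.
constexpr char32_t by_hex4 = U'\u00E9';
constexpr char32_t by_hex8 = U'\U000000E9';

#if defined(__cpp_named_character_escapes)
// Named spelling; accepted only if the implementation knows this name.
constexpr char32_t by_name = U'\N{LATIN SMALL LETTER E WITH ACUTE}';
#endif

// True when every available spelling denotes the same code point.
constexpr bool spellings_agree() {
#if defined(__cpp_named_character_escapes)
    return by_hex4 == by_hex8 && by_hex4 == by_name;
#else
    return by_hex4 == by_hex8;
#endif
}
```

A name introduced by a later Unicode version would be rejected by an implementation that only knows an earlier version, even though the code point itself can always be written with hex digits.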
>
>> It is a fact that parts of the Unicode Standard will necessarily change as a byproduct of continually adding and improving support for the evolving collection of human languages. While we can choose to evolve C++ in some lockstep form with the Unicode Standard, users will nevertheless be exposed to differences in behavior at some point. It is far from clear to me that implementors and programmers benefit by having those changes happen at discrete points.
> For any other feature added to C++, we have expressly bought in to a model where
> such evolution (and exposure of differences) happens at discrete points, namely
> when a new C++ revision is released every three years.
The C++ standard includes in its bibliography
<http://eel.is/c++draft/bibliography> an undated reference to the IANA
Time Zone Database <https://www.iana.org/time-zones> with a linked
reference in [time.zone.general]p1
<http://eel.is/c++draft/time#zone.general-1>. I grant that is a
non-normative reference and the use of it differs somewhat from the
situation we face with referencing the Unicode Standard, but it is an
example of specified behavior that is intended to change at points that
are not aligned with the release of C++ standard revisions.
>
> Why are features added to Unicode any different, conceptually?

It is desirable that programs written and compiled for a particular C++
standard revision be able to correctly consume text produced in
accordance with newer Unicode standards subject to limitations imposed
by the interfaces that we specify. Requiring that programmers migrate
their code to newer C++ standards in order to take advantage of
corrections in newer Unicode standards would impose an unnecessary
hindrance.

>
>> From an implementation perspective, having C++23 mode use one Unicode version and C++26 mode use another version seems problematic, at least for implementations that don't provide distinct standard library implementations for each standard mode (as is the case for all major implementors).
> We've heard another implementer claim otherwise.
I don't think I've seen such a claim.
> #ifdef's in standard library implementations triggering on the desired
> standard mode seem quite common.

They certainly are common, but they are also not without cost. Some
standard library implementations assume or require the availability of
language features from newer standard revisions in older standard modes
so as to avoid unwanted #ifdef directives.
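
As a rough sketch of the two approaches under discussion, consider the following. Everything here is hypothetical: the namespace, the function names, and the version numbers are placeholders, not what any implementation ships.

```cpp
namespace hypothetical_lib {

// Approach A: select the Unicode data set from the standard mode.
// The numbers are placeholders chosen only to make the two branches differ.
constexpr int unicode_major_by_mode() {
#if __cplusplus >= 202302L   // C++23 mode or later
    return 16;
#else                        // earlier standard modes
    return 15;
#endif
}

// Approach B: one Unicode data set per library release, in every mode.
constexpr int unicode_major_single() {
    return 16;
}

}  // namespace hypothetical_lib
```

With approach A, translation units compiled in different standard modes see different definitions of the same function, and the library has to carry and test both data sets. With approach B, behavior follows the library release rather than the -std= option.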

>
>> As a user, I would like and expect newer compiler versions to provide support for newer Unicode versions independent of whatever standard mode I happen to compile my code with.
> I disagree, from a user perspective.
>
> As a user, I foremost want portability: A program working with compiler X claiming
> conformance to C++ZZ should work unchanged on a different compiler Y also claiming
> conformance to C++ZZ. That portability argument is the only reason we have WG21
> to start with. If compiler X gives me newer Unicode than compiler Y, I may have
> used newer named-universal-characters or relied on newer Unicode algorithm behavior
> when developing my program, just to see it break down when moving to compiler Y
> that hasn't gotten around to upgrading to the new Unicode version, yet.
I think these concerns are adequately addressed by specifying a minimum
Unicode version. Note that implementations are always free to accept
additional character names as a conforming extension (a diagnostic for
use of such names can be issued).
>
> That's bad, and in my view much worse than having the users of compiler X wait
> three years until they get the new feature. Again, compiler vendors have options
> to offer post-standard features to their audience if they so choose; everybody
> opting in to such options is aware that their code might be non-portable.

I think the attention the Unicode Consortium pays to backward
compatibility suffices here; their efforts are at least on par with
WG21's.

I view the change in behavior that spawned this email thread as more of
a bug fix than a new feature.

>
>> ABI concerns are just as relevant for minor compiler upgrades as they are for major upgrades these days. Going forward, we should strive to ensure that Unicode features that don't have a strong stability policy are adequately hidden behind an ABI boundary. I don't recall having discussed use of the grapheme breaking algorithm in std::format from an ABI perspective.
> That applies regardless of release cadence of changed Unicode features,
> but is more of a pain point with mid-term Unicode updates. C++ standard
> versions are susceptible to ABI breaks anyway, as much as we sometimes
> strive to avoid them.
Agreed.
>
>> I think it makes sense to specify a minimum Unicode version for each C++ standard and I would not be opposed to adding such specification. However, it is possible that the choice of Unicode version might not always remain a choice that implementors make. As we add additional Unicode features to the C++ standard, implementors might find it desirable to rely on system provided Unicode services (e.g., by an OS provided build of ICU), at least for some features. I think we might be best off having the choice of Unicode version be implementation-defined and use of a recent version a QoI matter.
> That option feels at odds with how normative references work in the formal ISO world.
> Please read the intro text in [intro.refs]; I'm not seeing liberty to have
> a normative r
>> The real question is whether Unicode behavior will differ for -std=c++23 mode for gcc 14.1 vs gcc 19.1. I sure hope that it would!
> And I sure hope it doesn't, given the discussion we've had so far.
> (This sentiment is quite strong at this point.)

Thank you for filing the CWG issue. There are clearly nuances and
perspectives that warrant additional discussion. I'm going to add this
topic as one of the agenda items for this week's SG16 meeting.

Tom.

>
> Jens
>

Received on 2024-01-07 20:28:01