ISOCPP sg16 List: Re: Undated reference to Unicode Standard and UAX #29

From: Tom Honermann <tom_at_[hidden]>
Date: Sat, 6 Jan 2024 21:14:33 -0500

On 1/6/24 2:23 PM, Jonathan Wakely via SG16 wrote:
>
>
> On Sat, 6 Jan 2024, 17:37 Jens Maurer, <jens.maurer_at_[hidden]> wrote:
>
>
>
> On 06/01/2024 00.40, Jonathan Wakely wrote:
> >
> >
> > On Fri, 5 Jan 2024 at 20:46, Jens Maurer <jens.maurer_at_[hidden]> wrote:
> >
> >
> >
> > On 05/01/2024 18.35, Jonathan Wakely via SG16 wrote:
> > >
> > >
> > > > On Fri, 5 Jan 2024, 16:47 Mark de Wever, <koraq_at_[hidden]> wrote:
> > > >
> > > > On Fri, Jan 05, 2024 at 04:26:49PM +0000, Jonathan Wakely via SG16 wrote:
> > > > > Since the adoption of P2736, C++23 and the current C++ working
> > > > > draft just refer to "the Unicode Standard", with a URL referring
> > > > > to the latest version. We removed the bibliography entry for TR29
> > > > > revision 35. P2736 justifies this by saying that the revision of
> > > > > #29 included in Unicode 15 (revision 41) is just a bug fix, so
> > > > > there's no problem referring to that instead.
> > > > >
> > > > > That might have been true last year, but the current Unicode
> > > > > Standard (15.1.0) includes revision 43 of UAX #29, which makes
> > > > > significant changes to the extended grapheme cluster breaking
> > > > > rules. A new state machine is needed (and new lookup tables of
> > > > > properties) to implement rule GB9c. That's not just a bug fix,
> > > > > is it?
> > > > >
> > > > > Are C++ implementations expected to implement rule GB9c, despite
> > > > > it not being part of the standard when C++23 was published?
> > >
> > > > AFAIK this was indeed intended. The Unicode Standard moves at a
> > > > faster pace than the C++ Standard. This allows C++ to always use
> > > > the latest Unicode features and backport them to older language
> > > > versions.
> > >
> > >
> > > > Maybe the intent was to allow that, but the way I read it we
> > > > *require* that. Is there wording that says that an implementation
> > > > can choose which version to conform to?
> > > >
> > > > If not, what stops all existing implementations from becoming
> > > > non-conforming when a new version of Unicode gets published?
> >
> > Nothing, if the new version of Unicode changes behavior that C++
> > refers to (as seems to be the case here).
> >
> > My understanding is that this was intentional; ISO wants us to refer
> > to undated standards if possible, too.
> >
> > If we feel we should "freeze" the Unicode version for each C++
> > standard release, we could do that. Implementer feedback is certainly
> > welcome for that decision.
> >
> >
> > I think I'd prefer if we just somehow say that implementations can
> > define which Unicode standard they conform to. That way if a
> > conforming C++23 implementation uses Unicode 15.1.0 (the latest
> > version today) then it doesn't become non-conforming overnight when a
> > new Unicode standard is published. We can recommend that
> > implementations pin themselves to a recent Unicode standard, and even
> > recommend that implementations should (if possible) update to use
> > newer Unicode standards as they become available.
>
> Hm... That's not how normative references are supposed to work in an
> ISO world, I think ("pick the version you want" -- no), but we could
> certainly try that.
>
>
> I'd be fine with "C++23 refers to unicode 15.0.0", or "it is
> implementation defined which unicode standard a C++23 implementation
> conforms to", but I don't like the idea of C++23 being a moving target
> that changes meaning after publication.
>
> How do I even know which code points I can refer to with a
> universal-character-name in a portable C++23 program? Doesn't that
> depend on the unicode version?

The code points that can be specified via a universal-character-name
<http://eel.is/c++draft/lex.charset#nt:universal-character-name> don't
change, but additional names may become available for use in a
named-universal-character
<http://eel.is/c++draft/lex.charset#nt:named-universal-character>. The
Unicode stability policy ensures that such names never go away (even
when erroneously specified). See
https://www.unicode.org/policies/stability_policy.html#Name.

>
>
>
> > But there's no way that a discontinued/EOL compiler version can get
> > updated to a newer Unicode standard, which is what we seem to be
> > requiring as a condition of being a conforming implementation.
>
> I don't think this problem arises in practice. Do we have a conforming
> implementation of C++ (which happens to be C++20 at this point in
> time)? This will stop being conforming in a few weeks when C++23 is
> published, at which point C++20 is considered withdrawn / superseded.
> And when C++23 is published, it will stay in force for about three
> years.
>
>
> But compilers still offer support for previous standards. We don't say
> "sorry, C++23 is out, you can't use -std=c++17 now".
Wouldn't that be nice though :)
>
> Should I interpret "C++23 requires you to use the latest unicode
> standard" as only being true until 2026? That makes it tempting to not
> even try to conform to C++23 until 2026, when it stops being a moving
> target ;-)
>
> More seriously, I think what you're saying is that an implementation's
> "C++20 mode" is already a non-standard thing that has impl-defined
> meaning, because the standard only defines one version of C++ at a
> time. So an implementation can choose what its "C++20 mode" means, and
> pinning it to a version of unicode that was current in 2020 is OK.
>
> But I still find it unsettling that the definition of "C++" will
> change under our feet between 2023 and 2026. It effectively means that
> everything the unicode consortium does is immediately adopted as a DR
> against the current C++ standard with no involvement from WG21.

It is a fact that parts of the Unicode Standard will necessarily change
as a byproduct of continually adding and improving support for the
evolving collection of human languages. While we can choose to evolve
C++ in some lockstep form with the Unicode Standard, users will
nevertheless be exposed to differences in behavior at some point. It is
far from clear to me that implementors and programmers benefit by having
those changes happen at discrete points.

From an implementation perspective, having C++23 mode use one Unicode
version and C++26 mode use another version seems problematic, at least
for implementations that don't provide distinct standard library
implementations for each standard mode (and none of the major
implementations do).

As a user, I would like and expect newer compiler versions to provide
support for newer Unicode versions independent of whatever standard mode
I happen to compile my code with.

ABI concerns are just as relevant for minor compiler upgrades as they are
for major upgrades these days. Going forward, we should strive to ensure
that Unicode features that don't have a strong stability policy are
adequately hidden behind an ABI boundary. I don't recall having
discussed use of the grapheme breaking algorithm in std::format from an
ABI perspective.
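
For readers who haven't looked at GB9c, the following is a rough sketch
of why it changes observable behavior. It is not any implementation's
actual code: the property table is a tiny hand-written subset that I am
assuming matches the Unicode 15.1 data for a few Devanagari code points,
only rules GB9 and GB9c are modelled (InCB=Extend is ignored), and the
names Prop, classify, and count_clusters are hypothetical.

```cpp
#include <cstddef>
#include <string_view>

// Toy property classes. A real implementation would use tables
// generated from the Unicode Character Database.
enum class Prop { Other, Consonant, Linker };

// Assumed subset: U+0915..U+0939 (Devanagari KA..HA) as InCB=Consonant,
// U+094D (Devanagari virama) as InCB=Linker. The virama is also a
// Grapheme_Cluster_Break=Extend character, which is what GB9 keys on.
constexpr Prop classify(char32_t c) {
    if (c == U'\u094D') return Prop::Linker;
    if (c >= U'\u0915' && c <= U'\u0939') return Prop::Consonant;
    return Prop::Other;
}

// Counts extended grapheme clusters using only GB9 (no break before
// Extend) and, when gb9c is true, a simplified GB9c (no break between
// Consonant Linker+ and a following Consonant).
constexpr std::size_t count_clusters(std::u32string_view s, bool gb9c) {
    std::size_t clusters = 0;
    bool after_consonant = false;  // a Consonant started the current run
    bool linker_seen = false;      // and at least one Linker followed it
    for (std::size_t i = 0; i < s.size(); ++i) {
        const Prop p = classify(s[i]);
        bool joins = false;
        if (i != 0) {
            if (p == Prop::Linker) {
                joins = true;  // GB9: the virama is Extend
            } else if (gb9c && p == Prop::Consonant &&
                       after_consonant && linker_seen) {
                joins = true;  // GB9c: stay inside the conjunct
            }
        }
        if (!joins) ++clusters;

        // Advance the small state machine that GB9c requires.
        if (p == Prop::Consonant) {
            after_consonant = true;
            linker_seen = false;
        } else if (p == Prop::Linker) {
            if (after_consonant) linker_seen = true;
        } else {
            after_consonant = false;
            linker_seen = false;
        }
    }
    return clusters;
}
```

With these toy tables, count_clusters(U"\u0915\u094D\u0937", false)
yields 2 and count_clusters(U"\u0915\u094D\u0937", true) yields 1. Since
std::format estimates string width per extended grapheme cluster, a
difference like this can show up as different padding for the same
input.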

I think it makes sense to specify a minimum Unicode version for each C++
standard and I would not be opposed to adding such specification.
However, it is possible that the choice of Unicode version might not
always remain a choice that implementors make. As we add additional
Unicode features to the C++ standard, implementors might find it
desirable to rely on system provided Unicode services (e.g., by an OS
provided build of ICU), at least for some features. I think we might be
best off having the choice of Unicode version be implementation-defined
and use of a recent version a QoI matter.

>
>
> Is there a conforming implementation of C++23 already?
>
>
> Are you suggesting that because an implementation doesn't conform 100%
> to the standard yet, that it doesn't matter if remaining conforming is
> difficult/impractical?
>
> That feels like "until you conform, you don't get to complain that
> it's hard to conform" :-)
>
>
> Are compiler versions EOL'd in three years? At least for gcc, that
> doesn't seem to be the case.
>
>
> Yes, it's just over 3 years of upstream support and fixes for each GCC
> release. GCC 10.1 was released 2020-05 and then went EOL with 10.5 in
> 2023-07. GCC 11 was released 2021 and will be EOL late this year. But
> a close-to-EOL release is not going to receive major updates to make
> it use a new unicode standard. In practice, I'm probably not going to
> make such changes to a stable release branch at all. Once GCC 14.1 is
> released in a few months, it might stick with unicode 15.1.0 for its
> three year lifespan. So the window for making updates to a shipping
> release is smaller than 3 years.
>
> Some vendors continue to support EOL releases past the end of upstream
> support (e.g. in an enterprise distro like RHEL). But they're unlikely
> to make significant code changes, like updating to use a new unicode
> standard.

I agree this is what will happen in practice. However, it seems like a
tangent. The real question is whether Unicode behavior will differ for
-std=c++23 mode for gcc 14.1 vs gcc 19.1. I sure hope that it would!

Tom.

Received on 2024-01-07 02:14:36