Re: Undated reference to Unicode Standard and UAX #29

From: Jonathan Wakely <cxx_at_[hidden]>
Date: Sun, 7 Jan 2024 07:54:39 +0000
On Sun, 7 Jan 2024, 02:14 Tom Honermann, <tom_at_[hidden]> wrote:

> On 1/6/24 2:23 PM, Jonathan Wakely via SG16 wrote:
>
>
>
> On Sat, 6 Jan 2024, 17:37 Jens Maurer, <jens.maurer_at_[hidden]> wrote:
>
>>
>>
>> On 06/01/2024 00.40, Jonathan Wakely wrote:
>> >
>> >
>> > On Fri, 5 Jan 2024 at 20:46, Jens Maurer <jens.maurer_at_[hidden]> wrote:
>> >
>> >
>> >
>> > On 05/01/2024 18.35, Jonathan Wakely via SG16 wrote:
>> > >
>> > >
>> > > On Fri, 5 Jan 2024, 16:47 Mark de Wever, <koraq_at_[hidden]> wrote:
>> > >
>> > > On Fri, Jan 05, 2024 at 04:26:49PM +0000, Jonathan Wakely via
>> SG16 wrote:
>> > > > Since the adoption of P2736 C++23 and the current C++
>> working draft just
>> > > > refer to "the Unicode Standard", with a URL referring to
>> the latest
>> > > > version. We removed the bibliography entry for TR29
>> revision 35. P2736
>> > > > gives the justification for this that the revision of #29
>> included in
>> > > > Unicode 15 (revision 41) is just a bug fix, so there's no
>> problem referring
>> > > > to that instead.
>> > > >
>> > > > That might have been true last year, but the current
>> Unicode Standard
>> > > > (15.1.0) includes revision 43 of UAX #29, which makes
>> significant changes
>> > > > to the extended grapheme cluster breaking rules. A new
>> state machine is
>> > > > needed (and new lookup tables of properties) to implement
>> rule GB9c. That's
>> > > > not just a bug fix, is it?
>> > > >
>> > > > Are C++ implementations expected to implement rule GB9c,
>> despite it not
>> > > > being part of the standard when C++23 was published?
>> > >
>> > > AFAIK this was indeed intended. The Unicode Standard moves at
>> a faster
>> > > pace than the C++ Standard. This allows C++ to always use the
>> latest
>> > > Unicode features and backport them to older language versions.
>> > >
>> > >
>> > > Maybe the intent was to allow that, but the way I read it we
>> *require* that. Is there wording that says that an implementation can
>> choose which version to conform to?
>> > >
>> > > If not, what stops all existing implementations from becoming
>> non-conforming when a new version of unicode gets published?
>> >
>> > Nothing, if the new version of Unicode changes behavior that C++
>> > refers to (as seems to be the case here).
>> >
>> > My understanding is that this was intentional; ISO wants us to refer
>> > to undated standard if possible, too.
>> >
>> > If we feel we should "freeze" the Unicode version for each C++
>> standard
>> > release, we could do that. Implementer feedback is certainly
>> welcome
>> > for that decision.
>> >
>> >
>> > I think I'd prefer if we just somehow say that implementations can
>> define which Unicode standard they conform to. That way if a conforming
>> C++23 implementation uses Unicode 15.1.0 (the latest version today) then it
>> doesn't become non-conforming overnight when a new Unicode standard is
>> published. We can recommend that implementations pin themselves to a recent
>> Unicode standard, and even recommend that implementations should (if
>> possible) update to use newer Unicode standards as they become available.
>>
>> Hm... That's not how normative references are supposed to work in an ISO
>> world,
>> I think ("pick the version you want" -- no), but we could certainly try
>> that.
>>
>
> I'd be fine with "C++23 refers to unicode 15.0.0", or "it is
> implementation defined which unicode standard a C++23 implementation
> conforms to", but I don't like the idea of C++23 being a moving target that
> changes meaning after publication.
>
> How do I even know which code points I can refer to with a
> universal-character-name in a portable C++23 program? Doesn't that depend
> on the unicode version?
>
> The code points that can be specified via *universal-character-name
> <http://eel.is/c++draft/lex.charset#nt:universal-character-name>* don't
> change, but additional names may become available for use in *named-universal-character
> <http://eel.is/c++draft/lex.charset#nt:named-universal-character>*.
>

Oops sorry, that's what I meant. The \N{FOO} form.

> The Unicode stability policy ensures that such names never go away (even
> when erroneously specified). See
> https://www.unicode.org/policies/stability_policy.html#Name.
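
To make the distinction concrete, here is a minimal sketch (not taken from the thread) of the two spellings. It assumes a compiler that defines the C++23 feature-test macro `__cpp_named_character_escapes`; the helper names `alpha_by_number` and `named_escapes_checked` are hypothetical. The numeric form depends only on the code point value, while the `\N{...}` form depends on the compiler knowing that character name, which in turn depends on the Unicode version its name tables were generated from.

```cpp
// Numeric form: valid for any Unicode scalar value, regardless of which
// Unicode version the compiler's name tables were generated from.
constexpr char32_t alpha_by_number() { return U'\u03B1'; }

// Named form (C++23): only compiles if the compiler knows this name.
// Guarded so the sketch still builds in pre-C++23 modes.
#if defined(__cpp_named_character_escapes)
static_assert(U'\N{GREEK SMALL LETTER ALPHA}' == alpha_by_number(),
              "both spellings denote U+03B1");
constexpr bool named_escapes_checked = true;
#else
constexpr bool named_escapes_checked = false;
#endif
```

A name introduced in a newer Unicode version than the compiler's tables would be rejected in the named form, even though the same code point could still be written numerically.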
>
>
>
>
>> > But there's no way that a discontinued/EOL compiler version can get
>> updated to a newer Unicode standard, which is what we seem to be requiring
>> as a condition of being a conforming implementation.
>>
>> I don't think this problem arises in practice. Do we have a conforming
>> implementation
>> of C++ (which happens to be C++20 at this point in time)? This will stop
>> being conforming
>> in a few weeks when C++23 is published, at which point C++20 is
>> considered withdrawn /
>> superseded. And when C++23 is published, it will stay in force for about
>> three years.
>>
>
> But compilers still offer support for previous standards. We don't say
> "sorry, C++23 is out, you can't use -std=c++17 now".
>
> Wouldn't that be nice though :)
>
>
> Should I interpret "C++23 requires you to use the latest unicode standard"
> as only being true until 2026? That makes it tempting to not even try to
> conform to C++23 until 2026, when it stops being a moving target ;-)
>
> More seriously, I think what you're saying is that an implementation's
> "C++20 mode" is already a non-standard thing that has impl-defined meaning,
> because the standard only defines one version of C++ at a time. So an
> implementation can choose what its "C++20 mode" means, and pinning it to a
> version of unicode that was current in 2020 is OK.
>
> But I still find it unsettling that the definition of "C++" will change
> under our feet between 2023 and 2026. It effectively means that everything
> the unicode consortium does is immediately adopted as a DR against the
> current C++ standard with no involvement from WG21.
>
> It is a fact that parts of the Unicode Standard will necessarily change as
> a byproduct of continually adding and improving support for the evolving
> collection of human languages. While we can choose to evolve C++ in some
> lockstep form with the Unicode Standard, users will nevertheless be exposed
> to differences in behavior at some point. It is far from clear to me that
> implementors and programmers benefit by having those changes happen at
> discrete points.
>
> From an implementation perspective, having C++23 mode use one Unicode
> version and C++26 mode use another version seems problematic, at least for
> implementations that don't provide distinct standard library
> implementations for each standard mode (as is the case for all major
> implementors).
>

Indeed.

> As a user, I would like and expect newer compiler versions to provide
> support for newer Unicode versions independent of whatever standard mode I
> happen to compile my code with.
>
Agreed, and recommending a minimum unicode version for each C++ standard
would work for that.

> ABI concerns are just as relevant for minor compiler upgrades as they are for
> major upgrades these days. Going forward, we should strive to ensure that
> Unicode features that don't have a strong stability policy are adequately
> hidden behind an ABI boundary. I don't recall having discussed use of the
> grapheme breaking algorithm in std::format from an ABI perspective.
>

In older versions of the algo from 2015 you could detect a break just by
inspecting two characters at a time. The current algo requires a state
machine, or at least some additional state to be tracked (at a minimum, a
pointer to the start of the current cluster), and hundreds of bytes of new
lookup tables.
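
As a rough illustration of why extra state is needed, here is a simplified sketch of rule GB9c only. It is not libstdc++'s implementation: the `InCB` enum is a toy stand-in for the Indic_Conjunct_Break property that a real implementation would read from generated lookup tables, and `gb9c_no_break` is a hypothetical helper name. The rule forbids a break before a consonant when it is preceded by a consonant followed by any run of Extend/Linker characters containing at least one Linker, so looking at two adjacent characters is not enough.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for the Indic_Conjunct_Break (InCB) property values.
enum class InCB { None, Consonant, Extend, Linker };

// For each position i, report whether GB9c alone forbids a break before
// element i. All other grapheme cluster rules are deliberately ignored.
std::vector<bool> gb9c_no_break(const std::vector<InCB>& props) {
    // State carried across characters; this is what a pairwise check lacks.
    enum class State { Idle, AfterConsonant, AfterLinker };
    State state = State::Idle;
    std::vector<bool> no_break(props.size(), false);
    for (std::size_t i = 0; i < props.size(); ++i) {
        switch (props[i]) {
        case InCB::Consonant:
            // Consonant {Extend|Linker}* Linker {Extend|Linker}* x Consonant
            no_break[i] = (state == State::AfterLinker);
            state = State::AfterConsonant;
            break;
        case InCB::Linker:
            // A linker only matters once a consonant has been seen.
            if (state != State::Idle) state = State::AfterLinker;
            break;
        case InCB::Extend:
            break;  // Extend keeps the current state unchanged.
        case InCB::None:
            state = State::Idle;  // Any other character ends the sequence.
            break;
        }
    }
    return no_break;
}
```

For the input Consonant, Linker, Extend, Consonant the last element is reported as "no break", even though its immediate neighbour is an Extend character. That decision cannot be made from the adjacent pair alone.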

> I think it makes sense to specify a minimum Unicode version for each C++
> standard and I would not be opposed to adding such specification. However,
> it is possible that the choice of Unicode version might not always remain a
> choice that implementors make. As we add additional Unicode features to the
> C++ standard, implementors might find it desirable to rely on system
> provided Unicode services (e.g., by an OS provided build of ICU), at least
> for some features. I think we might be best off having the choice of
> Unicode version be implementation-defined and use of a recent version a QoI
> matter.
>

That sounds reasonable.

In practice, implementations are not going to always be able to use the
very latest unicode standard, so we're just setting users up for
disappointment if we say that the standard requires/guarantees it.


>
> Is there a conforming implementation of C++23 already?
>>
>
> Are you suggesting that because an implementation doesn't conform 100% to
> the standard yet, that it doesn't matter if remaining conforming is
> difficult/impractical?
>
> That feels like "until you conform, you don't get to complain that it's
> hard to conform" :-)
>
>
> Are compiler versions EOL'd in three years? At least for gcc, that
>> doesn't seem to be
>> the case.
>>
>
> Yes, it's just over 3 years of upstream support and fixes for each GCC
> release. GCC 10.1 was released 2020-05 and then went EOL with 10.5 in
> 2023-07. GCC 11 was released in 2021 and will be EOL late this year. But a
> close-to-EOL release is not going to receive major updates to make it use a
> new unicode standard. In practice, I'm probably not going to make such
> changes to a stable release branch at all. Once GCC 14.1 is released in a
> few months, it might stick with unicode 15.1.0 for its three year lifespan.
> So the window for making updates to a shipping release is smaller than 3
> years.
>
> Some vendors continue to support EOL releases past the end of upstream
> support (e.g. in an enterprise distro like RHEL). But they're unlikely to
> make significant code changes, like updating to use a new unicode standard.
>
> I agree this is what will happen in practice. However, it seems like a
> tangent. The real question is whether Unicode behavior will differ for
> -std=c++23 mode for gcc 14.1 vs gcc 19.1. I sure hope that it would!
>

I hope so too :-)

Received on 2024-01-07 07:54:55