ISOCPP sg16 List: Re: Undated reference to Unicode Standard and UAX #29

From: Robin Leroy <eggrobin_at_[hidden]>
Date: Sun, 7 Jan 2024 15:22:44 +0100

On Sun, 7 Jan 2024 at 13:45, Jens Maurer <jens.maurer_at_[hidden]> wrote:

> Well, the normative references of that standard refer to ISO/IEC 10646:2020
> specifically.
>
> https://www.iso.org/obp/ui/en/#iso:std:iso-iec:8652:ed-4:v1:en
>
> The text you linked to does say
> http://www.ada-auth.org/standards/22aarm/html/AA-2-1.html#p17
>
> "The categories defined above, as well as case mapping and folding, may be
> based on an implementation-defined version of ISO/IEC 10646 (2003 edition
> or later)."

Note however that WG 9 has approved for a future Corrigendum
<http://www.ada-auth.org/cgi-bin/cvsweb.cgi/ai22s/ai22-0073-1.html?rev=1.4> the
addition of a versionless reference to the Unicode Character Database, with
paragraph 2.1(17) being changed to

The categories defined above, as well as case mapping and folding, may be
based on an implementation-defined version of {the Unicode Character
Database (4.0 or later)}[ISO/IEC 10646 (2003 edition or later)].

(Braces mark the inserted text, brackets the deleted text.)
Note also that this AI has the class *binding interpretation*, the
equivalent of a C++ defect report.

> That limits the freedom to character classifications and case folding,
> but nothing else (in particular, if we were to follow that lead, it's
> not obvious that the named-universal-character repertoire can be extended
> by an implementation).

That would be because those are the only properties Ada uses from the UCD.
I should note that the General_Category property assignments do not have
stability guarantees, whereas the names do; the names are less problematic
here.
The expansion of the répertoire is covered by the reference to the
General_Category property (characters move out of
General_Category=Unassigned).

> Maybe providing volatile Unicode algorithms in the C++
> standard library isn't such a good idea, after all.
> (UTF-8 to UTF-16 is stable, but apparently some grapheme clustering isn't.)

Note that the word “stable” can mean many things
<https://www.unicode.org/policies/stability_policy.html>, from completely
immutable (encoding forms), to evolving while being immutable on the
assigned répertoire (normalization, character names), to evolving while
being backward compatible (identifiers).
These latter kinds of backward-compatible stability policies are designed
to facilitate the use of versionless references to the Unicode Standard and
frequent implementation upgrades, improving the interoperability across
implementations of text interchanged using an expanding répertoire.
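
As a rough illustration of the "completely immutable" end of that spectrum,
here is a minimal sketch of UTF-16 encoding for a single scalar value. The
function name encode_utf16 is hypothetical, not anything from the C++
standard library or ICU. The point is only that the result depends on the
numeric value alone, with no reference to the Unicode Character Database, so
it cannot change between Unicode versions, unlike property-driven algorithms
such as grapheme cluster breaking.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (illustration only): encode one Unicode scalar value
// as UTF-16. No property data is consulted, so the mapping is the same for
// every version of the Unicode Standard.
std::u16string encode_utf16(char32_t cp) {
    // Precondition: cp is a scalar value (in range, not a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: a high and a low surrogate.
        const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}
```

By contrast, an implementation of grapheme cluster breaking needs both the
property assignments and the rules of some particular version of UAX #29.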

We also try to move carefully even where we have no formal stability
guarantees. In particular we are aware that grapheme cluster segmentation
affects many implementers out there (Swift also has it deep in its standard
libraries), especially when it comes to the state machine (the property
assignments can change more freely).
The decision <https://www.unicode.org/L2/L2023/23076.htm#175-C26> to change
the grapheme cluster breaking state machine in Unicode Version 15.1 came
after the change was tested in the wild for four years as the ICU default,
see L2/23-079
<https://www.unicode.org/L2/L2023/23079-utc175-properties-recs.pdf> Section
5.5.

Though again it seems to me that there is no conformance requirement in C++
to use any version of UAX #29 grapheme cluster breaking.

On Sun, 7 Jan 2024 at 13:12, Jonathan Wakely <cxx_at_[hidden]> wrote:

> If I use the field width of the first code point in <some cluster that
> bears a resemblance to an extended grapheme cluster as described by
> Unicode> then that's still conforming.
>
In particular, and perhaps usefully for implementers, that reading means a
conformant implementation could rely on an ICU implementation that has
tailorings “from the future”, as ICU’s grapheme cluster breaking did from
2019 to 2023.

Best regards,

Robin Leroy

Received on 2024-01-07 14:23:05