Hey folks,

Addressing the issues with Unicode version references.

TLDR:

We can update:

UAX #29, Unicode Text Segmentation
The Unicode Standard Version 14.0, Core Specification

(hopefully editorially, but that second one is normative)

We should update "Unicode Standard Annex, UAX #31" as part of the rewriting of [uaxid].

We should separately resolve the inconsistency that arises from taking the names of code points for named escape sequences from ISO 10646 while taking their validity as identifiers from Unicode.

We should use either floating references or versioned references, and do so consistently.

=====

Bibliography

Unicode Standard Annex, UAX #29, Unicode Text Segmentation [online].

Revision 35; issued for Unicode 12.0.0

This is used in [format.string.std] for the purpose of width estimation.

The only non-editorial changes in that document between that revision and the latest one that apply to us are:

Moved surrogate code points from Control to XX.
Excluded prepended concatenation marks from Control.

Both are bug fixes.

We can update that reference.

=====

Bibliography:

The Unicode Standard Version 14.0, Core Specification

This is referenced exclusively in vprint_unicode:

If invoking the native Unicode API requires transcoding, implementations should substitute invalid code units with U+FFFD REPLACEMENT CHARACTER per The Unicode Standard Version 14.0 - Core Specification, Chapter 3.9

We can update that document; the chapter number would stay the same.

We do need to keep referring to this document, as ISO 10646 does not specify a replacement character mechanism.

We can update that reference
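
For readers who have not looked at that chapter, here is a minimal sketch of what "substitute invalid code units with U+FFFD" can look like for UTF-8 input. It assumes the "maximal subpart" practice described there (one U+FFFD per ill-formed prefix, then resume decoding at the offending byte). decode_utf8_lossy is a hypothetical helper of mine, not a standard library function and not necessarily what any implementation of vprint_unicode does.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: decode UTF-8 into UTF-32, emitting one U+FFFD for
// each maximal ill-formed subpart instead of failing.
std::u32string decode_utf8_lossy(std::string_view in) {
    constexpr char32_t replacement = U'\uFFFD';
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        if (b0 < 0x80) {  // ASCII
            out.push_back(static_cast<char32_t>(b0));
            ++i;
            continue;
        }
        // Sequence length and allowed range of the *second* byte, following
        // the well-formed UTF-8 byte sequence table of the Unicode Standard.
        int len = 0;
        unsigned char lo = 0x80, hi = 0xBF;
        char32_t cp = 0;
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2; cp = b0 & 0x1F;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3; cp = b0 & 0x0F;
            if (b0 == 0xE0) lo = 0xA0;  // reject overlong forms
            if (b0 == 0xED) hi = 0x9F;  // reject surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4; cp = b0 & 0x07;
            if (b0 == 0xF0) lo = 0x90;  // reject overlong forms
            if (b0 == 0xF4) hi = 0x8F;  // reject values above U+10FFFF
        } else {
            // Stray continuation byte, or a byte that can never start a sequence.
            out.push_back(replacement);
            ++i;
            continue;
        }
        std::size_t j = i + 1;
        bool ok = true;
        for (int k = 1; k < len; ++k, ++j) {
            if (j >= in.size()) { ok = false; break; }
            const unsigned char b = static_cast<unsigned char>(in[j]);
            if (b < lo || b > hi) { ok = false; break; }
            cp = (cp << 6) | (b & 0x3F);
            lo = 0x80; hi = 0xBF;  // later bytes are plain continuation bytes
        }
        out.push_back(ok ? cp : replacement);
        i = j;  // on failure, j is the offending byte, which gets re-examined
    }
    return out;
}
```

With this sketch, the bytes 61 F1 80 80 E1 80 C2 62 80 63 80 BF 64 decode to U+0061 U+FFFD U+FFFD U+FFFD U+0062 U+FFFD U+0063 U+FFFD U+FFFD U+0064. ISO 10646 gives no guidance on any of this, which is why the Unicode reference has to stay.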

====

Normative reference:

Unicode Standard Annex, UAX #44, Unicode Character Database.
Available from: http://www.unicode.org/reports/tr44/

Floating reference

This describes the list of properties (such as XID_Start, XID_Continue, Grapheme_Extend, General_Category) used in both the core and library wording.

None of the names, statuses, or possible values of these properties have changed.

Note that the value of these properties for individual codepoints is not governed by that annex, which just describes the list of properties as a whole, and their possible values.

====

Bibliography entry:

Unicode Standard Annex, UAX #31, Unicode Identifier and Pattern Syntax [online].
Revision 33; issued for Unicode 13.0.0.
Available from: https://www.unicode.org/reports/tr31/tr31-33.html

Changing that reference has no impact on [lex.identifier], which only refers to XID_Start and XID_Continue.

SG16 is currently either modifying [uaxid] or removing it. We should make sure [uaxid] conforms to the latest version of the annex and update the bibliography entry accordingly.

======

Normative reference:

The Unicode Standard, Derived Core Properties.

Available from: https://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt

This one is interesting. It's a floating reference, currently pointing to 15.0, and it is used in the grammar of identifiers.

Which is great.

The issue is that for the names of characters (as used in named escape sequences), we refer to ISO 10646, which is based on Unicode 13.

This means that

void f() {
    // Same character, spelled two ways; it was introduced in Unicode 14.
    auto \u{16A70} = 0;
    \N{TANGSA LETTER OZ} = 1;
}

is, I guess, technically ill-formed. I.e., the set of identifiers that can be spelled by name is smaller than the set of valid identifiers.

This is why there is a separate NB comment asking for character names and
the XID_ properties to be taken from the same version of Unicode,
instead of one from Unicode and one from ISO 10646.
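
To make the mismatch concrete, here is a toy model of the two lookups. Everything in it is hypothetical and for illustration only: the CharInfo struct, the two-entry table, and the function names are mine, and a real implementation would generate its data from the UCD. Versions are reduced to their major number to keep the sketch short.

```cpp
// Toy model of the version mismatch; all names here are hypothetical.
struct CharInfo {
    char32_t code_point;
    const char* name;
    int age;         // major Unicode version that introduced the character
    bool xid_start;  // XID_Start value in the latest UCD
};

// Hand-picked entries, not generated data.
constexpr CharInfo table[] = {
    {U'A',          "LATIN CAPITAL LETTER A", 1,  true},
    {U'\U00016A70', "TANGSA LETTER OZ",       14, true},
};

// A name is usable in \N{...} only if the document the names come from
// already knows the character.
constexpr bool nameable(const CharInfo& c, int names_version) {
    return c.age <= names_version;
}

// A character can start an identifier if it is XID_Start in the version of
// the UCD that the property reference resolves to.
constexpr bool identifier_start(const CharInfo& c, int props_version) {
    return c.xid_start && c.age <= props_version;
}

// Counts characters that are valid in identifiers but cannot be named.
constexpr int count_mismatches(int names_version, int props_version) {
    int n = 0;
    for (const CharInfo& c : table) {
        if (identifier_start(c, props_version) && !nameable(c, names_version)) {
            ++n;
        }
    }
    return n;
}

// Names from a Unicode 13 based document, properties from 15: one mismatch.
static_assert(count_mismatches(13, 15) == 1);
// Both taken from the same version: no mismatch.
static_assert(count_mismatches(15, 15) == 0);
```

The mismatch disappears as soon as both inputs come from the same version, whichever version that is, which is all the NB comment asks for.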