sg16

Unicode References

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 20 Oct 2022 11:02:02 +0200
Hey folks,


This addresses the issues with our references to Unicode versions.

TLDR:

We can update


   - UAX #29, Unicode Text Segmentation
   - The Unicode Standard Version 14.0, Core Specification


(hopefully editorially, but that second one is normative)

We should update "Unicode Standard Annex, UAX #31" as part of the
rewriting of [uaxid].

We should separately resolve the issue that deriving the name of codepoints
for named escape sequences from ISO and their validity as identifiers from
Unicode leads to inconsistencies.

We should use either floating references or versioned references
throughout, rather than mixing the two.


=====
Bibliography
Unicode Standard Annex, UAX #29, *Unicode Text Segmentation* [online].
<http://eel.is/c++draft/bibliography#sentence-2>
Revision 35; issued for Unicode 12.0.0

This is used in [format.string.std] for the purpose of width estimation.
The only applicable non-editorial changes in that document between that
revision and the latest one were:

   - Moved surrogate code points from *Control
   <https://www.unicode.org/reports/tr29/tr29-35.html#Control>* to XX
   - Excluded prepended concatenation marks from Control
   <https://www.unicode.org/reports/tr29/tr29-37.html#Control>.

Both are bug fixes.

We can update that reference.

=====
Bibliography:

The Unicode Standard Version 14.0, *Core Specification*

This is referenced exclusively in vprint_unicode:

If invoking the native Unicode API requires transcoding, implementations
should substitute invalid code units with U+FFFD REPLACEMENT CHARACTER per
The Unicode Standard Version 14.0 - Core Specification, Chapter 3.9.

We can update that document; the chapter number would stay the same.
We do need to refer to this document, as ISO 10646 does not specify a
replacement character mechanism.
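
For illustration only, here is a minimal sketch of what that substitution
can look like in the simplest case, UTF-16 input, where the only invalid
code units are unpaired surrogates. The function name
utf16_to_utf32_lossy is hypothetical and is not taken from the standard
or from any implementation; a real vprint_unicode would more likely be
transcoding from UTF-8, which needs more care than this.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (an assumption for illustration, not a standard API).
// Decodes UTF-16 to UTF-32, emitting U+FFFD for every unpaired surrogate.
std::u32string utf16_to_utf32_lossy(std::u16string_view in) {
    constexpr char32_t replacement = U'\uFFFD';
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t unit = in[i];
        const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        const bool is_low = unit >= 0xDC00 && unit <= 0xDFFF;
        if (!is_high && !is_low) {
            // Ordinary BMP code unit: it is its own code point.
            out.push_back(unit);
        } else if (is_high && i + 1 < in.size() &&
                   in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Well-formed surrogate pair: combine into one code point.
            const char32_t low = in[i + 1];
            out.push_back(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
            ++i;
        } else {
            // Lone high or lone low surrogate: substitute U+FFFD.
            out.push_back(replacement);
        }
    }
    return out;
}
```

Each invalid code unit is replaced individually, so valid text on either
side of it is preserved.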

*We can update that reference*
====
Normative reference:

Unicode Standard Annex, UAX #44, Unicode Character Database.
Available from: http://www.unicode.org/reports/tr44/
*Floating reference*

This describes the list of properties (such as
XID_Start, XID_Continue, Grapheme_Extend, General_Category) used in both
the core and library wording.

None of the names, statuses, or possible values of these properties have
changed.

Note that the values of these properties for individual code points are not
governed by that annex, which just describes the list of properties as a
whole, and their possible values.

====
Bibliography entry:

Unicode Standard Annex, UAX #31, Unicode Identifier and Pattern Syntax
[online].
Revision 33; *issued for Unicode 13.0.0*.
Available from: https://www.unicode.org/reports/tr31/tr31-33.html

Changing that reference has no impact on [lex.identifier], which only
refers to XID_Start and XID_Continue. SG16 is currently either modifying
[uaxid] or removing it. We should make sure [uaxid] conforms to the latest
version of the annex and update the bibliography entry accordingly.

======
Normative reference:
The Unicode Standard, *Derived Core Properties*.
<http://eel.is/c++draft/full#intro.refs-1.13.sentence-2>
Available from:
https://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt

This one is interesting. It's a floating reference, currently pointing to
Unicode 15.0, and it is used in the grammar of identifiers.
Which is great.

The issue is that for the names of characters in named escape sequences,
we refer to ISO 10646, which is based on Unicode 13.

This means that

void f() {
    // The same character, introduced in Unicode 14, spelled two ways.
    auto \u{16A70} = 0;
    \N{TANGSA LETTER OZ} = 1;
}

is, I guess, technically ill-formed. I.e., the set of identifier characters
that can be spelled by name is smaller than the set of valid identifier
characters.

This is why there is a separate NB comment asking for the character names
and the XID_ properties to be taken from the same version of Unicode,
instead of one from Unicode and one from ISO 10646.

Received on 2022-10-20 09:02:16