Re: Unicode References

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 20 Oct 2022 18:32:14 -0400
Thanks, Corentin.

For reference, this is related to NB comment FR-010.

Tom.

On 10/20/22 5:02 AM, Corentin via SG16 wrote:
> Hey folks,
>
>
> Addressing the issue of Unicode version references
>
> TLDR:
>
> We can update
>
> * UAX #29, Unicode Text Segmentation
> * The Unicode Standard Version 14.0, Core Specification
>
>
> (hopefully editorially, but that second one is normative)
>
> We should update "Unicode Standard Annex, UAX #31" as part of the
> rewriting of [uaxid].
>
> We should separately resolve the inconsistency that arises from deriving
> code point names for named escape sequences from ISO 10646 while deriving
> their validity in identifiers from Unicode.
>
> We should consistently use either floating references or versioned
> references.
>
>
> =====
> Bibliography
> Unicode Standard Annex, UAX #29, /Unicode Text Segmentation/ [online].
> <http://eel.is/c++draft/bibliography#sentence-2>
> Revision 35; issued for Unicode 12.0.0
>
> This is used in [format.string.std] for the purpose of width estimation.
> The only applicable non-editorial changes to that document between
> that revision and the latest one were:
>
> * Moved surrogate code points from Control
>   <https://www.unicode.org/reports/tr29/tr29-35.html#Control> to XX
> * Excluded prepended concatenation marks from Control
>   <https://www.unicode.org/reports/tr29/tr29-37.html#Control>.
>
> Both are bug fixes
>
> We can update that reference
>
> =====
> Bibliography:
>
> The Unicode Standard Version 14.0, /Core Specification/
> This is referenced exclusively in vprint_unicode
>
> If invoking the native Unicode API requires transcoding,
> implementations should substitute invalid code units with U+FFFD
> REPLACEMENT CHARACTER per The Unicode Standard Version 14.0 - Core
> Specification, Chapter 3.9
>
> We can update that document reference; the chapter number would stay
> the same. We need to refer to this document because ISO 10646 does not
> specify a replacement character mechanism.
>
> *We can update that reference*
> ====
> Normative reference:
>
> Unicode Standard Annex, UAX #44, Unicode Character Database.
> Available from: http://www.unicode.org/reports/tr44/
>
> *Floating reference*
>
> This describes the list of properties (such
> as XID_Start, XID_Continue, Grapheme_Extend, General_Category) used
> in both the core and library wording.
>
> Neither the names, statuses, nor possible values of these properties
> have changed.
>
> Note that the values of these properties for individual code points are
> not governed by that annex, which just describes the list of
> properties as a whole and their possible values.
>
> ====
> Bibliography entry:
>
> Unicode Standard Annex, UAX #31, Unicode Identifier and Pattern Syntax
> [online].
> Revision 33; *issued for Unicode 13.0.0*.
> Available from: https://www.unicode.org/reports/tr31/tr31-33.html
>
> Changing that reference has no impact on [lex.identifier], which
> refers only to XID_Start and XID_Continue. SG16 is currently either
> modifying [uaxid] or removing it. We should make sure [uaxid] conforms
> to the latest version of the annex and update the bibliography entry
> accordingly.
>
> ======
> Normative reference:
> The Unicode Standard, /Derived Core Properties/.
> <http://eel.is/c++draft/full#intro.refs-1.13.sentence-2>
> Available from:
> https://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt
>
> This one is interesting. It's a floating reference, currently pointing
> to Unicode 15.0, and it is used in the grammar of identifiers, which is
> great.
>
> The issue is that for the names of characters used in identifiers, we
> refer to ISO 10646, which is based on Unicode 13.
>
> This means that
>
> void f() {
>     // The same character spelled two ways; it was introduced in Unicode 14.
>     auto \u{16A70} = 0;
>     \N{TANGSA LETTER OZ} = 1;
> }
>
> is, I guess, technically ill-formed. I.e., the set of identifiers that
> can be spelled by name is smaller than the set of valid identifiers.
>
> This is why there is a separate NB comment asking for the character
> names used in identifiers and the XID_ properties to be taken from the
> same version of Unicode, instead of one from Unicode and one from
> ISO 10646.

Received on 2022-10-20 22:32:16