Re: [SG16] The Unicode Standard vs 10646 (which is defective)

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Sat, 6 Nov 2021 14:00:57 +0100
On 06/11/2021 05.24, Steve Downey via SG16 wrote:
> From 24.1 "Character Names List" of the Unicode Standard 14.0 (the upstream document that seems to be well maintained)
>
> Normative Aliases
> A normative character name alias is a formal, unique, and stable alternate name for a character. In limited circumstances, characters are given normative character name aliases where there is a defect in the character name. These normative aliases do not replace the character name, but rather allow users to refer formally to the character without requiring the use of a defective name. For more information, see Section 4.8, Name.
>
> Normative aliases which provide information about corrections to defective character names or which provide alternate names in wide use for a Unicode format character are printed in the character names list, preceded by a special symbol ※. Normative aliases serving other purposes, if listed, are shown by convention in all caps, following an “=”. Normative aliases of type “figment” for control codes are not listed. Normative aliases which represent commonly used abbreviations for control codes or format characters are shown in all caps, enclosed in parentheses. In contrast, informative aliases are shown in lowercase. For the definitive list of normative aliases, also including their type and suitable for machine parsing, see NameAliases.txt in the UCD.
>
>
> So, according to this, the parts in parentheses are abbreviations and the ALL CAPS names are normative aliases, which includes the ones listed for control codes.
> Some of this is captured in NamesList.txt, and some of it is captured in the software that normatively (for the Unicode Standard) processes that file.

Apparently.

Unicode 14 NameAliases.txt says

000A;LINE FEED;control
000A;NEW LINE;control
000A;END OF LINE;control

which seems to say that those three aliases are of the same kind.
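
A minimal sketch of that reading, assuming the three-field `code;alias;type` layout shown above and using only the sample lines quoted in this message (the helper name `aliases_by_type` is made up for illustration):

```python
from collections import defaultdict

# Sample lines copied from this message; the real NameAliases.txt has many more.
SAMPLE = """\
# comment lines start with '#'
000A;LINE FEED;control
000A;NEW LINE;control
000A;END OF LINE;control
"""

def aliases_by_type(text):
    """Map code point -> alias type -> list of aliases, in file order."""
    table = defaultdict(lambda: defaultdict(list))
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, alias, kind = line.split(";")
        table[int(code, 16)][kind].append(alias)
    return table

table = aliases_by_type(SAMPLE)
print(dict(table[0x000A]))
# {'control': ['LINE FEED', 'NEW LINE', 'END OF LINE']}
```

All three aliases land under the single type "control"; the data file itself draws no distinction between them.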

Yet, Unicode 14 CodeCharts.pdf says

000A <control>
= LINE FEED (LF)
= new line (NL)
= end of line (EOL)

which appears to say that "new line" and "end of line" are second-
class (informative) aliases, because they are lowercase.

We need to decide whether C++ should accept all three aliases,
or just the first one.


One more issue:

Unicode 14 NameAliases.txt says

# Note that no formal name alias for the ISO 6429 "BELL" is
# provided for U+0007, because of the existing name collision
# with U+1F514 BELL.

0007;ALERT;control
0007;BEL;abbreviation

Yet, Unicode 14 CodeCharts.pdf says

0007 <control>
= BELL


and about a thousand pages later

1F514 BELL
→ 0FC4 tibetan symbol dril bu
→ 2407 symbol for bell
→ 1F56D ringing bell


I've sent an e-mail to unicode_at_[hidden]

> I am not going to claim that we can read that out of 10646. I think 10646 is not actually fit for purpose. The description of the code charts is insufficient, and in any case is not machine readable which is actually required for fidelity here.
> I am intending to use the "normative aliases" for control codes as described in the Unicode standard to produce a table to be included in our standard. I believe this captures the intent of what we agreed.

I'd suggest using all "control" aliases from NameAliases.txt.
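
One way such a table could be generated, as a sketch only: it assumes the `code;alias;type` layout quoted earlier, runs on the sample lines from this message instead of the full file, and the helper name `control_aliases` is hypothetical.

```python
# Sample lines copied from this message; a real run would read the whole
# NameAliases.txt from the UCD instead.
SAMPLE = """\
000A;LINE FEED;control
000A;NEW LINE;control
000A;END OF LINE;control
0007;ALERT;control
0007;BEL;abbreviation
"""

def control_aliases(text):
    """Return (code point, alias) pairs whose alias type is 'control'."""
    rows = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, alias, kind = line.split(";")
        if kind == "control":
            rows.append((int(code, 16), alias))
    # Stable sort by code point only, so aliases keep their file order.
    return sorted(rows, key=lambda row: row[0])

for code, alias in control_aliases(SAMPLE):
    print(f"U+{code:04X}  {alias}")
```

On the sample this keeps ALERT and the three U+000A aliases and drops BEL, because BEL is typed "abbreviation" and not "control".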

Jens

Received on 2021-11-06 08:01:07