Subject: Re: Agreeing with Corentin's point re: problem with strict use of abstract characters
From: Corentin Jabot (corentinjabot_at_[hidden])
Date: 2020-06-10 17:44:59
Until I can give a more detailed answer, here are the Unicode definitions:
D7 Abstract character: A unit of information used for the organization,
control, or representation of textual data.
* When representing data, the nature of that data is generally symbolic as
opposed to some other kind of data (for example, aural or visual). Examples of
such symbolic data include letters, ideographs, digits, punctuation, technical
symbols, and dingbats.
* An abstract character has no concrete form and should not be confused with a
glyph.
* An abstract character does not necessarily correspond to what a user thinks of
as a "character" and should not be confused with a grapheme.
* The abstract characters encoded by the Unicode Standard are known as
Unicode abstract characters.
* Abstract characters not directly encoded by the Unicode Standard can often be
represented by the use of combining character sequences.
D11 Encoded character: An association (or mapping) between an abstract
character and a code point.
* An encoded character is also referred to as a coded character.
* While an encoded character is formally defined in terms of the mapping
between an abstract character and a code point, informally it can be thought of
as an abstract character taken together with its assigned code point.
* Occasionally, for compatibility with other standards, a single abstract
character may correspond to more than one code point; for example, "Å"
corresponds both to U+00C5 Å latin capital letter a with ring above and to
U+212B Å angstrom sign.
* A single abstract character may also be represented by a sequence of code
points; for example, latin capital letter g with acute may be represented by the
sequence <U+0047 latin capital letter g, U+0301 combining acute
accent>, rather than being mapped to a single code point.
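As a rough illustration of the two points just quoted, here is a small Python sketch using the standard unicodedata module. Python is used here only for convenience; the variable names are made up for this example, and the results depend on the Unicode version shipped with the interpreter.

```python
import unicodedata

# Two distinct code points that the quoted text treats as one abstract character.
a_ring = "\u00C5"        # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212B"      # ANGSTROM SIGN

# One abstract character spelled as a sequence of two code points.
g_acute_seq = "G\u0301"  # LATIN CAPITAL LETTER G + COMBINING ACUTE ACCENT


def nfc(s):
    """Return the NFC (canonical composition) form of s."""
    return unicodedata.normalize("NFC", s)


# As coded characters they differ; after canonical normalization they do not.
print(a_ring == angstrom)            # compares code points
print(nfc(a_ring) == nfc(angstrom))  # compares normalized forms

# The sequence is two code points long, whatever a user perceives.
print([unicodedata.name(c) for c in g_acute_seq])
```

Comparing code points and comparing normalized text give different answers, which is the ambiguity that makes "abstract character" hard to use as a precise term.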
These last two points are some of the issues.
Another is that unassigned code points, private-use code points, etc. can
appear in C++ source but are not abstract characters.
In fact, the Unicode Standard also says:
C1 A process shall not interpret a high-surrogate code point or a
low-surrogate code point
as an abstract character.
* The high-surrogate and low-surrogate code points are designated for surrogate
code units in the UTF-16 character encoding form. They are unassigned to any
abstract character.
C2 A process shall not interpret a noncharacter code point as an abstract
character.
* The noncharacter code points may be used internally, such as for sentinel
values or delimiters, but should not be exchanged publicly.
C3 A process shall not interpret an unassigned code point as an abstract
character.
* This clause does not preclude the assignment of certain generic semantics to
unassigned code points (for example, rendering with a glyph to indicate the
position within a character block) that allow for graceful behavior in the
presence of code points that are outside a supported subset.
* Unassigned code points may have default property values. (See D26.)
* Code points whose use has not yet been designated may be assigned to abstract
characters in future versions of the standard. Because of this fact, due care in
the handling of generic semantics for such code points is likely to provide
better robustness for implementations that may encounter data based on
future versions of the standard.
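To make C1 to C3 concrete, here is a minimal Python sketch that sorts code points into the categories those clauses talk about. The function name classify and its labels are hypothetical, and "unassigned" only means unassigned in the Unicode version bundled with the running Python, so results for a given code point can change between versions.

```python
import unicodedata


def classify(cp):
    """Return a coarse label for the code point cp (an int)."""
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"        # C1: not an abstract character
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"     # C2: not an abstract character
    category = unicodedata.category(chr(cp))
    if category == "Cn":
        return "unassigned"       # C3: not (yet) an abstract character
    if category == "Co":
        return "private use"      # meaning is defined by private agreement
    return "assigned"


for cp in (0x41, 0xD800, 0xFFFE, 0x0378, 0xE000):
    print(f"U+{cp:04X}: {classify(cp)}")
```

Every one of these values is a valid code point, but only the "assigned" ones correspond to a Unicode abstract character.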
If we keep the UCN mechanism and the basic source character set, we could
use the term basic source character repertoire instead of basic source
character set.
This would work as the members of the basic source character sets represent
But UCNs are basically a way to encode Unicode code points using a limited
number of characters which themselves have a representation in memory.
I do not think that indirection is useful, but changing that hinges on
how we want to refine the implementation-defined mapping in phase 1,
especially for EBCDIC control characters.
And UCNs definitely represent Unicode *code points*, not abstract
characters (there is an issue in phase 1, as it is specified that each
source character maps to 1 UCN, whereas they should be allowed to map to 1
or more UCNs).
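A minimal sketch of that last point, in Python for brevity: the hypothetical helpers to_ucn and to_ucns spell each code point of a string as a C++-style universal-character-name. For simplicity they emit a UCN for every code point, including members of the basic source character set, which an actual phase 1 mapping would not need to do.

```python
def to_ucn(cp):
    """Spell one code point (an int) as \\uXXXX or \\UXXXXXXXX."""
    return f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}"


def to_ucns(text):
    """Spell every code point of text as a UCN, in order."""
    return "".join(to_ucn(ord(ch)) for ch in text)


# One user-perceived character, two code points, hence two UCNs.
print(to_ucns("G\u0301"))

# A code point beyond U+FFFF needs the eight-digit form.
print(to_ucns("\U0001F600"))
```

The mapping is one UCN per code point, so a source character that is a combining sequence cannot map to a single UCN.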
On Thu, 11 Jun 2020 at 00:07, Hubert Tong via SG16 <sg16_at_[hidden]> wrote:
> On Wed, Jun 10, 2020 at 5:39 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>> On 10/06/2020 23.23, Hubert Tong via SG16 wrote:
>> > I agree with Corentin's point that the strict use of abstract
>> characters introduces problems where a coded character set contains
>> multiple values for a single abstract character/contains characters that
>> are canonically the same but assigned different values.
>> I have a hard time imagining such a thing. Can you give an example?
> Yes, U+FA9A as described in https://en.wikipedia.org/wiki/Han_unification
> has this situation with U+6F22.
> These characters are distinct as members of a coded character set, but as
> abstract characters, I do not believe we can easily say the same.
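The U+FA9A / U+6F22 pair from the quoted reply can be checked with Python's unicodedata module. This is only a sketch: the variable names are made up, and the values come from whatever Unicode data the interpreter ships with.

```python
import unicodedata

compat = "\uFA9A"   # CJK COMPATIBILITY IDEOGRAPH-FA9A
unified = "\u6F22"  # CJK UNIFIED IDEOGRAPH-6F22

# Distinct members of the coded character set.
print(compat == unified)

# The compatibility ideograph has a canonical decomposition to the unified one.
print(unicodedata.decomposition(compat))

# After NFC normalization the two strings are the same.
print(unicodedata.normalize("NFC", compat) == unified)
```

The two are different code points but canonically equivalent, so whether they count as the same character depends on whether one compares code points or normalized text.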