Until I can give a more detailed answer, here are the Unicode definitions (chapter 3):
D7 Abstract character: A unit of information used for the organization, control, or representation of textual data.
* When representing data, the nature of that data is generally symbolic as
opposed to some other kind of data (for example, aural or visual). Examples of
such symbolic data include letters, ideographs, digits, punctuation, technical
symbols, and dingbats.
* An abstract character has no concrete form and should not be confused with a glyph.
* An abstract character does not necessarily correspond to what a user thinks of
as a “character” and should not be confused with a grapheme.
* The abstract characters encoded by the Unicode Standard are known as Unicode abstract characters.
* Abstract characters not directly encoded by the Unicode Standard can often be
represented by the use of combining character sequences.
D11 Encoded character: An association (or mapping) between an abstract character and
a code point.
* An encoded character is also referred to as a coded character.
* While an encoded character is formally defined in terms of the mapping
between an abstract character and a code point, informally it can be thought of
as an abstract character taken together with its assigned code point.
* Occasionally, for compatibility with other standards, a single abstract character
may correspond to more than one code point—for example, “Å” corresponds
both to U+00C5 Å latin capital letter a with ring above and to U+212B
Å angstrom sign.
* A single abstract character may also be represented by a sequence of code
points—for example, latin capital letter g with acute may be represented by the
sequence <U+0047 latin capital letter g, U+0301 combining acute
accent>, rather than being mapped to a single code point.
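The last two bullets can be checked with a minimal Python sketch. It uses only the standard `unicodedata` module; the variable names are illustrative and not taken from the standard.

```python
import unicodedata

# Two distinct code points for what the standard treats as one abstract character.
a_ring = "\u00C5"      # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212B"    # ANGSTROM SIGN

# As encoded characters they compare unequal...
distinct_code_points = a_ring != angstrom

# ...but canonical normalization (NFC) maps both to the same code point.
same_after_nfc = (unicodedata.normalize("NFC", a_ring)
                  == unicodedata.normalize("NFC", angstrom))

# One abstract character spelled as a sequence of two code points.
g_acute_sequence = "\u0047\u0301"  # G followed by COMBINING ACUTE ACCENT
sequence_length = len(g_acute_sequence)

print(distinct_code_points, same_after_nfc, sequence_length)
```

A plain code point comparison therefore gives a different answer from a comparison of abstract characters, which is the crux of the terminology problem.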
These last two points are some of the issues.
Another is that unassigned code points, private-use code points, etc. can appear in C++ source
but are not abstract characters.
In fact, the Unicode Standard also says:
C1 A process shall not interpret a high-surrogate code point or a low-surrogate code point
as an abstract character.
* The high-surrogate and low-surrogate code points are designated for surrogate
code units in the UTF-16 character encoding form. They are unassigned to any
abstract character.
C2 A process shall not interpret a noncharacter code point as an abstract character.
* The noncharacter code points may be used internally, such as for sentinel values or delimiters, but should not be exchanged publicly.
C3 A process shall not interpret an unassigned code point as an abstract character.
* This clause does not preclude the assignment of certain generic semantics to
unassigned code points (for example, rendering with a glyph to indicate the
position within a character block) that allow for graceful behavior in the presence of code points that are outside a supported subset.
* Unassigned code points may have default property values. (See D26.)
* Code points whose use has not yet been designated may be assigned to abstract
characters in future versions of the standard. Because of this fact, due care in
the handling of generic semantics for such code points is likely to provide better robustness for implementations that may encounter data based on future versions of the standard.
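As a rough illustration of the code points covered by C1 to C3, here is a Python sketch. `classify_code_point` is a hypothetical helper, and the "unassigned" answer depends on the Unicode version shipped with the Python interpreter, so it is an approximation and not a conformance check.

```python
import unicodedata

def classify_code_point(cp: int) -> str:
    """Roughly classify a code point along the lines of C1-C3 (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"        # C1: high- and low-surrogate code points
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"     # C2: U+FDD0..U+FDEF and U+nFFFE/U+nFFFF
    category = unicodedata.category(chr(cp))
    if category == "Cn":
        return "unassigned"       # C3: relative to this interpreter's Unicode data
    if category == "Co":
        return "private-use"      # no abstract character assigned by the standard
    return "assigned"

print(classify_code_point(0xD800), classify_code_point(0xFFFE),
      classify_code_point(0xE000), classify_code_point(0x41))
```

All of these code points can be written in a source file, yet only the last category corresponds to encoded characters in the sense of D11.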
If we keep the UCN mechanism and the basic source character set, we could use the term "basic source character repertoire" instead of "basic source character set".
This would work because the members of the basic source character set represent unique characters.
But UCNs are basically a way to encode Unicode code points using a limited number of characters, which themselves have a representation in memory (the internal encoding).
I do not think that indirection is useful, but changing it hinges on how we want to refine the implementation-defined mapping in phase 1, especially for EBCDIC control characters.
And UCNs definitely represent Unicode code points, not abstract characters (there is an issue in phase 1, as it is specified that each source character maps to 1 UCN, whereas they should be allowed to map to 1 or more UCNs).
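To make the "UCNs name code points" reading concrete, here is a small Python sketch. `ucn_to_code_points` is a hypothetical helper that only recognizes the `\uXXXX` and `\UXXXXXXXX` spellings; it does not enforce the restrictions C++ places on which code points a UCN may name.

```python
import re

# \uXXXX or \UXXXXXXXX with hexadecimal digits, as in C++ universal-character-names.
_UCN = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

def ucn_to_code_points(text: str) -> list:
    """Return the code points named by the UCNs found in `text` (hypothetical helper)."""
    return [int(m.group(1) or m.group(2), 16) for m in _UCN.finditer(text)]

# One user-perceived character written as two UCNs: G + COMBINING ACUTE ACCENT.
g_acute = ucn_to_code_points(r"\u0047\u0301")

# Two UCNs for the same abstract character still name different code points.
a_ring = ucn_to_code_points(r"\u00C5")
angstrom = ucn_to_code_points(r"\u212B")

print(g_acute, a_ring, angstrom)
```

Each UCN yields exactly one code point, so a source character that has no single code point needs a sequence of UCNs.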
On 10/06/2020 23.23, Hubert Tong via SG16 wrote:
> I agree with Corentin's point that the strict use of abstract characters introduces problems where a coded character set contains multiple values for a single abstract character/contains characters that are canonically the same but assigned different values.
I have a hard time imagining such a thing. Can you give an example?
Yes, U+FA9A as described in https://en.wikipedia.org/wiki/Han_unification
has this situation with U+6F22.
These characters are distinct as members of a coded character set, but as abstract characters, I do not believe we can easily say the same.
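The U+FA9A / U+6F22 pair can be inspected with Python's standard `unicodedata` module. This is only a sketch of the canonical relationship between the two code points; the variable names are illustrative.

```python
import unicodedata

compat = "\uFA9A"   # CJK COMPATIBILITY IDEOGRAPH-FA9A
unified = "\u6F22"  # the corresponding CJK unified ideograph

# Distinct members of the coded character set.
distinct_as_coded = compat != unified

# U+FA9A carries a canonical (singleton) decomposition to U+6F22.
canonical_mapping = unicodedata.decomposition(compat)

# After normalization the two are indistinguishable.
equal_after_nfc = unicodedata.normalize("NFC", compat) == unified

print(distinct_as_coded, canonical_mapping, equal_after_nfc)
```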
SG16 mailing list