
SG16


Subject: Re: Agreeing with Corentin's point re: problem with strict use of abstract characters
From: Corentin Jabot (corentinjabot_at_[hidden])
Date: 2020-06-10 17:44:59


Until I can give a more detailed answer, here are the Unicode definitions
(chapter 3):

D7 Abstract character: A unit of information used for the organization,
control, or representation of textual data.
* When representing data, the nature of that data is generally symbolic as
opposed to some other kind of data (for example, aural or visual). Examples
of such symbolic data include letters, ideographs, digits, punctuation,
technical symbols, and dingbats.
* An abstract character has no concrete form and should not be confused
with a glyph.
* An abstract character does not necessarily correspond to what a user
thinks of as a “character” and should not be confused with a grapheme.
* The abstract characters encoded by the Unicode Standard are known as
Unicode abstract characters.
* Abstract characters not directly encoded by the Unicode Standard can
often be represented by the use of combining character sequences.

D11 Encoded character: An association (or mapping) between an abstract
character and a code point.
* An encoded character is also referred to as a coded character.
* While an encoded character is formally defined in terms of the mapping
between an abstract character and a code point, informally it can be
thought of as an abstract character taken together with its assigned
code point.
* Occasionally, for compatibility with other standards, a single abstract
character may correspond to more than one code point—for example, “Å”
corresponds both to U+00C5 Å latin capital letter a with ring above and
to U+212B Å angstrom sign.
* A single abstract character may also be represented by a sequence of code
points—for example, latin capital letter g with acute may be represented by
the sequence <U+0047 latin capital letter g, U+0301 combining acute
accent>, rather than being mapped to a single code point.
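
To make the last two notes concrete, here is a rough C++ sketch (mine, not
from the Unicode Standard). The names a_with_ring, angstrom_sign,
g_with_acute, same_code_points and check_d11_examples are made up for
illustration, and the comparison deliberately does no normalization:

```cpp
#include <cassert>
#include <string>

// Two code points that the quoted D11 note gives for the same abstract character.
const std::u32string a_with_ring   = U"\u00C5";  // LATIN CAPITAL LETTER A WITH RING ABOVE
const std::u32string angstrom_sign = U"\u212B";  // ANGSTROM SIGN

// One abstract character, represented here as a sequence of two code points.
const std::u32string g_with_acute = U"G\u0301";  // U+0047 followed by U+0301

// Hypothetical helper: compares code point sequences only, with no normalization.
bool same_code_points(const std::u32string& a, const std::u32string& b) {
    return a == b;
}

// Hypothetical self-check of the two observations above.
void check_d11_examples() {
    assert(!same_code_points(a_with_ring, angstrom_sign));  // distinct code points
    assert(g_with_acute.size() == 2);                       // two code points, one character
}
```

So a comparison at the code point level sees two different things for "Å",
and sees two elements where a reader sees one letter.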

These last two points are some of the issues.
Another is that unassigned code points, private use area code points, etc.
can appear in C++ source but are not abstract characters.

In fact, the Unicode Standard also says:

C1 A process shall not interpret a high-surrogate code point or a
low-surrogate code point as an abstract character.
* The high-surrogate and low-surrogate code points are designated for
surrogate code units in the UTF-16 character encoding form. They are
unassigned to any abstract character.
C2 A process shall not interpret a noncharacter code point as an abstract
character.
* The noncharacter code points may be used internally, such as for sentinel
values or delimiters, but should not be exchanged publicly.
C3 A process shall not interpret an unassigned code point as an abstract
character.
* This clause does not preclude the assignment of certain generic semantics
to unassigned code points (for example, rendering with a glyph to indicate
the position within a character block) that allow for graceful behavior in
the presence of code points that are outside a supported subset.
* Unassigned code points may have default property values. (See D26.)
* Code points whose use has not yet been designated may be assigned to
abstract characters in future versions of the standard. Because of this
fact, due care in the handling of generic semantics for such code points is
likely to provide better robustness for implementations that may encounter
data based on future versions of the standard.
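
As an illustration of which code points C1 and C2 are about, here is a small
sketch. The function names are hypothetical, and the ranges are the fixed
ones I am sure of (surrogates, noncharacters, private use). C3 is not
covered, because deciding whether a code point is unassigned needs the
Unicode Character Database for a specific version of the standard:

```cpp
#include <cassert>

// C1: surrogate code points, U+D800..U+DFFF.
constexpr bool is_surrogate(char32_t cp) {
    return cp >= 0xD800 && cp <= 0xDFFF;
}

// C2: noncharacters, U+FDD0..U+FDEF plus the last two code points of each plane.
constexpr bool is_noncharacter(char32_t cp) {
    return (cp >= 0xFDD0 && cp <= 0xFDEF) ||
           (cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE);
}

// Private use: U+E000..U+F8FF and the usable parts of planes 15 and 16.
constexpr bool is_private_use(char32_t cp) {
    return (cp >= 0xE000 && cp <= 0xF8FF) ||
           (cp >= 0xF0000 && cp <= 0xFFFFD) ||
           (cp >= 0x100000 && cp <= 0x10FFFD);
}

// Compile-time spot checks of the ranges above.
static_assert(is_surrogate(0xD800) && is_surrogate(0xDFFF), "surrogate bounds");
static_assert(is_noncharacter(0xFFFE) && is_noncharacter(0x10FFFF), "plane-final noncharacters");
static_assert(is_private_use(0xE000) && !is_private_use(0xFFFFE), "private use bounds");
```

All of these are valid code point values that a UCN could spell, which is
why "code point" and "abstract character" cannot be used interchangeably.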

If we keep the UCN mechanism and the basic source character set, we could
use the term "basic source character repertoire" instead of "basic source
character set".
This would work because the members of the basic source character set
represent unique characters.

But UCNs are basically a way to encode Unicode code points using a limited
number of characters, which themselves have a representation in memory
(internal encoding).
I do not think that indirection is useful, but changing it hinges on
how we want to refine the implementation-defined mapping in phase 1,
especially for EBCDIC control characters.

And UCNs definitely represent Unicode *code points*, not abstract
characters (there is an issue in phase 1, as it is specified that each
source character maps to 1 UCN, whereas it should be allowed to map to 1
or more UCNs).
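
A rough sketch of what "1 or more UCNs" means in practice. The helpers
to_ucn and to_ucns are hypothetical names of mine, not anything in the
standard or in an implementation. They only produce the spelling of a UCN as
text, and they ignore the language rules about which UCNs are allowed where:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Spell one code point as a UCN: \uXXXX up to U+FFFF, \UXXXXXXXX above that.
std::string to_ucn(char32_t cp) {
    char buf[11];  // backslash, 'u' or 'U', up to 8 hex digits, terminating NUL
    if (cp <= 0xFFFF)
        std::snprintf(buf, sizeof buf, "\\u%04lX", static_cast<unsigned long>(cp));
    else
        std::snprintf(buf, sizeof buf, "\\U%08lX", static_cast<unsigned long>(cp));
    return buf;
}

// Spell a sequence of code points as one UCN per code point.
std::string to_ucns(const std::u32string& s) {
    std::string out;
    for (char32_t cp : s) out += to_ucn(cp);
    return out;
}

// Hypothetical self-check: one "character" written as G + combining acute
// comes out as two UCNs.
void check_ucn_examples() {
    assert(to_ucns(U"G\u0301") == "\\u0047\\u0301");
}
```

The mapping is per code point, so a source character that is only
representable as a combining sequence necessarily becomes more than one UCN.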

Corentin

On Thu, 11 Jun 2020 at 00:07, Hubert Tong via SG16 <sg16_at_[hidden]>
wrote:

> On Wed, Jun 10, 2020 at 5:39 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
>> On 10/06/2020 23.23, Hubert Tong via SG16 wrote:
>> > I agree with Corentin's point that the strict use of abstract
>> > characters introduces problems where a coded character set contains
>> > multiple values for a single abstract character/contains characters that
>> > are canonically the same but assigned different values.
>>
>> I have a hard time imagining such a thing. Can you give an example?
>>
> Yes, U+FA9A as described in https://en.wikipedia.org/wiki/Han_unification
> has this situation with U+6F22.
> These characters are distinct as members of a coded character set, but as
> abstract characters, I do not believe we can easily say the same.
>
>
>>
>> Thanks,
>> Jens
>>
>> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>


