SG16

Subject: Re: Agreeing with Corentin's point re: problem with strict use of abstract characters
From: Corentin Jabot (corentinjabot_at_[hidden])
Date: 2020-06-15 03:48:09


On Mon, 15 Jun 2020 at 00:27, Corentin Jabot <corentinjabot_at_[hidden]>
wrote:

>
>
> On Mon, 15 Jun 2020 at 00:03, Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 6/14/20 4:57 PM, Corentin Jabot wrote:
>>
>>
>>
>> On Sun, 14 Jun 2020 at 22:45, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>>
>>> On 14/06/2020 22.19, Corentin Jabot wrote:
>>> >
>>> >
>>> > On Sun, 14 Jun 2020 at 21:55, Jens Maurer <Jens.Maurer_at_[hidden]
>>> <mailto:Jens.Maurer_at_[hidden]>> wrote:
>>>
>>> > No, each code point in a sequence (given Unicode input) is a
>>> separate abstract character
>>> > in my view (after combining surrogate pairs, of course).
>>> >
>>> >
>>> > For example, diacritics, when preceded by a letter, are not considered
>>> > abstract characters of their own.
>>>
>>> "Abstract character" is defined in https://www.unicode.org/glossary/ as
>>> follows:
>>>
>>> "A unit of information used for the organization, control, or
>>> representation of textual data."
>>> (ISO 10646 does not appear to have a definition in its clause 3.)
>>>
>>> I'm not seeing a conflict between that definition and my view that a
>>> diacritic,
>>> preceded by a letter, can be viewed as two different abstract characters.
>>> I agree that the alternate viewpoint "single abstract character" is not
>>> in conflict with the definition, either.
>>>
>>> What is your statement "are not considered abstract characters of their
>>> own"
>>> (which seems to leave little room for alternatives) based on?
>>>
>>
>> Right, the glossary is very much incomplete.
>>
>> The definition is given in Unicode 13, section 3.4 (
>> http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf )
>>
>> Abstract character: A unit of information used for the organization,
>> control, or representation of textual data.
>> • When representing data, the nature of that data is generally symbolic as
>> opposed to some other kind of data (for example, aural or visual).
>> Examples of
>> such symbolic data include letters, ideographs, digits, punctuation,
>> technical
>> symbols, and dingbats.
>> • An abstract character has no concrete form and should not be confused
>> with a
>> glyph.
>> • An abstract character does not necessarily correspond to what a user
>> thinks of
>> as a “character” and should not be confused with a grapheme.
>> • The abstract characters encoded by the Unicode Standard are known as
>> Unicode abstract characters.
>>
>> *• Abstract characters not directly encoded by the Unicode Standard can
>> often be represented by the use of combining character sequences. *
>>
>> My reading of that aligns with Jens' interpretation. An abstract
>> character can be composed from abstract characters. The emphasized
>> statement above appears to reaffirm that.
>>
>
> I sent a mail to the Unicode experts; we shall have an answer :)
>

The experts agree with you:

*Now the abstract character A-diaeresis (Ä) is encoded by a single code
point and also has a canonically equivalent representation by a combining
sequence. In effect, the whole sequence "encodes" a single abstract
character, but that is formally not how Unicode defines it.*

*A diaeresis is a recognizable item of the writing system; if used as an
umlaut, it tends to act as a decoration of character that is more-or-less
seen as a new entity (particularly in Swedish) and less a modified letter
A. If used as a diaeresis, it acts more like a punctuation mark that has a
function of its own (forcing separate pronunciation). Even though it's
graphically applied to a vowel, it can be understood as its own abstract
character.*

*Treating the diaeresis as its own independent abstract character makes
logical and not just formal sense. That may not be the case equally for all
types of diacritical marks. However, since they can all be named, and thus
arguably exist as their own concepts at least on a descriptive level, it
becomes effectively a non-problem.*

*The way combining marks are treated in other scripts, they can all be on
different points of the scale as logically independent entities, and some
are even on different points of the scale in terms of graphically combining
(they may be graphically indistinguishable from regular spacing letters).*

*To recap, an "abstract" character is a conceptual character, something that
forms the atom of a writing system (smallest divisible particle) as viewed
from the process of encoding, which associates with it a single code point.
"Abstract" characters may exist that are not encoded; and some of them can
be analyzed as series of smaller abstract characters, and thus be
represented as code point sequences.*

*Some abstract characters are more like small molecules; they can be
encoded as such, or they can also have a more atomic sequence that
represents them. The rationale for allowing this dual nature is
historical compatibility, not logical necessity, hence the model is in some
ways not "pure" (just practical).*
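
To make the two representations concrete, here is a minimal C++ sketch. It is only an illustration, not part of the experts' reply. It spells A-diaeresis once as the single code point U+00C4 and once as the sequence <U+0041, U+0308>, and shows that the two differ as code point sequences even though they are canonically equivalent. The helper compose_a_diaeresis is hypothetical and knows only this one pair; it is not Unicode normalization, which would need a real library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Code points involved in the A-diaeresis example.
constexpr char32_t kLatinCapitalA = U'A';                 // U+0041
constexpr char32_t kCombiningDiaeresis = 0x0308;          // U+0308
constexpr char32_t kLatinCapitalAWithDiaeresis = 0x00C4;  // U+00C4

// The same abstract character, represented two ways.
inline const std::u32string precomposed{kLatinCapitalAWithDiaeresis};
inline const std::u32string decomposed{kLatinCapitalA, kCombiningDiaeresis};

// Hypothetical helper (an assumption for illustration): replaces the
// sequence <U+0041, U+0308> with U+00C4 and copies everything else as is.
// It handles this single pair only and is NOT a normalization algorithm.
std::u32string compose_a_diaeresis(std::u32string_view in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const bool is_pair = in[i] == kLatinCapitalA &&
                             i + 1 < in.size() &&
                             in[i + 1] == kCombiningDiaeresis;
        if (is_pair) {
            out.push_back(kLatinCapitalAWithDiaeresis);
            ++i;  // skip the combining mark that was just consumed
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}

// Self-check of the claims made above.
void check_a_diaeresis_example() {
    assert(precomposed.size() == 1);   // one code point
    assert(decomposed.size() == 2);    // base letter plus combining mark
    assert(precomposed != decomposed); // different code point sequences
    assert(compose_a_diaeresis(decomposed) == precomposed);
}
```

Calling check_a_diaeresis_example() from a test or from main exercises all four claims. A plain code point comparison treats the two spellings as different, which is why deciding what counts as "one abstract character" matters for wording.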

>
>> The definition of encoded character is also informative
>>
>> Encoded character: An association (or mapping) between an abstract
>> character and a code point.
>> • An encoded character is also referred to as a coded character.
>> • While an encoded character is formally defined in terms of the mapping
>> between an abstract character and a code point, informally it can be
>> thought of
>> as an abstract character taken together with its assigned code point.
>> *• Occasionally, for compatibility with other standards, a single abstract
>> character may correspond to more than one code point—for example, “Å”
>> corresponds both to U+00C5 Å latin capital letter a with ring above and to
>> U+212B Å angstrom sign.*
>> *• A single abstract character may also be
>> represented by a sequence of code points—for example, latin capital letter
>> g with acute may be represented by the sequence <U+0047 latin capital
>> letter g, U+0301 combining acute accent>, rather than being mapped to a
>> single code point.*
>>
>> Likewise here, these examples indicate that an abstract character may
>> have multiple encoded representations, but I don't read this as precluding
>> the indicated code points reflecting abstract characters on their own.
>>
>> Tom.
>>
>>
>>



SG16 list run by sg16-owner@lists.isocpp.org