Re: [SG16] Agreeing with Corentin's point re: problem with strict use of abstract characters

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Mon, 15 Jun 2020 00:27:06 +0200
On Mon, 15 Jun 2020 at 00:03, Tom Honermann <tom_at_[hidden]> wrote:

> On 6/14/20 4:57 PM, Corentin Jabot wrote:
>
>
>
> On Sun, 14 Jun 2020 at 22:45, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
>> On 14/06/2020 22.19, Corentin Jabot wrote:
>> >
>> >
>> > On Sun, 14 Jun 2020 at 21:55, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>>
>> > No, each code point in a sequence (given Unicode input) is a separate
>> > abstract character in my view (after combining surrogate pairs, of course).
>> >
>> >
>> > For example, diacritics, when preceded by a letter, are not considered
>> > abstract characters of their own.
>>
>> "Abstract character" is defined in https://www.unicode.org/glossary/ as
>> follows:
>>
>> "A unit of information used for the organization, control, or
>> representation of textual data."
>> (ISO 10646 does not appear to have a definition in its clause 3.)
>>
>> I'm not seeing a conflict between that definition and my view that a
>> diacritic, preceded by a letter, can be viewed as two different abstract
>> characters. I agree that the alternate viewpoint "single abstract character"
>> is not in conflict with the definition, either.
>>
>> What is your statement "are not considered abstract characters of their own"
>> (which seems to leave little room for alternatives) based on?
>>
>
> Right, the glossary is very much incomplete.
>
> The definition is given in Unicode 13.0, section 3.4
> ( http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf )
>
> Abstract character: A unit of information used for the organization,
> control, or representation of textual data.
> • When representing data, the nature of that data is generally symbolic as
>   opposed to some other kind of data (for example, aural or visual).
>   Examples of such symbolic data include letters, ideographs, digits,
>   punctuation, technical symbols, and dingbats.
> • An abstract character has no concrete form and should not be confused
>   with a glyph.
> • An abstract character does not necessarily correspond to what a user
>   thinks of as a “character” and should not be confused with a grapheme.
> • The abstract characters encoded by the Unicode Standard are known as
>   Unicode abstract characters.
> • *Abstract characters not directly encoded by the Unicode Standard can
>   often be represented by the use of combining character sequences.*
>
> My reading of that aligns with Jens' interpretation. An abstract
> character can be composed from abstract characters. The emphasized
> statement above appears to reaffirm that.
>

I sent a mail to the Unicode experts; we shall have an answer :)

>
> The definition of encoded character is also informative
>
> Encoded character: An association (or mapping) between an abstract
> character and a code point.
> • An encoded character is also referred to as a coded character.
> • While an encoded character is formally defined in terms of the mapping
>   between an abstract character and a code point, informally it can be
>   thought of as an abstract character taken together with its assigned
>   code point.
> • *Occasionally, for compatibility with other standards, a single abstract
>   character may correspond to more than one code point—for example, “Å”
>   corresponds both to U+00C5 Å LATIN CAPITAL LETTER A WITH RING ABOVE and
>   to U+212B Å ANGSTROM SIGN.*
> • *A single abstract character may also be represented by a sequence of
>   code points—for example, latin capital letter g with acute may be
>   represented by the sequence <U+0047 LATIN CAPITAL LETTER G, U+0301
>   COMBINING ACUTE ACCENT>, rather than being mapped to a single code point.*
>
> Likewise here, these examples indicate that an abstract character may have
> multiple encoded representations, but I don't read this as precluding the
> indicated code points reflecting abstract characters on their own.
>
> Tom.
>
>
>
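
For what it's worth, the "Å" example quoted above can be checked with a
short Python sketch. It assumes only Python's standard unicodedata module;
the dictionary labels and the code_points helper are illustrative names, not
Unicode terminology. It shows that the three encodings are canonically
equivalent. It does not settle whether each code point in the sequence is an
abstract character of its own.

```python
import unicodedata

# Three ways to encode what a reader sees as "Å".
# The keys are illustrative labels chosen for this sketch.
encodings = {
    "precomposed": "\u00C5",     # LATIN CAPITAL LETTER A WITH RING ABOVE
    "angstrom": "\u212B",        # ANGSTROM SIGN
    "sequence": "\u0041\u030A",  # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
}


def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Collect the distinct results of normalizing all three encodings.
# Canonically equivalent strings collapse to one NFC and one NFD form.
nfc_forms = {unicodedata.normalize("NFC", s) for s in encodings.values()}
nfd_forms = {unicodedata.normalize("NFD", s) for s in encodings.values()}

for label, text in encodings.items():
    nfc = unicodedata.normalize("NFC", text)
    print(label, code_points(text), "-> NFC", code_points(nfc))

print("distinct NFC forms:", len(nfc_forms))
print("distinct NFD forms:", len(nfd_forms))
```

All three inputs normalize to U+00C5 under NFC and to <U+0041, U+030A> under
NFD, which matches the quoted statement that one abstract character may
correspond to more than one code point or to a sequence of code points.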

Received on 2020-06-14 17:30:29