Re: [SG16] Agreeing with Corentin's point re: problem with strict use of abstract characters

From: Tom Honermann <tom_at_[hidden]>
Date: Sun, 14 Jun 2020 18:02:58 -0400

On 6/14/20 4:57 PM, Corentin Jabot wrote:
>
>
> On Sun, 14 Jun 2020 at 22:45, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
> On 14/06/2020 22.19, Corentin Jabot wrote:
> >
> >
> > On Sun, 14 Jun 2020 at 21:55, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
> > No, each code point in a sequence (given Unicode input) is a separate
> > abstract character in my view (after combining surrogate pairs, of course).
> >
> >
> > For example, diacritics, when preceded by a letter, are not considered
> > abstract characters of their own.
>
> "Abstract character" is defined in
> https://www.unicode.org/glossary/ as follows:
>
> "A unit of information used for the organization, control, or
> representation of textual data."
> (ISO 10646 does not appear to have a definition in its clause 3.)
>
> I'm not seeing a conflict between that definition and my view that a
> diacritic, preceded by a letter, can be viewed as two different abstract
> characters. I agree that the alternate viewpoint "single abstract
> character" is not in conflict with the definition, either.
>
> What is your statement "are not considered abstract characters of their
> own" (which seems to leave little room for alternatives) based on?
>
>
> Right, the glossary is very much incomplete.
>
> The definition is given in Unicode 13.0, section 3.4
> (http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf):
>
> Abstract character: A unit of information used for the organization,
> control, or representation of textual data.
> • When representing data, the nature of that data is generally symbolic as
> opposed to some other kind of data (for example, aural or visual).
> Examples of such symbolic data include letters, ideographs, digits,
> punctuation, technical symbols, and dingbats.
> • An abstract character has no concrete form and should not be confused
> with a glyph.
> • An abstract character does not necessarily correspond to what a user
> thinks of as a “character” and should not be confused with a grapheme.
> • The abstract characters encoded by the Unicode Standard are known as
> Unicode abstract characters.
> *• Abstract characters not directly encoded by the Unicode Standard
> can often be represented by the use of combining character sequences.*
>
My reading of that aligns with Jens' interpretation. An abstract
character can be composed from abstract characters. The emphasized
statement above appears to reaffirm that.
>
> The definition of encoded character is also informative:
>
> Encoded character: An association (or mapping) between an abstract
> character and a code point.
> • An encoded character is also referred to as a coded character.
> • While an encoded character is formally defined in terms of the mapping
> between an abstract character and a code point, informally it can be
> thought of as an abstract character taken together with its assigned
> code point.
> • *Occasionally, for compatibility with other standards, a single
> abstract character
> may correspond to more than one code point—for example, “Å” corresponds
> both to U+00C5 Å latin capital letter a with ring above and to U+212B
> Å angstrom sign.
> • A single abstract character may also be represented by a sequence of
> code
> points—for example, latin capital letter g with acute may be
> represented by the
> sequence <U+0047 latin capital letter g, U+0301 combining acute
> accent>, rather than being mapped to a single code point.*
>
Likewise here, these examples indicate that an abstract character may
have multiple encoded representations, but I don't read this as
precluding the indicated code points reflecting abstract characters on
their own.
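
A minimal sketch of the two quoted examples, assuming Python's standard
unicodedata module. The helper names (describe, canonically_equivalent)
are hypothetical and exist only for this illustration. It shows that
U+00C5 and U+212B are distinct code points that are canonically
equivalent, and that each code point in <U+0047, U+0301> carries its own
character name. It does not settle which reading of "abstract character"
is the right one.

```python
import unicodedata

A_RING = "\u00C5"        # LATIN CAPITAL LETTER A WITH RING ABOVE
ANGSTROM = "\u212B"      # ANGSTROM SIGN
G_ACUTE_SEQ = "G\u0301"  # U+0047 followed by U+0301


def describe(text):
    """Return (code point, character name) for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


def canonically_equivalent(a, b):
    """Compare two strings after canonical decomposition (NFD)."""
    nfd_a = unicodedata.normalize("NFD", a)
    nfd_b = unicodedata.normalize("NFD", b)
    return nfd_a == nfd_b


if __name__ == "__main__":
    # Two different code points, one canonical equivalence class.
    print(describe(A_RING), describe(ANGSTROM))
    print(canonically_equivalent(A_RING, ANGSTROM))
    # Each code point in the combining sequence is separately named.
    print(describe(G_ACUTE_SEQ))
```

Comparing after NFD is one way to test canonical equivalence; NFC would
work equally well for that purpose.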

Tom.



Received on 2020-06-14 17:06:09