Date: Sun, 14 Jun 2020 22:57:01 +0200
On Sun, 14 Jun 2020 at 22:45, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> On 14/06/2020 22.19, Corentin Jabot wrote:
> >
> >
> > On Sun, 14 Jun 2020 at 21:55, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>
> > No, each code point in a sequence (given Unicode input) is a
> > separate abstract character in my view (after combining
> > surrogate pairs, of course).
> >
> >
> > For example, diacritics, when preceded by a letter, are not considered
> > abstract characters of their own.
>
> "Abstract character" is defined in https://www.unicode.org/glossary/ as
> follows:
>
> "A unit of information used for the organization, control, or
> representation of textual data."
> (ISO 10646 does not appear to have a definition in its clause 3.)
>
> I'm not seeing a conflict between that definition and my view that a
> diacritic, preceded by a letter, can be viewed as two different
> abstract characters. I agree that the alternate viewpoint "single
> abstract character" is not in conflict with the definition, either.
>
> What is your statement "are not considered abstract characters of their
> own" (which seems to leave little room for alternatives) based on?
>
Right, the glossary is very much incomplete.
The full definition is given in Unicode 13.0, section 3.4
( http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf ):
Abstract character: A unit of information used for the organization,
control, or representation of textual data.
• When representing data, the nature of that data is generally symbolic as
  opposed to some other kind of data (for example, aural or visual).
  Examples of such symbolic data include letters, ideographs, digits,
  punctuation, technical symbols, and dingbats.
• An abstract character has no concrete form and should not be confused
  with a glyph.
• An abstract character does not necessarily correspond to what a user
  thinks of as a “character” and should not be confused with a grapheme.
• The abstract characters encoded by the Unicode Standard are known as
  Unicode abstract characters.
• *Abstract characters not directly encoded by the Unicode Standard can
  often be represented by the use of combining character sequences.*
The definition of encoded character is also informative:
Encoded character: An association (or mapping) between an abstract
character and a code point.
• An encoded character is also referred to as a coded character.
• While an encoded character is formally defined in terms of the mapping
  between an abstract character and a code point, informally it can be
  thought of as an abstract character taken together with its assigned
  code point.
• *Occasionally, for compatibility with other standards, a single abstract
  character may correspond to more than one code point—for example, “Å”
  corresponds both to U+00C5 Å latin capital letter a with ring above and
  to U+212B Å angstrom sign.*
• *A single abstract character may also be represented by a sequence of
  code points—for example, latin capital letter g with acute may be
  represented by the sequence <U+0047 latin capital letter g, U+0301
  combining acute accent>, rather than being mapped to a single code
  point.*
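
To make the two emphasized bullets concrete, here is a small Python sketch using the standard `unicodedata` module. The helper names `code_points` and `same_abstract_character` are mine, not from the standard, and treating canonical equivalence (comparison after NFC normalization) as a stand-in for "same abstract character" is an assumption for illustration only, not something the quoted definitions state.

```python
import unicodedata

ring_a = "\u00C5"        # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212B"      # ANGSTROM SIGN
a_plus_ring = "A\u030A"  # A followed by COMBINING RING ABOVE
g_acute_seq = "G\u0301"  # G followed by COMBINING ACUTE ACCENT


def code_points(s):
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]


def same_abstract_character(a, b):
    """Hypothetical helper: compare two strings after NFC normalization.

    Assumption: canonical equivalence is used here as a proxy for
    'represents the same abstract character'.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# One user-perceived letter, two code points in the sequence.
print(code_points(g_acute_seq))                     # ['U+0047', 'U+0301']

# Two distinct code points that are canonically equivalent.
print(same_abstract_character(ring_a, angstrom))    # True

# A single code point versus a combining character sequence.
print(same_abstract_character(ring_a, a_plus_ring)) # True
```

The sketch only shows that several code point sequences can be canonically equivalent; whether each code point in such a sequence is also an abstract character in its own right is exactly the question under discussion.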
>
> Jens
>
Received on 2020-06-14 16:00:23