Re: [SG16] Agreeing with Corentin's point re: problem with strict use of abstract characters

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Mon, 15 Jun 2020 00:28:43 +0200
On Mon, 15 Jun 2020 at 00:22, Hubert Tong <hubert.reinterpretcast_at_[hidden]> wrote:

> On Sun, Jun 14, 2020 at 6:03 PM Tom Honermann via SG16 <
> sg16_at_[hidden]> wrote:
>> On 6/14/20 4:57 PM, Corentin Jabot wrote:
>> On Sun, 14 Jun 2020 at 22:45, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>>> On 14/06/2020 22.19, Corentin Jabot wrote:
>>> >
>>> >
>>> > On Sun, 14 Jun 2020 at 21:55, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>>> > No, each code point in a sequence (given Unicode input) is a
>>> > separate abstract character in my view (after combining surrogate
>>> > pairs, of course).
>>> >
>>> >
>>> > For example, diacritics, when preceded by a letter, are not considered
>>> > abstract characters of their own.
>>> "Abstract character" is defined in https://www.unicode.org/glossary/ as
>>> follows:
>>> "A unit of information used for the organization, control, or
>>> representation of textual data."
>>> (ISO 10646 does not appear to have a definition in its clause 3.)
>>> I'm not seeing a conflict between that definition and my view that a
>>> diacritic, preceded by a letter, can be viewed as two different abstract
>>> characters. I agree that the alternate viewpoint "single abstract
>>> character" is not in conflict with the definition, either.
>>> What is your statement "are not considered abstract characters of their
>>> own" (which seems to leave little room for alternatives) based on?
>> Right, the glossary is very much incomplete.
>> The definition is given in Unicode 13, section 3.4
>> ( http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf ):
>> Abstract character: A unit of information used for the organization,
>> control, or representation of textual data.
>> • When representing data, the nature of that data is generally symbolic as
>> opposed to some other kind of data (for example, aural or visual).
>> Examples of such symbolic data include letters, ideographs, digits,
>> punctuation, technical symbols, and dingbats.
>> • An abstract character has no concrete form and should not be confused
>> with a glyph.
>> • An abstract character does not necessarily correspond to what a user
>> thinks of as a “character” and should not be confused with a grapheme.
>> • The abstract characters encoded by the Unicode Standard are known as
>> Unicode abstract characters.
>> • *Abstract characters not directly encoded by the Unicode Standard can
>> often be represented by the use of combining character sequences.*
>> My reading of that aligns with Jens' interpretation. An abstract
>> character can be composed from abstract characters. The emphasized
>> statement above appears to reaffirm that.
>> The definition of encoded character is also informative:
>> Encoded character: An association (or mapping) between an abstract
>> character and a code point.
>> • An encoded character is also referred to as a coded character.
>> • While an encoded character is formally defined in terms of the mapping
>> between an abstract character and a code point, informally it can be
>> thought of as an abstract character taken together with its assigned
>> code point.
>> • *Occasionally, for compatibility with other standards, a single abstract
>> character may correspond to more than one code point—for example, “Å”
>> corresponds both to U+00C5 Å latin capital letter a with ring above and to
>> U+212B Å angstrom sign.*
>> • *A single abstract character may also be
>> represented by a sequence of code points—for example, latin capital letter
>> g with acute may be represented by the sequence <U+0047 latin capital
>> letter g, U+0301 combining acute accent>, rather than being mapped to a
>> single code point.*
>> Likewise here, these examples indicate that an abstract character may
>> have multiple encoded representations, but I don't read this as precluding
>> the indicated code points reflecting abstract characters on their own.
> It seems that, because we are not looking at a model where we retain coded
> characters in their original form for as long as possible, we're dealing
> with certain issues in larger scopes than may be strictly necessary. Are we
> sure that the same text processing should occur for the entirety of the
> source? In other words, should we consider more context-dependent (e.g.,
> specific to raw strings, specific to identifiers, etc.) text processing?

Can you be specific about which issues exactly (in particular,
implementation issues)?
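
For reference, the two encoded-character examples quoted above from Unicode 13 section 3.4 can be checked with Python's standard `unicodedata` module. This is only an illustrative sketch: the helper names `code_points` and `nfc` are hypothetical, and treating NFC normalization as a stand-in for "represents the same abstract character" is an assumption made for the demonstration, not something the quoted definitions state.

```python
import unicodedata

# Example 1 from the quoted text: one abstract character, two code points.
a_ring = "\u00C5"    # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212B"  # ANGSTROM SIGN

# Example 2 from the quoted text: one abstract character represented by a
# sequence of code points (U+0047 followed by U+0301 COMBINING ACUTE ACCENT).
g_acute_seq = "G\u0301"


def code_points(text):
    """Hypothetical helper: list the code points of a string as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text]


def nfc(text):
    """Hypothetical helper: canonical composition (NFC) of a string."""
    return unicodedata.normalize("NFC", text)


# The two code points compare unequal as encoded characters ...
print(code_points(a_ring), code_points(angstrom), a_ring == angstrom)

# ... but are canonically equivalent, so NFC maps both to U+00C5.
print(code_points(nfc(angstrom)), nfc(a_ring) == nfc(angstrom))

# The letter-plus-diacritic sequence is two code points, each of which has
# its own character name (and so can be looked at on its own).
print(code_points(g_acute_seq), [unicodedata.name(ch) for ch in g_acute_seq])

# NFC composes the sequence into a single precomposed code point.
print(code_points(nfc(g_acute_seq)))
```

Both readings discussed in the thread are visible here: before normalization each code point can be inspected individually, while after NFC the letter and the combining mark are treated as one unit.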

Received on 2020-06-14 17:32:04