Re: [SG16] Agreeing with Corentin's point re: problem with strict use of abstract characters

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Sun, 14 Jun 2020 20:33:20 -0400
On Sun, Jun 14, 2020 at 6:28 PM Corentin Jabot <corentinjabot_at_[hidden]>
wrote:

>
>
> On Mon, 15 Jun 2020 at 00:22, Hubert Tong <
> hubert.reinterpretcast_at_[hidden]> wrote:
>
>> On Sun, Jun 14, 2020 at 6:03 PM Tom Honermann via SG16 <
>> sg16_at_[hidden]> wrote:
>>
>>> On 6/14/20 4:57 PM, Corentin Jabot wrote:
>>>
>>>
>>>
>>> On Sun, 14 Jun 2020 at 22:45, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>>>
>>>> On 14/06/2020 22.19, Corentin Jabot wrote:
>>>> >
>>>> >
>>>> > On Sun, 14 Jun 2020 at 21:55, Jens Maurer <Jens.Maurer_at_[hidden]>
>>>> wrote:
>>>>
>>>> > No, each code point in a sequence (given Unicode input) is a
>>>> separate abstract character
>>>> > in my view (after combining surrogate pairs, of course).
>>>> >
>>>> >
>>>> > For example, diacritics, when preceded by a letter, are not considered
>>>> abstract characters of their own.
>>>>
>>>> "Abstract character" is defined in https://www.unicode.org/glossary/
>>>> as follows:
>>>>
>>>> "A unit of information used for the organization, control, or
>>>> representation of textual data."
>>>> (ISO 10646 does not appear to have a definition in its clause 3.)
>>>>
>>>> I'm not seeing a conflict between that definition and my view that a
>>>> diacritic,
>>>> preceded by a letter, can be viewed as two different abstract
>>>> characters.
>>>> I agree that the alternate viewpoint "single abstract character" is not
>>>> in conflict with the definition, either.
>>>>
>>>> What is your statement "are not considered abstract characters of their
>>>> own"
>>>> (which seems to leave little room for alternatives) based on?
>>>>
>>>
>>> Right, the glossary is very much incomplete.
>>>
>>> The definition is given in Unicode 13.0, section 3.4 (
>>> http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf )
>>>
>>> Abstract character: A unit of information used for the organization,
>>> control, or representation of textual data.
>>> • When representing data, the nature of that data is generally symbolic as
>>> opposed to some other kind of data (for example, aural or visual).
>>> Examples of such symbolic data include letters, ideographs, digits,
>>> punctuation, technical symbols, and dingbats.
>>> • An abstract character has no concrete form and should not be confused
>>> with a glyph.
>>> • An abstract character does not necessarily correspond to what a user
>>> thinks of as a “character” and should not be confused with a grapheme.
>>> • The abstract characters encoded by the Unicode Standard are known as
>>> Unicode abstract characters.
>>>
>>> *• Abstract characters not directly encoded by the Unicode Standard can
>>> often be represented by the use of combining character sequences. *
>>>
>>> My reading of that aligns with Jens' interpretation. An abstract
>>> character can be composed from abstract characters. The emphasized
>>> statement above appears to reaffirm that.
>>>
>>>
>>> The definition of encoded character is also informative
>>>
>>> Encoded character: An association (or mapping) between an abstract
>>> character and a code point.
>>> • An encoded character is also referred to as a coded character.
>>> • While an encoded character is formally defined in terms of the mapping
>>> between an abstract character and a code point, informally it can be
>>> thought of as an abstract character taken together with its assigned
>>> code point.
>>> *• Occasionally, for compatibility with other standards, a single abstract
>>> character may correspond to more than one code point—for example, “Å”
>>> corresponds both to U+00C5 Å latin capital letter a with ring above and to
>>> U+212B Å angstrom sign.*
>>> *• A single abstract character may also be represented by a sequence of
>>> code points—for example, latin capital letter g with acute may be
>>> represented by the sequence <U+0047 latin capital letter g, U+0301
>>> combining acute accent>, rather than being mapped to a single code point.*
>>>
>>> Likewise here, these examples indicate that an abstract character may
>>> have multiple encoded representations, but I don't read this as precluding
>>> the indicated code points reflecting abstract characters on their own.
>>>
>> It seems that, because we are not looking at a model where we retain
>> coded characters in their original form for as long as possible, we're
>> dealing with certain issues in larger scopes than may be strictly
>> necessary. Are we sure that the same text processing should occur for the
>> entirety of the source? In other words, should we consider more
>> context-dependent (e.g., specific to raw strings, specific to identifiers,
>> etc.) text processing?
>>
>
> Can you be specific about which issue exactly (specifically,
> implementation issues)?
>
I am not talking about implementation issues. I am talking about the
apparent difficulty in finding consensus on continuing with a broad
approach that is perceived to lose information early. It seems that looking
at the finer-grained contexts would help increase the confidence of the
participants.
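
For readers following the quoted Unicode examples, here is a minimal C++ sketch of the two cases: one abstract character ("Å") that corresponds to two different code points, and one abstract character (G with acute) represented by a sequence of two code points. It only records the code point values cited above. It does not implement normalization or take a position on which reading of "abstract character" is correct. All identifiers are hypothetical names chosen for illustration, not taken from any paper or implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Two distinct code points that the quoted Unicode text cites as
// corresponding to the same abstract character "Å".
// (Names are hypothetical; values are the ones quoted above.)
constexpr char32_t kLatinCapitalAWithRingAbove = 0x00C5;
constexpr char32_t kAngstromSign = 0x212B;

// "Latin capital letter G with acute" as the code point sequence
// <U+0047, U+0301> from the quoted text.
inline std::u32string g_with_acute_sequence() {
    return std::u32string{char32_t{0x0047}, char32_t{0x0301}};
}

// In a std::u32string holding UTF-32, each element is one code point,
// so the element count is the code point count. This says nothing
// about how many abstract characters the sequence represents.
inline std::size_t code_point_count(const std::u32string& text) {
    return text.size();
}

// Restates the two quoted examples as checks on code points only.
inline void check_quoted_examples() {
    // Same abstract character per the quoted text, different code points.
    assert(kLatinCapitalAWithRingAbove != kAngstromSign);
    // One abstract character per the quoted text, two code points.
    assert(code_point_count(g_with_acute_sequence()) == 2);
}
```

Calling `check_quoted_examples()` from a `main` function exercises both cases. A code-point-level comparison cannot tell that U+00C5 and U+212B stand for the same abstract character, which is one reason it matters for the wording whether text is kept as coded characters or mapped to abstract characters early.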

Received on 2020-06-14 19:36:48