Re: [SG16] Agreeing with Corentin's point re: problem with strict use of abstract characters

From: Tom Honermann <tom_at_[hidden]>
Date: Sun, 14 Jun 2020 18:36:08 -0400
On 6/14/20 6:21 PM, Hubert Tong wrote:
> On Sun, Jun 14, 2020 at 6:03 PM Tom Honermann via SG16
> <sg16_at_[hidden]> wrote:
>
> On 6/14/20 4:57 PM, Corentin Jabot wrote:
>>
>>
>> On Sun, 14 Jun 2020 at 22:45, Jens Maurer <Jens.Maurer_at_[hidden]>
>> wrote:
>>
>> On 14/06/2020 22.19, Corentin Jabot wrote:
>> >
>> >
>> > On Sun, 14 Jun 2020 at 21:55, Jens Maurer
>> <Jens.Maurer_at_[hidden]> wrote:
>>
>> > No, each code point in a sequence (given Unicode input)
>> is a separate abstract character
>> > in my view (after combining surrogate pairs, of course).
>> >
>> >
>> > For example, diacritics, when preceded by a letter, are not
>> considered abstract characters of their own.
>>
>> "Abstract character" is defined in
>> https://www.unicode.org/glossary/ as follows:
>>
>> "A unit of information used for the organization, control, or
>> representation of textual data."
>> (ISO 10646 does not appear to have a definition in its clause 3.)
>>
>> I'm not seeing a conflict between that definition and my view
>> that a diacritic,
>> preceded by a letter, can be viewed as two different abstract
>> characters.
>> I agree that the alternate viewpoint "single abstract
>> character" is not
>> in conflict with the definition, either.
>>
>> What is your statement "are not considered abstract
>> characters of their own"
>> (which seems to leave little room for alternatives) based on?
>>
>>
>> Right, the glossary is very much incomplete.
>>
>> The definition is given in Unicode 13, section 3.4 (
>> http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf )
>>
>> Abstract character: A unit of information used for the
>> organization, control, or representation of textual data.
>> • When representing data, the nature of that data is generally
>> symbolic as
>> opposed to some other kind of data (for example, aural or
>> visual). Examples of
>> such symbolic data include letters, ideographs, digits,
>> punctuation, technical
>> symbols, and dingbats.
>> • An abstract character has no concrete form and should not be
>> confused with a
>> glyph.
>> • An abstract character does not necessarily correspond to what a
>> user thinks of
>> as a “character” and should not be confused with a grapheme.
>> • The abstract characters encoded by the Unicode Standard are
>> known as Unicode abstract characters.
>> *• Abstract characters not directly encoded by the Unicode
>> Standard can often be represented by the use of combining
>> character sequences.*
> My reading of that aligns with Jens' interpretation. An abstract
> character can be composed from abstract characters. The
> emphasized statement above appears to reaffirm that.
>>
>> The definition of encoded character is also informative
>>
>> Encoded character: An association (or mapping) between an
>> abstract character and a code point.
>> • An encoded character is also referred to as a coded character.
>> • While an encoded character is formally defined in terms of the
>> mapping
>> between an abstract character and a code point, informally it can
>> be thought of
>> as an abstract character taken together with its assigned code point.
>> • *Occasionally, for compatibility with other standards, a single
>> abstract character
>> may correspond to more than one code point—for example, “Å”
>> corresponds
>> both to U+00C5 Å latin capital letter a with ring above and to U+212B
>> Å angstrom sign.
>> • A single abstract character may also be represented by a
>> sequence of code
>> points—for example, latin capital letter g with acute may be
>> represented by the
>> sequence <U+0047 latin capital letter g, U+0301 combining acute
>> accent>, rather than being mapped to a single code point.*
>>
> Likewise here, these examples indicate that an abstract character
> may have multiple encoded representations, but I don't read this
> as precluding the indicated code points reflecting abstract
> characters on their own.
>
> It seems that, because we are not looking at a model where we retain
> coded characters in their original form for as long as possible, we're
> dealing with certain issues in larger scopes than may be strictly
> necessary. Are we sure that the same text processing should occur for
> the entirety of the source? In other words, should we consider more
> context-dependent (e.g., specific to raw strings, specific to
> identifiers, etc.) text processing?

I've been thinking along those lines as well. I've been considering a
model in which an /extended-source-character/ is introduced in phase 1
and then, in phase 3, all /extended-source-character/s outside of raw
string literals are converted to /universal-character-name/s.
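
To make that concrete, here is a rough sketch of the conversion step. It
is only an illustration under stated assumptions, not proposed wording:
the names to_ucn and convert_outside_raw_strings are hypothetical, ASCII
stands in for the basic source character set, and raw string literals
are recognized by a simplified scan for R"delim( ... )delim" that ignores
comments, ordinary string literals, and identifiers that happen to end
in R.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: spell one code point as a universal-character-name,
// using \uXXXX for the BMP and \UXXXXXXXX otherwise.
std::u32string to_ucn(char32_t c) {
    char buf[11];  // "\U" + 8 hex digits + terminating NUL
    if (c <= 0xFFFF)
        std::snprintf(buf, sizeof buf, "\\u%04lX", static_cast<unsigned long>(c));
    else
        std::snprintf(buf, sizeof buf, "\\U%08lX", static_cast<unsigned long>(c));
    std::u32string out;
    for (const char* p = buf; *p; ++p) out += static_cast<char32_t>(*p);
    return out;
}

// Hypothetical phase 3 step: replace every non-ASCII code point by a UCN,
// except inside raw string literals, which are copied through unchanged.
std::u32string convert_outside_raw_strings(const std::u32string& src) {
    std::u32string out;
    std::size_t i = 0;
    while (i < src.size()) {
        // Simplified raw string detection (assumption): R" starts one.
        if (src[i] == U'R' && i + 1 < src.size() && src[i + 1] == U'"') {
            std::size_t open = src.find(U'(', i + 2);
            if (open != std::u32string::npos) {
                // The literal ends at )delim" where delim precedes the '('.
                std::u32string close =
                    U")" + src.substr(i + 2, open - (i + 2)) + U"\"";
                std::size_t end = src.find(close, open + 1);
                if (end != std::u32string::npos) {
                    end += close.size();
                    out.append(src, i, end - i);  // keep the literal verbatim
                    i = end;
                    continue;
                }
            }
        }
        char32_t c = src[i++];
        if (c < 0x80)
            out += c;          // treated as a basic source character
        else
            out += to_ucn(c);  // extended-source-character becomes a UCN
    }
    return out;
}
```

With this sketch, an input consisting of caf followed by U+00E9 outside a
raw string comes out with U+00E9 spelled as a UCN, while a U+00E9 between
the parentheses of R"x( ... )x" is left as it was.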

This still doesn't help with the reversion of line splicing in raw
string literals, but a similar approach should work for that.

Tom.


Received on 2020-06-14 17:39:19