sg16: Re: [SG16] Agreeing with Corentin's point re: problem with strict use of abstract characters

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 15 Jun 2020 02:40:44 -0400

On 6/14/20 6:53 PM, Corentin Jabot wrote:
>
>
> On Mon, 15 Jun 2020 at 00:36, Tom Honermann <tom_at_[hidden]> wrote:
>
> On 6/14/20 6:21 PM, Hubert Tong wrote:
>> On Sun, Jun 14, 2020 at 6:03 PM Tom Honermann via SG16
>> <sg16_at_[hidden]> wrote:
>>
>> On 6/14/20 4:57 PM, Corentin Jabot wrote:
>>>
>>>
>>> On Sun, 14 Jun 2020 at 22:45, Jens Maurer
>>> <Jens.Maurer_at_[hidden]> wrote:
>>>
>>> On 14/06/2020 22.19, Corentin Jabot wrote:
>>> >
>>> >
>>> > On Sun, 14 Jun 2020 at 21:55, Jens Maurer
>>> <Jens.Maurer_at_[hidden]> wrote:
>>>
>>> > No, each code point in a sequence (given Unicode
>>> input) is a separate abstract character
>>> > in my view (after combining surrogate pairs, of
>>> course).
>>> >
>>> >
>>> > For example, diacritics, when preceded by a letter, are
>>> not considered abstract characters of their own.
>>>
>>> "Abstract character" is defined in
>>> https://www.unicode.org/glossary/ as follows:
>>>
>>> "A unit of information used for the organization,
>>> control, or representation of textual data."
>>> (ISO 10646 does not appear to have a definition in its
>>> clause 3.)
>>>
>>> I'm not seeing a conflict between that definition and my
>>> view that a diacritic,
>>> preceded by a letter, can be viewed as two different
>>> abstract characters.
>>> I agree that the alternate viewpoint "single abstract
>>> character" is not
>>> in conflict with the definition, either.
>>>
>>> What is your statement "are not considered abstract
>>> characters of their own"
>>> (which seems to leave little room for alternatives)
>>> based on?
>>>
>>>
>>> Right, the glossary is very much incomplete.
>>>
>>> The definition is given in Unicode 13, section 3.4 (
>>> http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf )
>>>
>>> Abstract character: A unit of information used for the
>>> organization, control, or representation of textual data.
>>> • When representing data, the nature of that data is
>>> generally symbolic as
>>> opposed to some other kind of data (for example, aural or
>>> visual). Examples of
>>> such symbolic data include letters, ideographs, digits,
>>> punctuation, technical
>>> symbols, and dingbats.
>>> • An abstract character has no concrete form and should not
>>> be confused with a
>>> glyph.
>>> • An abstract character does not necessarily correspond to
>>> what a user thinks of
>>> as a “character” and should not be confused with a grapheme.
>>> • The abstract characters encoded by the Unicode Standard
>>> are known as Unicode abstract characters.
>>> *• Abstract characters not directly encoded by the Unicode
>>> Standard can often be represented by the use of combining
>>> character sequences.*
>>>
>> My reading of that aligns with Jens' interpretation. An
>> abstract character can be composed from abstract characters.
>> The emphasized statement above appears to reaffirm that.
>>>
>>> The definition of encoded character is also informative
>>>
>>> Encoded character: An association (or mapping) between an
>>> abstract character and a code point.
>>> • An encoded character is also referred to as a coded character.
>>> • While an encoded character is formally defined in terms of
>>> the mapping
>>> between an abstract character and a code point, informally
>>> it can be thought of
>>> as an abstract character taken together with its assigned
>>> code point.
>>> • *Occasionally, for compatibility with other standards, a
>>> single abstract character
>>> may correspond to more than one code point—for example, “Å”
>>> corresponds
>>> both to U+00C5 Å latin capital letter a with ring above and
>>> to U+212B
>>> Å angstrom sign.
>>> • A single abstract character may also be represented by a
>>> sequence of code
>>> points—for example, latin capital letter g with acute may be
>>> represented by the
>>> sequence <U+0047 latin capital letter g, U+0301 combining acute
>>> accent>, rather than being mapped to a single code point.*
>>>
>> Likewise here, these examples indicate that an abstract
>> character may have multiple encoded representations, but I
>> don't read this as precluding the indicated code points
>> reflecting abstract characters on their own.
>>
>> It seems that, because we are not looking at a model where we
>> retain coded characters in their original form for as long as
>> possible, we're dealing with certain issues in larger scopes than
>> may be strictly necessary. Are we sure that the same text
>> processing should occur for the entirety of the source? In other
>> words, should we consider more context-dependent (e.g., specific
>> to raw strings, specific to identifiers, etc.) text processing?
>
> I've been thinking along those lines as well. I've been
> considering a model in which an /extended-source-character/ is
> introduced in phase 1 and then, in phase 3, all
> /extended-source-character/s outside of raw string literals are
> converted to /universal-character-name/s.
>
>
> I would really love it if someone could explain to me the value of
> introducing /universal-character-name/s (or
> /extended-source-character/) etc in the internal representation,
> instead of unicode codepoints, knowing that these things represent
> unicode codepoints.

I find the UCN mechanism to be quite elegant. It simultaneously
accomplishes several things:

1. It allows the source language, as seen by all phases of translation
    after phase 1 (except for the magical revert for raw string
    literals) to be completely and abstractly defined by characters
    defined in the standard. An implementation's internal
    representation only needs to differentiate 96 characters. I don't
    need to know how to type, write, read, or pronounce ሴ; I just need
    to know what \u1234 represents (abstractly; from a language
    perspective, I probably don't care what character it denotes).
2. It allows explicit encoding of Unicode code points regardless of the
    encoding of the source input.
3. It enables source input to use extended characters with no special
    support beyond phase 1 (well, beyond phase 3 because of raw string
    literals).
4. It enables an escape from Unicode should such an escape prove
    necessary (e.g., to support those EBCDIC control characters, or to
    encode whether a UCN was explicit in the source or the result of
    character conversion, or to encode which of the possible Shift-JIS
    code points a character was written in). Yes, such an escape could
    always be introduced anyway. And yes, these are edge cases, some of
    which are probably not deserving of support.
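
To make points 1 and 3 concrete, here is a minimal sketch of what a phase 1 conversion to UCNs could look like. It assumes UTF-8 source input, treats all of ASCII as if it were the basic source character set (a simplification, since the real set has only 96 members), and does not validate malformed input. The name to_ucn_form is hypothetical and not taken from any implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: rewrites every non-ASCII code point in UTF-8 input
// as a universal-character-name (\uXXXX or \UXXXXXXXX).
// Assumption: the input is well-formed UTF-8; truncated sequences are dropped.
std::string to_ucn_form(std::string_view utf8) {
    std::string out;
    for (std::size_t i = 0; i < utf8.size();) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80) {            // ASCII passes through unchanged
            out += utf8[i];
            ++i;
            continue;
        }
        // Sequence length is determined by the lead byte.
        std::size_t len = lead >= 0xF0 ? 4 : lead >= 0xE0 ? 3 : 2;
        if (i + len > utf8.size()) break;  // truncated sequence: stop
        // Keep only the payload bits of the lead byte.
        std::uint32_t cp = lead & (0xFFu >> (len + 1));
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3Fu);
        }
        char buf[11];  // backslash + 'U' + 8 hex digits + NUL
        std::snprintf(buf, sizeof buf, cp <= 0xFFFF ? "\\u%04X" : "\\U%08X",
                      static_cast<unsigned>(cp));
        out += buf;
        i += len;
    }
    return out;
}
```

After such a pass, later phases only ever see basic characters plus UCN spellings, which is the property point 1 relies on. The raw string literal revert in phase 3 is not modeled here.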

>
> It would be so much simpler to map the source physical characters
> (using existing terminology so nobody gets confused) to Unicode code
> points in phase 1 and then replace universal-character-names that
> appear as escape sequences in non-raw string and character literals, as
> well as in header names, pp-identifiers and pp-numbers, later on when
> parsing pp-tokens in phase 3 (which would allow filtering out raw
> literals at that point).
It seems like we're converging towards do-something-in-phase-1 and then
do-something-further-in-phase-3. Good. But mapping to code points
assumes a bijective mapping between source input characters and Unicode
and we know we don't have that in all cases today. The idea of an
/extended-source-character/ is that it could carry additional
implementation-defined information (e.g., the actual code unit value(s)
for a source input character).
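
As a rough illustration of that idea, an /extended-source-character/ could be modeled as a code point paired with whatever the implementation wants to remember about the original input. This is only a sketch under my own assumptions: the type ExtendedSourceChar, its members, and the helper functions are hypothetical names, and nothing here is proposed wording.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical model of an extended-source-character.
struct ExtendedSourceChar {
    // Unicode mapping, if the source character has one.
    std::optional<char32_t> code_point;
    // Implementation-defined extra: the code unit value(s) as written.
    std::vector<std::uint8_t> source_units;
    // Implementation-defined extra: was this spelled as a UCN in the source?
    bool from_explicit_ucn = false;
};

// Two characters denote the same Unicode code point (both must be mapped).
bool same_character(const ExtendedSourceChar& a, const ExtendedSourceChar& b) {
    return a.code_point.has_value() && a.code_point == b.code_point;
}

// Two characters were written identically in the source input.
bool same_spelling(const ExtendedSourceChar& a, const ExtendedSourceChar& b) {
    return a.source_units == b.source_units &&
           a.from_explicit_ucn == b.from_explicit_ucn;
}

// Recover the original code units, e.g. when reverting inside a raw
// string literal.
std::vector<std::uint8_t> original_units(
    const std::vector<ExtendedSourceChar>& chars) {
    std::vector<std::uint8_t> out;
    for (const auto& c : chars) {
        out.insert(out.end(), c.source_units.begin(), c.source_units.end());
    }
    return out;
}
```

With a representation like this, two source characters that convert to the same code point still compare unequal under same_spelling, so the original input can be reconstructed even when the mapping to Unicode is not bijective.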
>
> This still doesn't help with the reversion of line splicing in raw
> string literals, but a similar approach should work for that.
>
> I think reversing that is fine-ish (see CWG1655, which I hope is one
> of the issues we can close as a result of all of that).

See also CWG1709.

Tom.

Received on 2020-06-15 01:43:57