sg16: Re: [SG16] Agreeing with Corentin's point re: problem with strict use of abstract characters

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Mon, 15 Jun 2020 10:18:39 +0200

On Mon, Jun 15, 2020, 08:40 Tom Honermann <tom_at_[hidden]> wrote:

> On 6/14/20 6:53 PM, Corentin Jabot wrote:
>
>
>
> On Mon, 15 Jun 2020 at 00:36, Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 6/14/20 6:21 PM, Hubert Tong wrote:
>>
>> On Sun, Jun 14, 2020 at 6:03 PM Tom Honermann via SG16 <
>> sg16_at_[hidden]> wrote:
>>
>>> On 6/14/20 4:57 PM, Corentin Jabot wrote:
>>>
>>>
>>>
>>> On Sun, 14 Jun 2020 at 22:45, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>>>
>>>> On 14/06/2020 22.19, Corentin Jabot wrote:
>>>> >
>>>> >
>>>> > On Sun, 14 Jun 2020 at 21:55, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>>>>
>>>> > No, each code point in a sequence (given Unicode input) is a
>>>> separate abstract character
>>>> > in my view (after combining surrogate pairs, of course).
>>>> >
>>>> >
>>>> > For example, diacritics, when preceded by a letter, are not considered
>>>> abstract characters of their own.
>>>>
>>>> "Abstract character" is defined in https://www.unicode.org/glossary/
>>>> as follows:
>>>>
>>>> "A unit of information used for the organization, control, or
>>>> representation of textual data."
>>>> (ISO 10646 does not appear to have a definition in its clause 3.)
>>>>
>>>> I'm not seeing a conflict between that definition and my view that a
>>>> diacritic,
>>>> preceded by a letter, can be viewed as two different abstract
>>>> characters.
>>>> I agree that the alternate viewpoint "single abstract character" is not
>>>> in conflict with the definition, either.
>>>>
>>>> What is your statement "are not considered abstract characters of their
>>>> own"
>>>> (which seems to leave little room for alternatives) based on?
>>>>
>>>
>>> Right, the glossary is very much incomplete.
>>>
>>> The definition is given in Unicode 13.0, section 3.4 (
>>> http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf )
>>>
>>> Abstract character: A unit of information used for the organization,
>>> control, or representation of textual data.
>>> • When representing data, the nature of that data is generally symbolic
>>>   as opposed to some other kind of data (for example, aural or visual).
>>>   Examples of such symbolic data include letters, ideographs, digits,
>>>   punctuation, technical symbols, and dingbats.
>>> • An abstract character has no concrete form and should not be confused
>>>   with a glyph.
>>> • An abstract character does not necessarily correspond to what a user
>>>   thinks of as a “character” and should not be confused with a grapheme.
>>> • The abstract characters encoded by the Unicode Standard are known as
>>>   Unicode abstract characters.
>>>
>>> *• Abstract characters not directly encoded by the Unicode Standard can
>>> often be represented by the use of combining character sequences. *
>>>
>>> My reading of that aligns with Jens' interpretation. An abstract
>>> character can be composed from abstract characters. The emphasized
>>> statement above appears to reaffirm that.
>>>
>>>
>>> The definition of encoded character is also informative
>>>
>>> Encoded character: An association (or mapping) between an abstract
>>> character and a code point.
>>> • An encoded character is also referred to as a coded character.
>>> • While an encoded character is formally defined in terms of the mapping
>>> between an abstract character and a code point, informally it can be
>>> thought of
>>> as an abstract character taken together with its assigned code point.
>>> *• Occasionally, for compatibility with other standards, a single
>>> abstract character may correspond to more than one code point—for
>>> example, “Å” corresponds both to U+00C5 Å latin capital letter a with
>>> ring above and to U+212B Å angstrom sign.*
>>> *• A single abstract character may also be represented by a sequence of
>>> code points—for example, latin capital letter g with acute may be
>>> represented by the sequence <U+0047 latin capital letter g, U+0301
>>> combining acute accent>, rather than being mapped to a single code
>>> point.*
>>>
>>> Likewise here, these examples indicate that an abstract character may
>>> have multiple encoded representations, but I don't read this as precluding
>>> the indicated code points reflecting abstract characters on their own.
>>>
>> It seems that, because we are not looking at a model where we retain
>> coded characters in their original form for as long as possible, we're
>> dealing with certain issues in larger scopes than may be strictly
>> necessary. Are we sure that the same text processing should occur for the
>> entirety of the source? In other words, should we consider more
>> context-dependent (e.g., specific to raw strings, specific to identifiers,
>> etc.) text processing?
>>
>> I've been thinking along those lines as well. I've been considering a
>> model in which an *extended-source-character* is introduced in phase 1
>> and then, in phase 3, all *extended-source-character*s outside of raw
>> string literals are converted to *universal-character-name*s.
>>
>
> I would really love it if someone could explain to me the value of
> introducing *universal-character-name*s (or *extended-source-character*)
> etc in the internal representation,
> instead of unicode codepoints, knowing that these things represent unicode
> codepoints.
>
> I find the UCN mechanism to be quite elegant. It simultaneously
> accomplishes several things:
>
> 1. It allows the source language, as seen by all phases of translation
> after phase 1 (except for the magical revert for raw string literals) to be
> completely and abstractly defined by characters defined in the standard.
> An implementation's internal representation only needs to differentiate 96
> characters. I don't need to know how to type, write, read, or pronounce ሴ;
> I just need to know what \u1234 represents (abstractly; from a language
> perspective, I probably don't care what character it denotes)
>
>
There is no difference between \u1234 and U+1234

> 2. It allows explicit encoding of Unicode code points regardless of
> the encoding of the source input.
>
I am not arguing against the presence of verbatim escape sequences in source.

>
> 3. It enables source input to use extended characters with no special
> support beyond phase 1 (well, beyond phase 3 because of raw string
> literals).
>
What does "special support" mean?

> 4. It enables an escape from Unicode should such an escape prove
> necessary (e.g., to support those EBCDIC control characters, or to encode
> whether a UCN was explicit in the source or the result of character
> conversion, or to encode which of the possible Shift-JIS code points a
> character was written in). Yes, such an escape could always be introduced
> anyway. And yes, these are edge cases, some of which are probably not
> deserving of support.
>
>
The standard is explicit that there is no observable difference outside of
raw string literals.

>
>
>
> It would be so much simpler to map the source physical characters (using
> existing terminology so nobody gets confused) to Unicode code points in
> phase 1 and then replace universal-character-names that appear as escape
> sequences in non-raw string and character literals, as well as in header
> names, pp-identifiers, and pp-numbers, later on when parsing pp-tokens in
> phase 3 (which would allow filtering out raw literals at that point).
>
> It seems like we're converging towards do-something-in-phase-1 and then
> do-something-further-in-phase-3. Good. But mapping to code points assumes
> a bijective mapping between source input characters and Unicode and we know
> we don't have that in all cases today. The idea of an
> *extended-source-character* is that it could carry additional
> implementation-defined information (e.g., the actual code unit value(s) for
> a source input character).
>

I don't think anything in the wording says that today, nor do I think there
would be any value in that.

> This still doesn't help with the reversion of line splicing in raw string
>> literals, but a similar approach should work for that.
>>
> I think reversing that is fine-ish (see CWG1655, which I hope is one of
> the issues we can close as a result of all of this).
>
> See also CWG1709.
>
> Tom.
>

Received on 2020-06-15 03:22:02