Subject: Re: Agreeing with Corentin's point re: problem with strict use of abstract characters
From: Tom Honermann (tom_at_[hidden])
Date: 2020-06-15 10:49:14
On 6/15/20 4:18 AM, Corentin Jabot wrote:
> On Mon, Jun 15, 2020, 08:40 Tom Honermann <tom_at_[hidden]> wrote:
> On 6/14/20 6:53 PM, Corentin Jabot wrote:
>> On Mon, 15 Jun 2020 at 00:36, Tom Honermann <tom_at_[hidden]> wrote:
>> On 6/14/20 6:21 PM, Hubert Tong wrote:
>>> On Sun, Jun 14, 2020 at 6:03 PM Tom Honermann via SG16
>>> <sg16_at_[hidden]> wrote:
>>> On 6/14/20 4:57 PM, Corentin Jabot wrote:
>>>> On Sun, 14 Jun 2020 at 22:45, Jens Maurer
>>>> <Jens.Maurer_at_[hidden]> wrote:
>>>> On 14/06/2020 22.19, Corentin Jabot wrote:
>>>> > On Sun, 14 Jun 2020 at 21:55, Jens Maurer
>>>> > <Jens.Maurer_at_[hidden]> wrote:
>>>> >    No, each code point in a sequence (given Unicode input) is a
>>>> >    separate abstract character in my view (after combining
>>>> >    surrogate pairs, of course).
>>>> > For example, diacritics, when preceded by a letter,
>>>> > are not considered abstract characters of their own.
>>>> "Abstract character" is defined in
>>>> https://www.unicode.org/glossary/ as follows:
>>>> "A unit of information used for the organization,
>>>> control, or representation of textual data."
>>>> (ISO 10646 does not appear to have a definition in
>>>> its clause 3.)
>>>> I'm not seeing a conflict between that definition
>>>> and my view that a diacritic,
>>>> preceded by a letter, can be viewed as two
>>>> different abstract characters.
>>>> I agree that the alternate viewpoint "single
>>>> abstract character" is not
>>>> in conflict with the definition, either.
>>>> What is your statement "are not considered abstract
>>>> characters of their own"
>>>> (which seems to leave little room for alternatives)
>>>> based on?
>>>> Right, the glossary is very much incomplete.
>>>> The definition is given in Unicode 13, section 3.4 (
>>>> http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf )
>>>> Abstract character: A unit of information used for the
>>>> organization, control, or representation of textual data.
>>>> • When representing data, the nature of that data is
>>>> generally symbolic as
>>>> opposed to some other kind of data (for example, aural
>>>> or visual). Examples of
>>>> such symbolic data include letters, ideographs, digits,
>>>> punctuation, technical
>>>> symbols, and dingbats.
>>>> • An abstract character has no concrete form and should
>>>> not be confused with a glyph.
>>>> • An abstract character does not necessarily correspond
>>>> to what a user thinks of
>>>> as a "character" and should not be confused with a grapheme.
>>>> • The abstract characters encoded by the Unicode
>>>> Standard are known as Unicode abstract characters.
>>>> • *Abstract characters not directly encoded by the
>>>> Unicode Standard can often be represented by the use of
>>>> combining character sequences.*
>>> My reading of that aligns with Jens' interpretation. An
>>> abstract character can be composed from abstract
>>> characters. The emphasized statement above appears to
>>> reaffirm that.
>>>> The definition of encoded character is also informative:
>>>> Encoded character: An association (or mapping) between
>>>> an abstract character and a code point.
>>>> • An encoded character is also referred to as a coded character.
>>>> • While an encoded character is formally defined in
>>>> terms of the mapping
>>>> between an abstract character and a code point,
>>>> informally it can be thought of
>>>> as an abstract character taken together with its
>>>> assigned code point.
>>>> • *Occasionally, for compatibility with other
>>>> standards, a single abstract character
>>>> may correspond to more than one code point; for example,
>>>> "Å" corresponds
>>>> both to U+00C5 Å latin capital letter a with ring above
>>>> and to U+212B
>>>> Å angstrom sign.
>>>> • A single abstract character may also be represented
>>>> by a sequence of code
>>>> points; for example, latin capital letter g with acute
>>>> may be represented by the
>>>> sequence <U+0047 latin capital letter g, U+0301
>>>> combining acute
>>>> accent>, rather than being mapped to a single code point.*
>>> Likewise here, these examples indicate that an abstract
>>> character may have multiple encoded representations, but
>>> I don't read this as precluding the indicated code
>>> points reflecting abstract characters on their own.
>>> It seems that, because we are not looking at a model where
>>> we retain coded characters in their original form for as
>>> long as possible, we're dealing with certain issues in
>>> larger scopes than may be strictly necessary. Are we sure
>>> that the same text processing should occur for the entirety
>>> of the source? In other words, should we consider more
>>> context-dependent (e.g., specific to raw strings, specific
>>> to identifiers, etc.) text processing?
>> I've been thinking along those lines as well. I've been
>> considering a model in which an /extended-source-character/
>> is introduced in phase 1 and then, in phase 3, all
>> /extended-source-character/s outside of raw string literals
>> are converted to /universal-character-name/s.
>> I would really love it if someone could explain to me the value
>> of introducing /universal-character-name/s (or
>> /extended-source-character/s) in the internal representation,
>> instead of Unicode code points, knowing that these
>> things represent Unicode code points.
> I find the UCN mechanism to be quite elegant. It simultaneously
> accomplishes several things:
> 1. It allows the source language, as seen by all phases of
> translation after phase 1 (except for the magical revert for
> raw string literals) to be completely and abstractly defined
> by characters defined in the standard. An implementation's
> internal representation only needs to differentiate 96
> characters. I don't need to know how to type, write, read, or
> pronounce ሴ; I just need to know what \u1234 represents
> (abstractly; from a language perspective, I probably don't
> care what character it denotes).
> There is no difference between \u1234 and U+1234
I haven't understood your direction as encoding the sequence of
characters 'U', '+', '1', '2', '3', '4' where UCNs are produced today
(if that were your direction, then I agree that the only difference here
is spelling). I think you have a different mental model in mind in
which the internal representation encodes what today are UCNs in some
unspecified encoding form that does not serialize the code point value
as text. I don't see what that fixes from a standard perspective. I
think the current design is more elegant because it doesn't require
translating explicitly written UCNs to code points encoded in some
unspecified internal representation.
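
To make the "only spelling" distinction concrete, here is a minimal sketch. The helper names to_ucn and from_ucn are made up for illustration; they are not taken from the standard or from any implementation. It shows a code point being serialized as UCN text and recovered again, which is the round trip an implementation would need if its internal representation stored code point values instead of the UCN spelling:

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: spell a code point the way a UCN is written in
// source text, \uXXXX when four hex digits suffice, \UXXXXXXXX otherwise.
std::string to_ucn(char32_t cp) {
    char buf[11];  // backslash + 'U' + 8 hex digits + terminating NUL
    if (cp <= 0xFFFF)
        std::snprintf(buf, sizeof buf, "\\u%04lX", static_cast<unsigned long>(cp));
    else
        std::snprintf(buf, sizeof buf, "\\U%08lX", static_cast<unsigned long>(cp));
    return buf;
}

// Hypothetical inverse: recover the code point value from a UCN spelling.
// Assumes well-formed input (as produced by to_ucn above); no validation.
char32_t from_ucn(const std::string& ucn) {
    return static_cast<char32_t>(std::stoul(ucn.substr(2), nullptr, 16));
}
```

With these, from_ucn(to_ucn(cp)) yields cp again for any code point value, so nothing is lost in either direction; the question is only which form the wording treats as primary.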
> 2. It allows explicit encoding of Unicode code points regardless
> of the encoding of the source input.
> I am not arguing against the presence of verbatim escape sequences in
I know that. I'm just listing it as one of my perceived benefits of
UCNs and how they solve a number of distinct issues.
> 3. It enables source input to use extended characters with no
> special support beyond phase 1 (well, beyond phase 3 because
> of raw string literals).
> What does special support mean?
It means that the standard does not have to have separate rules for
handling them post phase 1 (really phase 3).
> 4. It enables an escape from Unicode should such an escape prove
> necessary (e.g., to support those EBCDIC control characters,
> or to encode whether a UCN was explicit in the source or the
> result of character conversion, or to encode which of the
> possible Shift-JIS code points a character was written in).
> Yes, such an escape could always be introduced anyway. And
> yes, these are edge cases, some of which are probably not
> deserving of support.
> The standard is explicit about there not being any observable
> difference outside of raw literals.
Yes, and Hubert has claimed that the existing wording is deficient in
this area because it doesn't reflect existing practice (e.g., those
EBCDIC control characters).
>> It would be so much simpler to map the source physical characters
>> (using existing terminology so nobody gets confused) to Unicode
>> code points in phase 1 and then replace universal-character-names
>> that appear as escape sequences in non-raw string
>> and character literals, as well as header names, pp-identifiers
>> and pp-numbers, later on when parsing pp-tokens in phase 3 (which
>> would allow filtering out raw literals at that point).
> It seems like we're converging towards do-something-in-phase-1 and
> then do-something-further-in-phase-3. Good. But mapping to code
> points assumes a bijective mapping between source input characters
> and Unicode and we know we don't have that in all cases today.
> The idea of an /extended-source-character/ is that it could carry
> additional implementation-defined information (e.g., the actual
> code unit value(s) for a source input character).
> I don't think anything in the wording says that today, nor do I think
> there would be any value for that
The standard implies it today by requiring reversion of phase 1 in raw
string literals. This approach (which, admittedly, is not at all well
described yet) would provide a way of tracking source input details that
are not reflected by code point values by themselves.
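
As a rough illustration of what I mean, and only a sketch under assumptions: the struct and function names below are invented for this message and appear nowhere in the wording. An /extended-source-character/ could pair the code point with the code units it was written as, so that later phases see only code points while raw string literals can recover the original input:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical model of an extended-source-character: the code point that
// phase 1 mapped the input to, plus the implementation-defined detail of
// which source code units were actually written.
struct extended_source_character {
    char32_t code_point;
    std::vector<std::uint8_t> source_code_units;
};

// What the phases after phase 1 would observe: code points only.
std::u32string code_points(const std::vector<extended_source_character>& text) {
    std::u32string out;
    for (const auto& c : text) out.push_back(c.code_point);
    return out;
}

// What reversion inside a raw string literal would recover: the original
// source code units, in order.
std::vector<std::uint8_t> revert(const std::vector<extended_source_character>& text) {
    std::vector<std::uint8_t> out;
    for (const auto& c : text)
        out.insert(out.end(), c.source_code_units.begin(), c.source_code_units.end());
    return out;
}
```

Two inputs written with different code units that map to the same code point would compare equal through code_points but remain distinguishable through revert, which is the extra information a bare code point cannot carry.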
>> This still doesn't help with the reversion of line splicing
>> in raw string literals, but a similar approach should work
>> for that.
>> I think reversing that is fine-ish (see CWG1655, which I hope
>> is one of the issues we can close as a result of all of that)
> See also CWG1709.