Date: Mon, 15 Jun 2020 12:52:42 -0400
On 6/15/20 12:17 PM, Corentin Jabot wrote:
>
>
> On Mon, 15 Jun 2020 at 17:49, Tom Honermann <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
>
> On 6/15/20 4:18 AM, Corentin Jabot wrote:
>>
>>
>> On Mon, Jun 15, 2020, 08:40 Tom Honermann <tom_at_[hidden]
>> <mailto:tom_at_[hidden]>> wrote:
>>
>> On 6/14/20 6:53 PM, Corentin Jabot wrote:
>>>
>>>
>>> On Mon, 15 Jun 2020 at 00:36, Tom Honermann
>>> <tom_at_[hidden] <mailto:tom_at_[hidden]>> wrote:
>>>
>>> On 6/14/20 6:21 PM, Hubert Tong wrote:
>>>> On Sun, Jun 14, 2020 at 6:03 PM Tom Honermann via SG16
>>>> <sg16_at_[hidden] <mailto:sg16_at_[hidden]>>
>>>> wrote:
>>>>
>>>> On 6/14/20 4:57 PM, Corentin Jabot wrote:
>>>>>
>>>>>
>>>>> On Sun, 14 Jun 2020 at 22:45, Jens Maurer
>>>>> <Jens.Maurer_at_[hidden] <mailto:Jens.Maurer_at_[hidden]>>
>>>>> wrote:
>>>>>
>>>>> On 14/06/2020 22.19, Corentin Jabot wrote:
>>>>> >
>>>>> >
>>>>> > On Sun, 14 Jun 2020 at 21:55, Jens Maurer
>>>>> <Jens.Maurer_at_[hidden]
>>>>> <mailto:Jens.Maurer_at_[hidden]>
>>>>> <mailto:Jens.Maurer_at_[hidden]
>>>>> <mailto:Jens.Maurer_at_[hidden]>>> wrote:
>>>>>
>>>>> > No, each code point in a sequence (given
>>>>> Unicode input) is a separate abstract character
>>>>> > in my view (after combining surrogate
>>>>> pairs, of course).
>>>>> >
>>>>> >
>>>>> > For example, diacritics, when preceded by a
>>>>> letter, are not considered abstract characters
>>>>> of their own.
>>>>>
>>>>> "Abstract character" is defined in
>>>>> https://www.unicode.org/glossary/ as follows:
>>>>>
>>>>> "A unit of information used for the
>>>>> organization, control, or representation of
>>>>> textual data."
>>>>> (ISO 10646 does not appear to have a
>>>>> definition in its clause 3.)
>>>>>
>>>>> I'm not seeing a conflict between that
>>>>> definition and my view that a diacritic,
>>>>> preceded by a letter, can be viewed as two
>>>>> different abstract characters.
>>>>> I agree that the alternate viewpoint "single
>>>>> abstract character" is not
>>>>> in conflict with the definition, either.
>>>>>
>>>>> What is your statement "are not considered
>>>>> abstract characters of their own"
>>>>> (which seems to leave little room for
>>>>> alternatives) based on?
>>>>>
>>>>>
>>>>> Right, the glossary is very much incomplete.
>>>>>
>>>>> The definition is given in Unicode 13, section 3.4 (
>>>>> http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf )
>>>>>
>>>>>
>>>>> Abstract character: A unit of information used for
>>>>> the organization, control, or representation of
>>>>> textual data.
>>>>> • When representing data, the nature of that data
>>>>> is generally symbolic as
>>>>> opposed to some other kind of data (for example,
>>>>> aural or visual). Examples of
>>>>> such symbolic data include letters, ideographs,
>>>>> digits, punctuation, technical
>>>>> symbols, and dingbats.
>>>>> • An abstract character has no concrete form and
>>>>> should not be confused with a
>>>>> glyph.
>>>>> • An abstract character does not necessarily
>>>>> correspond to what a user thinks of
>>>>> as a “character” and should not be confused with a
>>>>> grapheme.
>>>>> • The abstract characters encoded by the Unicode
>>>>> Standard are known as Unicode abstract characters.
>>>>> *• Abstract characters not directly encoded by the
>>>>> Unicode Standard can often be represented by the
>>>>> use of combining character sequences.*
>>>> My reading of that aligns with Jens'
>>>> interpretation. An abstract character can be
>>>> composed from abstract characters. The emphasized
>>>> statement above appears to reaffirm that.
>>>>>
>>>>> The definition of encoded character is also
>>>>> informative
>>>>>
>>>>> Encoded character: An association (or mapping)
>>>>> between an abstract character and a code point.
>>>>> • An encoded character is also referred to as a
>>>>> coded character.
>>>>> • While an encoded character is formally defined
>>>>> in terms of the mapping
>>>>> between an abstract character and a code point,
>>>>> informally it can be thought of
>>>>> as an abstract character taken together with its
>>>>> assigned code point.
>>>>> • *Occasionally, for compatibility with other
>>>>> standards, a single abstract character
>>>>> may correspond to more than one code point—for
>>>>> example, “Å” corresponds
>>>>> both to U+00C5 Å latin capital letter a with ring
>>>>> above and to U+212B
>>>>> Å angstrom sign.
>>>>> • A single abstract character may also be
>>>>> represented by a sequence of code
>>>>> points—for example, latin capital letter g with
>>>>> acute may be represented by the
>>>>> sequence <U+0047 latin capital letter g, U+0301
>>>>> combining acute
>>>>> accent>, rather than being mapped to a single code
>>>>> point.*
>>>>>
>>>> Likewise here, these examples indicate that an
>>>> abstract character may have multiple encoded
>>>> representations, but I don't read this as
>>>> precluding the indicated code points reflecting
>>>> abstract characters on their own.
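A minimal sketch, purely illustrative and assuming a C++20 implementation
(for char8_t) with UTF-8 source, to make the code point sequences behind the
Å examples concrete:

    #include <cstdio>

    int main() {
        // Three spellings a reader would see as the same abstract character
        // "Å"; array sizes include the terminating null.
        constexpr char8_t precomposed[] = u8"\u00C5";        // U+00C5
        constexpr char8_t angstrom[]    = u8"\u212B";        // U+212B
        constexpr char8_t combining[]   = u8"\u0041\u030A";  // U+0041 U+030A

        // Distinct code point sequences yield distinct encoded forms; the core
        // language performs no Unicode normalization, so any equivalence
        // between them lies outside its scope.
        std::printf("%zu %zu %zu\n",
                    sizeof precomposed, sizeof angstrom, sizeof combining);
    }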
>>>>
>>>> It seems that, because we are not looking at a model
>>>> where we retain coded characters in their original form
>>>> for as long as possible, we're dealing with certain
>>>> issues in larger scopes than may be strictly necessary.
>>>> Are we sure that the same text processing should occur
>>>> for the entirety of the source? In other words, should
>>>> we consider more context-dependent (e.g., specific to
>>>> raw strings, specific to identifiers, etc.) text
>>>> processing?
>>>
>>> I've been thinking along those lines as well. I've been
>>> considering a model in which an
>>> /extended-source-character/ is introduced in phase 1 and
>>> then, in phase 3, all /extended-source-character/s
>>> outside of raw string literals are converted to
>>> /universal-character-name/s.
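A hedged sketch of what that model would mean observably, assuming UTF-8
source and literal encodings (the specific code point U+00E9 is only an
example):

    #include <cstdio>
    #include <cstring>

    int main() {
        // Outside raw string literals, converting an extended character to a
        // universal-character-name is unobservable: both spellings encode
        // U+00E9 identically in the literal.
        const char *direct = "é";
        const char *ucn    = "\u00e9";
        std::printf("same literal contents: %d\n",
                    std::strcmp(direct, ucn) == 0);

        // Inside a raw string literal nothing may be reinterpreted: the six
        // characters \ u 0 0 e 9 stay exactly as written, which is why raw
        // string literals have to be excluded from the phase 3 conversion.
        const char *raw = R"(\u00e9)";
        std::printf("raw literal length: %zu\n", std::strlen(raw));
    }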
>>>
>>>
>>> I would really love it if someone could explain to me the
>>> value of introducing /universal-character-name/s (or
>>> /extended-source-character/s, etc.) in the internal
>>> representation instead of Unicode code points, given that these
>>> things represent Unicode code points.
>>
>> I find the UCN mechanism to be quite elegant. It
>> simultaneously accomplishes several things:
>>
>> 1. It allows the source language, as seen by all phases of
>> translation after phase 1 (except for the magical revert
>> for raw string literals), to be completely and abstractly
>> described in terms of characters defined in the standard. An
>> implementation's internal representation only needs to
>> differentiate 96 characters. I don't need to know how to
>> type, write, read, or pronounce ሴ; I just need to know
>> what \u1234 represents (abstractly; from a language
>> perspective, I probably don't care what character it denotes).
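As a purely illustrative example of that point (assuming a compiler that
accepts extended characters in identifiers), both spellings below name the
same entity:

    #include <cstdio>

    int \u1234 = 42;               // declared with the universal-character-name

    int main() {
        std::printf("%d\n", ሴ);    // used with the character itself
    }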
>>
>>
>> There is no difference between \u1234 and U+1234
> I haven't understood your direction as encoding the sequence of
> characters 'U', '+', '1', '2', '3', '4' where UCNs are produced
> today (if that were your direction, then I agree that the only
> difference here is spelling). I think you have a different mental
> model in mind in which the internal representation encodes what
> today are UCNs in some unspecified encoding form that does not
> serialize the code point value as text. I don't see what that
> fixes from a standard perspective. I think the current design is
> more elegant because it doesn't require translating explicitly
> written UCNs to code points encoded in some unspecified internal
> representation.
>
>
> My approach would be to not transform verbatim escape sequences until
> tokenization in phase 3, such that we would not have to revert
> that operation for string literals.
Verbatim escape sequences are never transformed at present (well, not
until they are encoded in literals at phase 5).
> Universal escape sequences can then appear
> * in character literals
> * in non-raw string literals
> * in pp-numbers, pp-identifiers
> * in header names
> * in UDLs
>
> This avoids losing the distinction in phase 1 and keeps phase 1 simple.
What grammar production are you using to carry extended characters
through phase 1?
In phase 3, how do you intend to lex tokens before transforming extended
characters? There is a chicken/egg problem there.
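To make the chicken/egg concrete, a purely illustrative snippet (again
assuming a compiler that accepts extended characters in identifiers):

    #include <cstdio>

    int café = 0;               // identifier: must match an identifier production
    const char *a = "café";     // string literal: converting é is unobservable
    const char *b = R"(café)";  // raw string literal: é must stay exactly as written

    int main() { std::printf("%d %s %s\n", café, a, b); }

The required treatment of é differs per token kind, yet the token kinds are
only known once lexing is under way.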
Tom.
Received on 2020-06-15 11:55:54