On Mon, 15 Jun 2020 at 18:52, Tom Honermann <tom@honermann.net> wrote:

On 6/15/20 12:17 PM, Corentin Jabot wrote:

On Mon, 15 Jun 2020 at 17:49, Tom Honermann <tom@honermann.net> wrote:

On 6/15/20 4:18 AM, Corentin Jabot wrote:

On Mon, Jun 15, 2020, 08:40 Tom Honermann <tom@honermann.net> wrote:

On 6/14/20 6:53 PM, Corentin Jabot wrote:

On Mon, 15 Jun 2020 at 00:36, Tom Honermann <tom@honermann.net> wrote:

On 6/14/20 6:21 PM, Hubert Tong wrote:

On Sun, Jun 14, 2020 at 6:03 PM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

On 6/14/20 4:57 PM, Corentin Jabot wrote:

On Sun, 14 Jun 2020 at 22:45, Jens Maurer <Jens.Maurer@gmx.net> wrote:

On 14/06/2020 22.19, Corentin Jabot wrote:
>
>
> On Sun, 14 Jun 2020 at 21:55, Jens Maurer <Jens.Maurer@gmx.net> wrote:

> No, each code point in a sequence (given Unicode input) is a separate abstract character
> in my view (after combining surrogate pairs, of course).
>
>
> For example, diacritics, when preceded by a letter, are not considered abstract characters of their own.

"Abstract character" is defined in https://www.unicode.org/glossary/ as follows:

"A unit of information used for the organization, control, or representation of textual data."
(ISO 10646 does not appear to have a definition in its clause 3.)

I'm not seeing a conflict between that definition and my view that a diacritic,
preceded by a letter, can be viewed as two different abstract characters.
I agree that the alternate viewpoint "single abstract character" is not
in conflict with the definition, either.

What is your statement "are not considered abstract characters of their own"
(which seems to leave little room for alternatives) based on?

Right, the glossary is very much incomplete.

The definition is given in Unicode 13, section 3.4 ( http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf )

Abstract character: A unit of information used for the organization, control, or representation of textual data.
• When representing data, the nature of that data is generally symbolic as opposed to some other kind of data (for example, aural or visual). Examples of such symbolic data include letters, ideographs, digits, punctuation, technical symbols, and dingbats.
• An abstract character has no concrete form and should not be confused with a glyph.
• An abstract character does not necessarily correspond to what a user thinks of as a “character” and should not be confused with a grapheme.
• The abstract characters encoded by the Unicode Standard are known as Unicode abstract characters.
• Abstract characters not directly encoded by the Unicode Standard can often be represented by the use of combining character sequences.

My reading of that aligns with Jens' interpretation. An abstract character can be composed from abstract characters. The emphasized statement above appears to reaffirm that.

The definition of encoded character is also informative

Encoded character: An association (or mapping) between an abstract character and a code point.
• An encoded character is also referred to as a coded character.
• While an encoded character is formally defined in terms of the mapping
between an abstract character and a code point, informally it can be thought of
as an abstract character taken together with its assigned code point.
• Occasionally, for compatibility with other standards, a single abstract character
may correspond to more than one code point—for example, “Å” corresponds
both to U+00C5 Å latin capital letter a with ring above and to U+212B
Å angstrom sign.
• A single abstract character may also be represented by a sequence of code
points—for example, latin capital letter g with acute may be represented by the
sequence <U+0047 latin capital letter g, U+0301 combining acute
accent>, rather than being mapped to a single code point.

Likewise here, these examples indicate that an abstract character may have multiple encoded representations, but I don't read this as precluding the indicated code points reflecting abstract characters on their own.
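To make the quoted examples concrete, here is a minimal C++ sketch that spells out the code point sequences involved. The variable names are hypothetical and exist only for this illustration. The code point values are the ones given in the quoted text, plus U+030A COMBINING RING ABOVE for a decomposed spelling of "Å".

```cpp
#include <string>

// Three spellings of what a reader perceives as "Å".
const std::u32string a_ring_precomposed = U"\u00C5";   // LATIN CAPITAL LETTER A WITH RING ABOVE
const std::u32string angstrom_sign      = U"\u212B";   // ANGSTROM SIGN
const std::u32string a_ring_decomposed  = U"A\u030A";  // 'A' followed by COMBINING RING ABOVE

// The sequence from the quoted example: 'G' followed by COMBINING ACUTE ACCENT.
const std::u32string g_acute_sequence = U"G\u0301";
```

The first two strings each hold one code point and compare unequal; the last two each hold two code points. Whether each element of a two-element sequence is also an abstract character on its own is the question under discussion, and the code does not settle it.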

It seems that, because we are not looking at a model where we retain coded characters in their original form for as long as possible, we're dealing with certain issues in larger scopes than may be strictly necessary. Are we sure that the same text processing should occur for the entirety of the source? In other words, should we consider more context-dependent (e.g., specific to raw strings, specific to identifiers, etc.) text processing?

I've been thinking along those lines as well. I've been considering a model in which an extended-source-character is introduced in phase 1 and then, in phase 3, all extended-source-characters outside of raw string literals are converted to universal-character-names.

I would really love it if someone could explain to me the value of introducing universal-character-names (or extended-source-character), etc., in the internal representation instead of Unicode code points, knowing that these things represent Unicode code points.

I find the UCN mechanism to be quite elegant. It simultaneously accomplishes several things:

It allows the source language, as seen by all phases of translation after phase 1 (except for the magical revert for raw string literals), to be completely and abstractly defined by characters defined in the standard. An implementation's internal representation only needs to differentiate 96 characters. I don't need to know how to type, write, read, or pronounce ሴ; I just need to know what \u1234 represents (abstractly; from a language perspective, I probably don't care what character it denotes).
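As a rough illustration of that model, the sketch below rewrites every character outside a basic set as a universal-character-name. Both function names are hypothetical, and treating every code point below U+0080 as "basic" is a simplifying assumption made for this example, not the standard's definition of the basic source character set.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: spell one code point as a universal-character-name.
std::string to_ucn(char32_t cp) {
    char buf[11];  // "\UXXXXXXXX" is 10 characters plus the terminator
    if (cp <= 0xFFFF)
        std::snprintf(buf, sizeof buf, "\\u%04lX", static_cast<unsigned long>(cp));
    else
        std::snprintf(buf, sizeof buf, "\\U%08lX", static_cast<unsigned long>(cp));
    return buf;
}

// Hypothetical model of the phase 1 replacement. Assumption: code points below
// U+0080 pass through unchanged; everything else becomes a UCN.
std::string replace_extended_with_ucns(const std::u32string& source) {
    std::string out;
    for (char32_t cp : source) {
        if (cp < 0x80)
            out += static_cast<char>(cp);
        else
            out += to_ucn(cp);
    }
    return out;
}
```

With this sketch, a source line containing ሴ and a source line containing the six characters \u1234 become indistinguishable after the replacement, which is the loss of distinction that raw string literals later have to undo.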

There is no difference between \u1234 and U+1234

I haven't understood your direction as encoding the sequence of characters 'U', '+', '1', '2', '3', '4' where UCNs are produced today (if that were your direction, then I agree that the only difference here is spelling). I think you have a different mental model in mind in which the internal representation encodes what today are UCNs in some unspecified encoding form that does not serialize the code point value as text. I don't see what that fixes from a standard perspective. I think the current design is more elegant because it doesn't require translating explicitly written UCNs to code points encoded in some unspecified internal representation.

My approach would be to not transform verbatim escape sequences until the tokenization in phase 3, such that we would not have to revert that operation for string literals

Verbatim escape sequences are never transformed at present (well, not until they are encoded in literals at phase 5).
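For reference, the observable behavior in today's language can be checked at compile time. This is a small sketch with hypothetical variable names; it only shows that a UCN written in an ordinary literal ends up encoded as one character, while the same six source characters in a raw literal survive verbatim.

```cpp
#include <cstddef>

// UTF-32 literal: the UCN denotes a single code point (plus the terminator).
constexpr std::size_t ucn_in_u32_literal = sizeof(U"\u1234") / sizeof(char32_t) - 1;

// UTF-8 literal: U+1234 is encoded as three code units.
constexpr std::size_t ucn_in_u8_literal = sizeof(u8"\u1234") - 1;

// Raw literal: the characters \ u 1 2 3 4 are kept as written.
constexpr std::size_t ucn_in_raw_literal = sizeof(R"(\u1234)") - 1;

static_assert(ucn_in_u32_literal == 1, "one code point");
static_assert(ucn_in_u8_literal == 3, "three UTF-8 code units");
static_assert(ucn_in_raw_literal == 6, "six verbatim characters");
```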

Oops.

/transform/form/

Universal escape sequences can then appear:

* in character literals

* non raw string literals

* pp-number, pp-identifier

* header names

* UDLs

This avoids losing the distinction in phase 1 and keeps phase 1 simple.

What grammar production are you using to carry extended characters through phase 1?

In phase 1 we only convert from one character set (source) to another (internal). The internal character set can represent any Unicode character (which makes it the Unicode character set).
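A minimal sketch of such a phase 1, assuming (purely for illustration) that the source file is encoded as ISO/IEC 8859-1 and that the internal representation is a sequence of Unicode code points. The function name is hypothetical. ISO/IEC 8859-1 is chosen because each byte value maps directly to the code point with the same number.

```cpp
#include <string>
#include <string_view>

// Hypothetical phase 1: convert source bytes (assumed ISO/IEC 8859-1) to the
// internal character set, modeled here as UTF-32 code points.
std::u32string decode_latin1(std::string_view bytes) {
    std::u32string out;
    out.reserve(bytes.size());
    for (char c : bytes) {
        // Go through unsigned char so bytes >= 0x80 do not sign-extend.
        out.push_back(static_cast<char32_t>(static_cast<unsigned char>(c)));
    }
    return out;
}
```

Nothing in this step inspects backslashes or produces universal-character-names; a `\u00E9` written in the source stays six separate characters for a later phase to interpret.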

In phase 3, how do you intend to lex tokens before transforming extended characters? There is a chicken/egg problem there.

In phase 3, you have characters, which you can lex as you do today.

If you have a universal-character-name escape sequence, you handle it contextually if you expect one, ignore it in raw literals, etc.

If you have a Unicode code point whose value you do not expect, the program is ill-formed (which is already the case).

The only difference is that instead of distinguishing basic characters from extended characters, there are only code points, which are treated according to the grammar rules.
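One way to picture this is the C++17 sketch below, which lexes an identifier directly from code points and decodes a `\uXXXX` escape only because the identifier context expects one. All names are hypothetical. The identifier test is a deliberately crude assumption (ASCII letters, digits, underscore, or anything at or above U+0080), not the real grammar, and the sketch neither validates decoded values nor rejects a leading digit.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Simplified stand-in for the identifier grammar (assumption, see lead-in).
bool is_identifier_code_point(char32_t cp) {
    return (cp >= U'a' && cp <= U'z') || (cp >= U'A' && cp <= U'Z') ||
           (cp >= U'0' && cp <= U'9') || cp == U'_' || cp >= 0x80;
}

// If src at pos spells \uXXXX, return its code point and advance pos past it.
// Otherwise return nullopt and leave pos unchanged.
std::optional<char32_t> try_decode_ucn(std::u32string_view src, std::size_t& pos) {
    if (pos + 6 > src.size() || src[pos] != U'\\' || src[pos + 1] != U'u')
        return std::nullopt;
    char32_t value = 0;
    for (std::size_t i = pos + 2; i < pos + 6; ++i) {
        const char32_t c = src[i];
        char32_t digit = 0;
        if (c >= U'0' && c <= U'9') digit = c - U'0';
        else if (c >= U'a' && c <= U'f') digit = c - U'a' + 10;
        else if (c >= U'A' && c <= U'F') digit = c - U'A' + 10;
        else return std::nullopt;
        value = value * 16 + digit;
    }
    pos += 6;
    return value;
}

// Lex one identifier starting at pos, returning it as code points.
std::u32string lex_identifier(std::u32string_view src, std::size_t& pos) {
    std::u32string id;
    while (pos < src.size()) {
        if (auto cp = try_decode_ucn(src, pos)) {
            id.push_back(*cp);            // escape handled contextually
        } else if (is_identifier_code_point(src[pos])) {
            id.push_back(src[pos++]);     // code point taken as-is
        } else {
            break;                        // end of the identifier
        }
    }
    return id;
}
```

Under these assumptions, the inputs `café` and `caf\u00E9` yield the same identifier, while a raw string lexer would simply never call `try_decode_ucn`, so there is nothing to revert.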

Tom.