I believe the cost of universal-character-names can be minimized if they vanish in phase 1. A universal-character-name is then merely an escape sequence that is allowed to exist outside literals and is immediately translated to a code point.
I think the revert mechanism is also a bit broken. Perhaps we could instead require a view of the original text that corresponds to the token? That is, describe the mechanism that actually exists, even in lex and yacc.
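
To make the two ideas above concrete, here is a rough C++17 sketch, not a description of any existing compiler. It assumes ASCII input, and the names Phase1Result, translate_ucns and original_spelling are made up for illustration. Escapes become code points right away, and an offset table lets a later phase recover the original spelling of any token instead of "reverting" anything. It deliberately ignores details such as an escaped backslash inside a literal and range checks on the code point.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical result of a phase-1-like pass.
struct Phase1Result {
    std::u32string code_points;        // text after escapes are translated
    std::vector<std::size_t> offsets;  // offsets[i]: where code_points[i] starts in the source;
                                       // one extra sentinel entry equals the source size
};

// Returns the value of a hex digit, or -1 if c is not a hex digit.
inline int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Translates every \uXXXX and \UXXXXXXXX to a single code point.
// Anything that is not a complete escape is copied through unchanged.
inline Phase1Result translate_ucns(std::string_view src) {
    Phase1Result out;
    std::size_t i = 0;
    while (i < src.size()) {
        std::size_t digits = 0;
        if (src[i] == '\\' && i + 1 < src.size()) {
            if (src[i + 1] == 'u') digits = 4;
            else if (src[i + 1] == 'U') digits = 8;
        }
        char32_t value = 0;
        bool ok = digits != 0 && i + 2 + digits <= src.size();
        for (std::size_t k = 0; ok && k < digits; ++k) {
            const int h = hex_value(src[i + 2 + k]);
            if (h < 0) ok = false;
            else value = value * 16 + static_cast<char32_t>(h);
        }
        out.offsets.push_back(i);
        if (ok) {
            out.code_points.push_back(value);
            i += 2 + digits;
        } else {
            out.code_points.push_back(static_cast<unsigned char>(src[i]));
            ++i;
        }
    }
    out.offsets.push_back(src.size());  // sentinel
    return out;
}

// The "view of the original text" for the token spanning code points [first, last).
inline std::string_view original_spelling(std::string_view src, const Phase1Result& r,
                                          std::size_t first, std::size_t last) {
    return src.substr(r.offsets[first], r.offsets[last] - r.offsets[first]);
}
```

With something like this, a raw string literal would take its contents from original_spelling rather than from the translated text.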
 

On Thu, May 28, 2020, 05:00 Corentin Jabot via SG16 <sg16@lists.isocpp.org> wrote:


On Thu, 28 May 2020 at 10:51, Alisdair Meredith <alisdairm@me.com> wrote:
Sorry for being slow, but could you explain what you mean
by reification?

Haha no reason to be sorry at all!

The reflection proposals, notably https://wg21.link/p1240r1, have a mechanism to go from a string to an identifier.
The proposed syntax seems to be [: "foo" :].
(More generally, reification is the reverse operation of reflection.)

Since \u and \U are valid in strings, we could use that mechanism to construct identifiers that are not representable in the physical character set.
It is not strictly equivalent to universal character names, as universal character names can appear in macro names whereas this would be limited to C++ identifiers.

 

AlisdairM

On May 28, 2020, at 09:49, Corentin Jabot <corentinjabot@gmail.com> wrote:



On Thu, 28 May 2020 at 10:40, Alisdair Meredith via SG16 <sg16@lists.isocpp.org> wrote:
To be clear that I understand your intent:
If I am working in platform A, and have a 3rd party API supplied
by vendor B - if vendor B uses code-points that I cannot express
directly in the code pages available in my development
environment, then I will no longer have the escape hatch of using
escape sequences to use/wrap that API, and can no longer use it?

Yes, that is the intent. Or you would be able to express it through a mechanism such as reification.
 

Or is the intent that my vendor must find an implementation defined
way of describing every code point, rather than relying on the
portable one defined in the standard?

Nope, that would be worse than the status quo
 

Or that they must support only code pages that can represent all
valid unicode identifiers, no implementation defined extensions in
phase 1 at all?

Nope, that would break a lot of existing code.
 

AlisdairM

On May 28, 2020, at 09:04, Corentin via SG16 <sg16@lists.isocpp.org> wrote:

Hello.
Following some Twitter discussions with Tom, Alisdair and Steve, I would like to propose that lexing should be redefined in terms of Unicode.
This would be mostly a wording change with limited effect on implementations and existing code.

Current state of affairs:

  • Any character not in the basic source character set is converted to a universal character name \uxxxx, whose values map 1-1 to Unicode code points
  • The execution character set is defined in terms of the basic source character set
  • \u and \U sequences can appear in identifiers and strings
  • \u and \U sequences are reverted in raw string literals
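
As a small illustration of the last two points of the status quo, the following C++ sketch should compile with current compilers that accept universal character names in identifiers (GCC and Clang do, as far as I know). The names cafe_value, utf8_size and raw_size are only there for the example.

```cpp
#include <cstddef>

// A universal character name inside an identifier: this declares the variable "café".
constexpr int caf\u00e9 = 42;

// Reads the variable back through the same escaped spelling.
constexpr int cafe_value() { return caf\u00e9; }

// In a UTF-8 string literal, \u00e9 denotes one code point, which is encoded
// as two UTF-8 code units (0xC3 0xA9). sizeof also counts the terminating null.
constexpr std::size_t utf8_size = sizeof(u8"\u00e9") - 1;

// In a raw string literal the replacement is reverted, so the literal holds
// the six characters \ u 0 0 e 9.
constexpr std::size_t raw_size = sizeof(R"(\u00e9)") - 1;

static_assert(utf8_size == 2, "one code point, two UTF-8 code units");
static_assert(raw_size == 6, "raw string literal keeps the escape spelled out");
```

Under the proposal discussed here, the first declaration is the part that would become ill-formed or deprecated; the two literals would keep their meaning.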


Proposed, broad strokes

  • In phase 1, abstract physical characters are mapped 1-1 to a sequence of Unicode code points that represent these characters, such that the internal representation and the physical source represent the same sequence of abstract characters. This tightens what transformations implementers can do in phase 1
  • Additionally in phase 1, we want to mandate that compilers support source files that are UTF-8-encoded (i.e. there must exist some mechanism for the compiler to accept such physical source files; it doesn't need to be the only supported format or even the default)
  • The internal representation is a sequence of Unicode code points, but the way it is represented or stored is not specified. This would still allow implementations to store code points as \uxxxx if they so desired
  • The notion of universal character name is removed; the wording would consistently refer to Unicode code points
  • \u and \U sequences are redefined as escape sequences for string and character literals
  • Raw string literals would only require reverting line splicing
  • The basic execution character sets (narrow and wide) are redefined such that they don't depend on the definition of the basic source character set, but they remain unchanged
  • The notion of basic source character set is removed
  • The source character set is redefined as being the Unicode character set
  • The grammar of identifiers would be redefined in terms of XID_Start + _ and XID_Continue, pending P1949 approval
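
For the UTF-8 case, the phase 1 mapping described above amounts to decoding bytes into code points. Here is a minimal C++17 sketch of that step, assuming the whole file is already in memory; decode_utf8 is a made-up name and this is not how any particular compiler does it. Ill-formed input (overlong forms, surrogates, truncated sequences) is rejected instead of being silently replaced.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Decodes UTF-8 bytes into Unicode code points.
// Returns std::nullopt if the input is not well-formed UTF-8.
inline std::optional<std::u32string> decode_utf8(std::string_view bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        char32_t min = 0;  // smallest code point allowed for this length
        if (lead < 0x80) {
            len = 1; cp = lead; min = 0;
        } else if ((lead & 0xE0) == 0xC0) {
            len = 2; cp = static_cast<char32_t>(lead & 0x1F); min = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            len = 3; cp = static_cast<char32_t>(lead & 0x0F); min = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            len = 4; cp = static_cast<char32_t>(lead & 0x07); min = 0x10000;
        } else {
            return std::nullopt;  // stray continuation byte or invalid lead byte
        }
        if (i + len > bytes.size()) return std::nullopt;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return std::nullopt;  // not a continuation byte
            cp = (cp << 6) | static_cast<char32_t>(c & 0x3F);
        }
        // Reject overlong encodings, values past U+10FFFF, and surrogates.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return std::nullopt;
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Other source encodings would need their own mapping to code points; the rest of lexing would only ever see the resulting sequence.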

The intent with these changes is to limit modifications to the behavior or implementation of existing implementations. There is, however, one breaking behavior change:

Identifiers which contain \u or \U escape sequences would become ill-formed, since with these new rules \u and \U can only appear in string and character literals.

I suggest that we either
- make such identifiers ill-formed, or
- make such identifiers deprecated.

The reason is that this feature is not well-motivated and is hard to use properly. Such identifiers exist for compatibility between files of different encodings which cannot represent the same characters, but in practice they can only be used on extern identifiers and identifiers declared in modules or imported headers, as most implementations do not provide a per-header encoding selection mechanism. Such identifiers are also hardly readable.

The same result could be achieved with a reification operator such as the one proposed by P1240, i.e.: [: "foo\u0300" :] = 42;

The hope is that these changes would make how lexing is performed less confusing for everyone involved.
I do expect this effort to be rather involved (and I am terrible at wording).
What do you think?
Anyone willing to help over the next couple of years?


Cheers,

Corentin







--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16
