
Re: [SG16] Redefining Lexing in terms of Unicode

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Thu, 28 May 2020 15:28:19 +0200
On Thu, 28 May 2020 at 14:36, Steve Downey <sdowney_at_[hidden]> wrote:

> I believe that the cost of universal-character-names can be minimized if
> they vanish in phase1. It's merely an escape sequence that is allowed to
> exist outside literals that is immediately translated to a codepoint.
> I think the revert mechanism is also a bit broken. Perhaps instead
> require a view of the original text that corresponds to the token? That
> is, describe the mechanism that actually exists, even in lex and yacc.
>

I agree with that.
The standard is trying to have sequential, distinct phases, while
implementations apply different lexing rules contextually.

(I still question the value of universal-character-names; we should
investigate whether they are ever used in production code.)



>
>
> On Thu, May 28, 2020, 05:00 Corentin Jabot via SG16 <sg16_at_[hidden]>
> wrote:
>
>>
>>
>> On Thu, 28 May 2020 at 10:51, Alisdair Meredith <alisdairm_at_[hidden]> wrote:
>>
>>> Sorry for being slow, but could you explain what you mean
>>> by reification?
>>>
>>
>> Haha no reason to be sorry at all!
>>
>> The reflection proposals, notably https://wg21.link/p1240r1, have a
>> mechanism to go from a string to an identifier.
>> The proposed syntax seems to be [: "foo" :].
>> (More generally, reification is the reverse operation of reflection.)
>>
>> Since \u and \U are valid in strings, we could use that mechanism to
>> construct identifiers that are not representable in the physical
>> character set.
>> It is not strictly equivalent to universal-character-names, as those can
>> appear in macro names whereas this would be limited to C++ identifiers.
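For context, \u escapes are already interpreted inside character and string literals in standard C++ today, independently of any proposed reification syntax; a minimal sketch:

```cpp
#include <cstddef>

// \u escapes are interpreted inside literals today: u'\u00E9' denotes
// U+00E9 (LATIN SMALL LETTER E WITH ACUTE) regardless of how the
// source file itself is encoded.
constexpr char16_t e_acute = u'\u00E9';

// In a u"" literal the escape becomes a single UTF-16 code unit,
// so "caf\u00E9" is four code units: 'c', 'a', 'f', U+00E9.
constexpr char16_t cafe[] = u"caf\u00E9";
constexpr std::size_t cafe_len = sizeof(cafe) / sizeof(char16_t) - 1;
```

A reification operator could reuse exactly this literal machinery to name an identifier from a string.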
>>
>>
>>
>>>
>>> AlisdairM
>>>
>>> On May 28, 2020, at 09:49, Corentin Jabot <corentinjabot_at_[hidden]>
>>> wrote:
>>>
>>>
>>>
>>> On Thu, 28 May 2020 at 10:40, Alisdair Meredith via SG16 <
>>> sg16_at_[hidden]> wrote:
>>>
>>>> To be clear that I understand your intent:
>>>> If I am working in platform A, and have a 3rd party API supplied
>>>> by vendor B - if vendor B uses code-points that I cannot express
>>>> directly in the code pages available in my development
>>>> environment, then I will no longer have the escape hatch of using
>>>> escape sequences to use/wrap that API, and can no longer use it?
>>>>
>>>
>>> Yes, that is the intent, or you would be able to express it through a
>>> mechanism such as reification.
>>>
>>>
>>>>
>>>> Or is the intent that my vendor must find an implementation defined
>>>> way of describing every code point, rather than relying on the
>>>> portable one defined in the standard?
>>>>
>>>
>>> Nope, that would be worse than the status quo.
>>>
>>>
>>>>
>>>> Or that they must support only code pages that can represent all
>>>> valid unicode identifiers, no implementation defined extensions in
>>>> phase 1 at all?
>>>>
>>>
>>> Nope, that would break a lot of existing code.
>>>
>>>
>>>>
>>>> AlisdairM
>>>>
>>>> On May 28, 2020, at 09:04, Corentin via SG16 <sg16_at_[hidden]>
>>>> wrote:
>>>>
>>>> Hello.
>>>> Following some Twitter discussions with Tom, Alisdair and Steve, I
>>>> would like to propose that lexing should be redefined in terms of Unicode.
>>>> This would be mostly a wording change with limited effect on
>>>> implementations and existing code.
>>>>
>>>> Current state of affairs:
>>>>
>>>> Any character not in the basic source character set is converted to a
>>>> universal-character-name \uxxxx, whose values map 1-1 to Unicode code
>>>> points.
>>>> The execution character set is defined in terms of the basic source
>>>> character set.
>>>> \u and \U sequences can appear in identifiers and strings.
>>>> \u and \U sequences are reverted in raw string literals.
>>>>
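The last two points can be observed directly with any current compiler; a small illustration, assuming only standard behavior:

```cpp
#include <cstring>

// Under current rules a \u escape is valid inside an identifier:
// caf\u00E9 names an ordinary variable.
constexpr int caf\u00E9 = 42;
constexpr int cafe_value = caf\u00E9;

// Inside a raw string literal the \u sequence is reverted, so these
// six source characters survive verbatim instead of becoming U+00E9.
constexpr const char raw[] = R"(\u00E9)";
```

Under the proposal below, the identifier form would become ill-formed (or deprecated) while the raw-string behavior would fall out naturally, since \u would never have been translated in the first place.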
>>>>
>>>> Proposed changes, in broad strokes:
>>>>
>>>>
>>>> - In phase 1, abstract physical characters are mapped 1-1 to a
>>>> sequence of Unicode code points that represent those characters, such that
>>>> the internal representation and the physical source represent the same
>>>> sequence of abstract characters. This tightens what transformations
>>>> implementers can do in phase 1.
>>>> - Additionally in phase 1, we want to mandate that compilers support
>>>> source files that are UTF-8-encoded (i.e. there must exist some mechanism
>>>> for the compiler to accept such physical source files; it doesn't need to
>>>> be the only supported format or even the default).
>>>> - The internal representation is a sequence of Unicode code points,
>>>> but the way it is represented or stored is not specified. This would still
>>>> allow implementations to store code points as \uxxxx if they so desired.
>>>> - The notion of universal-character-name is removed; the wording
>>>> would consistently refer to Unicode code points.
>>>> - \u and \U sequences are redefined as escape sequences for string
>>>> and character literals.
>>>> - Raw string literals would only require reverting line splicing.
>>>> - The basic execution character sets (narrow and wide) are
>>>> redefined such that they don't depend on the definition of the basic
>>>> source character set, but remain otherwise unchanged.
>>>> - The notion of basic source character set is removed.
>>>> - The source character set is redefined as being the Unicode
>>>> character set.
>>>> - The grammar of identifiers would be redefined in terms of
>>>> XID_Start + _ and XID_Continue, pending P1949 approval.
>>>>
>>>>
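The raw-string point refers to phase-2 line splicing, the one transformation raw string literals would still have to undo; current compilers already behave this way, as a quick check shows:

```cpp
#include <cstring>

// Outside raw strings, a backslash immediately followed by a newline
// is spliced away in translation phase 2. Inside a raw string the
// splice is reverted, so both the backslash and the newline remain.
constexpr const char raw_ab[] = R"(a\
b)";
// raw_ab contains 'a', '\\', '\n', 'b' -- four characters.
```

Keeping only this one revert, instead of also reverting \u translation, is what simplifies the raw-string wording.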
>>>> The intent with these changes is to limit modifications to the behavior
>>>> or implementation of existing implementations; there is, however, one
>>>> breaking behavior change:
>>>>
>>>> *Identifiers which contain \u or \U escape sequences would become
>>>> ill-formed, since with these new rules \u and \U can only appear in
>>>> string and character literals.*
>>>>
>>>> I suggest that either
>>>> - we make such identifiers ill-formed, or
>>>> - we make such identifiers deprecated.
>>>>
>>>> The reason is that this feature is not well motivated (such identifiers
>>>> exist for compatibility between files of different encodings which cannot
>>>> represent the same characters, but in practice they can only be used on
>>>> extern identifiers and identifiers declared in modules or imported
>>>> headers, as most implementations do not provide a per-header encoding
>>>> selection mechanism) and is hard to use properly (such identifiers are
>>>> indeed hardly readable).
>>>>
>>>> The same result could be achieved with a reification operator such as
>>>> the one proposed by P1240, i.e. [: "foo\u0300" :] = 42;
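To see why such identifiers are hardly readable: \u0300 is COMBINING GRAVE ACCENT, so foo\u0300 renders like "foò" yet is distinct from a spelling using the precomposed U+00F2. The same confusability can be demonstrated with string literals in standard C++:

```cpp
#include <cstddef>

// "foo\u0300" is 'f', 'o', 'o' followed by U+0300 COMBINING GRAVE
// ACCENT; rendered, it looks like "foò", which is visually
// confusable with the precomposed spelling using U+00F2.
constexpr char16_t combining[]   = u"foo\u0300"; // 4 code units
constexpr char16_t precomposed[] = u"fo\u00F2";  // 3 code units

constexpr std::size_t combining_len =
    sizeof(combining) / sizeof(char16_t) - 1;
constexpr std::size_t precomposed_len =
    sizeof(precomposed) / sizeof(char16_t) - 1;
```

Two identifiers that render identically but compare unequal are a reviewability hazard, which is part of the motivation for deprecating the escape form.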
>>>>
>>>> The hope is that these changes would make it less confusing for anyone
>>>> involved how lexing is performed.
>>>> I do expect this effort to be rather involved (and I am terrible at
>>>> wording).
>>>> What do you think?
>>>> Anyone willing to help over the next couple of years?
>>>>
>>>>
>>>> Cheers,
>>>>
>>>> Corentin
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> SG16 mailing list
>>>> SG16_at_[hidden]
>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>
>>>>
>>>
>>
>

Received on 2020-05-28 08:31:36