Re: [SG16] Redefining Lexing in terms of Unicode

From: Steve Downey <sdowney_at_[hidden]>
Date: Thu, 28 May 2020 08:36:13 -0400
I believe that the cost of universal-character-names can be minimized if
they vanish in phase 1. A universal-character-name is merely an escape
sequence that is allowed to exist outside literals and that is immediately
translated to a code point.
I think the revert mechanism is also a bit broken. Perhaps we could instead
require a view of the original text that corresponds to the token? That
is, describe the mechanism that actually exists, even in lex and yacc.
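
To make the idea concrete, here is a minimal sketch of such a phase 1 step, not a description of any existing compiler. It assumes ASCII-only input for brevity, and the names SourceChar, parse_hex, translate_ucns and original_text are hypothetical. Each \uXXXX or \UXXXXXXXX is replaced by its code point, and every element keeps a view of the original spelling, so nothing has to be "reverted" later.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical element of the post-phase-1 stream: one code point plus a
// view of the source text that produced it.
struct SourceChar {
    char32_t code_point;
    std::string_view spelling;  // points into the original buffer
};

// Convert a run of hexadecimal digits to a value; nullopt if any is invalid.
std::optional<char32_t> parse_hex(std::string_view digits) {
    char32_t value = 0;
    for (char c : digits) {
        value <<= 4;
        if (c >= '0' && c <= '9') value |= static_cast<char32_t>(c - '0');
        else if (c >= 'a' && c <= 'f') value |= static_cast<char32_t>(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') value |= static_cast<char32_t>(c - 'A' + 10);
        else return std::nullopt;
    }
    return value;
}

// Sketch of phase 1: \uXXXX and \UXXXXXXXX become single code points.
// Assumption: input is ASCII; a real implementation would also decode the
// physical encoding and validate the resulting code point.
std::vector<SourceChar> translate_ucns(std::string_view source) {
    std::vector<SourceChar> out;
    std::size_t i = 0;
    while (i < source.size()) {
        if (source[i] == '\\' && i + 1 < source.size() &&
            (source[i + 1] == 'u' || source[i + 1] == 'U')) {
            const std::size_t digits = source[i + 1] == 'u' ? 4 : 8;
            if (i + 2 + digits <= source.size()) {
                if (auto cp = parse_hex(source.substr(i + 2, digits))) {
                    out.push_back({*cp, source.substr(i, 2 + digits)});
                    i += 2 + digits;
                    continue;
                }
            }
        }
        // Any other character maps to itself.
        out.push_back({static_cast<char32_t>(static_cast<unsigned char>(source[i])),
                       source.substr(i, 1)});
        ++i;
    }
    return out;
}

// What a raw string literal would use: the original spelling of the token.
std::string original_text(const std::vector<SourceChar>& chars) {
    std::string text;
    for (const auto& c : chars) text.append(c.spelling);
    return text;
}
```

With this shape, a raw string literal would simply take original_text() of its contents, while identifiers and ordinary literals would only look at code_point.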


On Thu, May 28, 2020, 05:00 Corentin Jabot via SG16 <sg16_at_[hidden]>
wrote:

>
>
> On Thu, 28 May 2020 at 10:51, Alisdair Meredith <alisdairm_at_[hidden]> wrote:
>
>> Sorry for being slow, but could you explain what you mean
>> by reification?
>>
>
> Haha no reason to be sorry at all!
>
> The reflection proposals, notably https://wg21.link/p1240r1, have a
> mechanism to go from a string to an identifier.
> The proposed syntax seems to be [: "foo" :].
> (More generally, reification is the reverse operation of reflection.)
>
> Since \u and \U are valid in strings, we could use that mechanism to
> construct identifiers that are not representable in the physical
> character set.
> It is not strictly equivalent to universal character names, as universal
> character names can appear in macro names whereas this would be limited to
> C++ identifiers.
>
>
>
>>
>> AlisdairM
>>
>> On May 28, 2020, at 09:49, Corentin Jabot <corentinjabot_at_[hidden]>
>> wrote:
>>
>>
>>
>> On Thu, 28 May 2020 at 10:40, Alisdair Meredith via SG16 <
>> sg16_at_[hidden]> wrote:
>>
>>> To be clear that I understand your intent:
>>> If I am working on platform A, and have a 3rd party API supplied
>>> by vendor B - if vendor B uses code-points that I cannot express
>>> directly in the code pages available in my development
>>> environment, then I will no longer have the escape hatch of using
>>> escape sequences to use/wrap that API, and can no longer use it?
>>>
>>
>> Yes, that is the intent. Or you would be able to express it through
>> a mechanism such as reification.
>>
>>
>>>
>>> Or is the intent that my vendor must find an implementation defined
>>> way of describing every code point, rather than relying on the
>>> portable one defined in the standard?
>>>
>>
>> Nope, that would be worse than the status quo
>>
>>
>>>
>>> Or that they must support only code pages that can represent all
>>> valid Unicode identifiers, with no implementation-defined extensions in
>>> phase 1 at all?
>>>
>>
>> Nope, that would break a lot of existing code.
>>
>>
>>>
>>> AlisdairM
>>>
>>> On May 28, 2020, at 09:04, Corentin via SG16 <sg16_at_[hidden]>
>>> wrote:
>>>
>>> Hello.
>>> Following some Twitter discussions with Tom, Alisdair and Steve, I would
>>> like to propose that lexing should be redefined in terms of Unicode.
>>> This would be mostly a wording change with limited effect on
>>> implementations and existing code.
>>>
>>> Current state of affairs:
>>>
>>> - Any character not in the basic source character set is converted to a
>>> universal character name \uxxxx, whose values map 1-1 to Unicode code points.
>>> - The execution character set is defined in terms of the basic source
>>> character set.
>>> - \u and \U sequences can appear in identifiers and strings.
>>> - \u and \U sequences are reverted in raw string literals.
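
The last two points of the current rules can be observed directly. The following is a small sketch, assuming a C++17 compiler that accepts universal-character-names in identifiers; the names ucn_identifier_demo, escaped_size and raw_size are made up for the illustration.

```cpp
#include <cstddef>

// A universal-character-name inside an identifier is accepted today.
// Both occurrences below name the same variable.
inline int ucn_identifier_demo() {
    int caf\u00e9 = 42;
    return caf\u00e9;
}

// In an ordinary (here UTF-8) string literal, \u00e9 denotes one code
// point, which UTF-8 encodes as two code units.
constexpr std::size_t escaped_size = sizeof(u8"\u00e9") - 1;

// In a raw string literal the sequence is reverted to its spelling:
// backslash, 'u', and four hex digits, i.e. six characters.
constexpr std::size_t raw_size = sizeof(R"(\u00e9)") - 1;

static_assert(escaped_size == 2, "U+00E9 takes two UTF-8 code units");
static_assert(raw_size == 6, "raw string keeps the six source characters");
```

Under the proposal, the two string cases would keep their meaning, while the identifier in ucn_identifier_demo would become ill-formed or deprecated.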
>>>
>>>
>>> Proposed, broad strokes
>>>
>>>
>>> - In phase 1, abstract physical characters are mapped 1-1 to a
>>> sequence of Unicode code points that represent these characters, such that
>>> the internal representation and the physical source represent the same
>>> sequence of abstract characters. This tightens what
>>> transformations implementers can do in phase 1.
>>> - Additionally in phase 1, we want to mandate that compilers support
>>> source files that are UTF-8 encoded (i.e. there must exist some mechanism for
>>> the compiler to accept such physical source files; it doesn't need to be
>>> the only supported format or even the default).
>>> - The internal representation is a sequence of Unicode code points,
>>> but the way it is represented or stored is not specified. This would still
>>> allow implementations to store code points as \uxxxx if they so desired.
>>> - The notion of universal character name is removed, the wording
>>> would consistently refer to Unicode code points
>>> - \u and \U sequences are redefined as escape sequences for string
>>> and character literals.
>>> - raw string literals would only require reverting line splitting
>>> - The basic execution character sets (narrow and wide) are redefined
>>> such that they don't depend on the definition of the basic source character
>>> set, but they otherwise remain unchanged.
>>> - The notion of basic source character set is removed.
>>> - The source character set is redefined as being the Unicode character
>>> set.
>>> - The grammar of identifiers would be redefined in terms of
>>> XID_Start + _ and XID_Continue, pending P1949 approval.
>>>
>>>
>>> The intent with these changes is to limit modifications to the behavior
>>> or implementation of existing implementations. There is, however, one
>>> breaking behavior change:
>>>
>>> *Identifiers which contain \u or \U escape sequences would become
>>> ill-formed, since with these new rules \u and \U can only appear in string
>>> and character literals.*
>>>
>>> I suggest that either
>>> - We make such identifiers ill-formed, or
>>> - We make such identifiers deprecated.
>>>
>>> The reason is that this feature is not well motivated (such identifiers
>>> exist for compatibility between files of different encodings which cannot
>>> represent the same characters, but in practice they
>>> can only be used on extern identifiers and identifiers declared in
>>> modules or imported headers, as most implementations do not provide a
>>> per-header encoding selection mechanism), and
>>> is hard to use properly (such identifiers are indeed hardly readable).
>>>
>>> The same result could be achieved with a reification operator such as the
>>> one proposed by P1240, i.e.: [: "foo\u0300" :] = 42;
>>>
>>> The hope is that these changes would make how lexing is performed less
>>> confusing for anyone involved.
>>> I do expect this effort to be rather involved (and I am terrible at
>>> wording).
>>> What do you think?
>>> Anyone willing to help over the next couple of years?
>>>
>>>
>>> Cheers,
>>>
>>> Corentin
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> SG16 mailing list
>>> SG16_at_[hidden]
>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>
>>>
>>>
>>
>

Received on 2020-05-28 07:39:29