Re: [SG16] Redefining Lexing in terms of Unicode

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Thu, 28 May 2020 11:00:16 +0200
On Thu, 28 May 2020 at 10:51, Alisdair Meredith <alisdairm_at_[hidden]> wrote:

> Sorry for being slow, but could you explain what you mean
> by reification?
>

Haha no reason to be sorry at all!

The reflection proposals, notably https://wg21.link/p1240r1, have a
mechanism to go from a string to an identifier.
The proposed syntax seems to be [: "foo" :].
(More generally, reification is the reverse operation of reflection.)

Since \u and \U are valid in strings, we could use that mechanism to
construct identifiers that are not representable in the physical
character set.
It is not strictly equivalent to universal character names, as universal
character names can appear in macro names, whereas this would be limited to
C++ identifiers.
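
To make the difference concrete, here is a minimal sketch of what is valid
today, assuming a compiler that accepts universal character names in
identifiers (recent GCC and Clang do). The names caf\u00e9, CAF\u00c9_LIMIT
and the helper functions are made up for illustration. The reification
syntax itself is not shown because it is only proposed and nothing
implements it in a shipping compiler.

```cpp
#include <cassert>

// Valid today: the identifier is spelled with the UCN for U+00E9, so the
// source file itself stays pure ASCII.
const int caf\u00e9 = 42;

// UCNs may also appear in macro names. A string-to-identifier reification
// operator would not cover this case, since it only produces C++ identifiers.
#define CAF\u00c9_LIMIT 7

// Hypothetical helpers that refer to the entities above through their UCN
// spelling.
int read_ucn_identifier() { return caf\u00e9; }
int read_ucn_macro() { return CAF\u00c9_LIMIT; }

// Small self-check that both spellings resolve as expected.
void self_check() {
    assert(read_ucn_identifier() == 42);
    assert(read_ucn_macro() == 7);
}
```

Under the proposed rules the first declaration would have to be written
either with the character itself in the source file or through reification,
and the macro would have no escaped spelling at all.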



>
> AlisdairM
>
> On May 28, 2020, at 09:49, Corentin Jabot <corentinjabot_at_[hidden]> wrote:
>
>
>
> On Thu, 28 May 2020 at 10:40, Alisdair Meredith via SG16 <
> sg16_at_[hidden]> wrote:
>
>> To be clear that I understand your intent:
>> If I am working in platform A, and have a 3rd party API supplied
>> by vendor B - if vendor B uses code-points that I cannot express
>> directly in the code pages available in my development
>> environment, then I will no longer have the escape hatch of using
>> escape sequences to use/wrap that API, and can no longer use it?
>>
>
> Yes, that is the intent. Or you would be able to express it through a
> mechanism such as reification.
>
>
>>
>> Or is the intent that my vendor must find an implementation defined
>> way of describing every code point, rather than relying on the
>> portable one defined in the standard?
>>
>
> Nope, that would be worse than the status quo
>
>
>>
>> Or that they must support only code pages that can represent all
>> valid Unicode identifiers, with no implementation-defined extensions in
>> phase 1 at all?
>>
>
> Nope, that would break a lot of existing code.
>
>
>>
>> AlisdairM
>>
>> On May 28, 2020, at 09:04, Corentin via SG16 <sg16_at_[hidden]>
>> wrote:
>>
>> Hello.
>> Following some Twitter discussions with Tom, Alisdair and Steve, I would
>> like to propose that lexing should be redefined in terms of Unicode.
>> This would be mostly a wording change with limited effect on
>> implementations and existing code.
>>
>> Current state of affairs:
>>
>> Any character not in the basic source character set is converted to a
>> universal character name \uxxxx, whose values map 1-1 to Unicode code points.
>> The execution character set is defined in terms of the basic source
>> character set.
>> \u and \U sequences can appear in identifiers and strings.
>> \u and \U sequences are reverted in raw string literals.
>>
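
A small sketch of the last two points, as they can be observed today. The
function names are made up for illustration. The u8 prefix is used so that
the result does not depend on the implementation's ordinary literal encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// In a u8 string literal, \u00e9 denotes the single code point U+00E9,
// which UTF-8 encodes as two code units (0xC3 0xA9), plus the terminator.
static_assert(sizeof(u8"\u00e9") == 3, "two UTF-8 code units plus NUL");

// In a raw string literal the sequence is reverted: the literal holds the
// six characters backslash, 'u', '0', '0', 'e', '9', plus the terminator.
static_assert(sizeof(R"(\u00e9)") == 7, "six characters plus NUL");

// Hypothetical helpers exposing the same facts at run time.
std::size_t utf8_literal_length() { return sizeof(u8"\u00e9") - 1; }
std::size_t raw_literal_length() { return std::strlen(R"(\u00e9)"); }
```

With \u and \U redefined as plain escape sequences, the raw string case
would need no special reverting step, because escape sequences are never
interpreted inside raw strings in the first place.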
>>
>> Proposed, broad strokes
>>
>>
>> - In phase 1, abstract physical characters are mapped 1-1 to a
>> sequence of Unicode code points that represent these characters, such that
>> the internal representation and the physical source represent the same
>> sequence of abstract characters. This tightens what
>> transformations implementers can do in phase 1.
>> - Additionally in phase 1, we want to mandate that compilers support
>> source files that are UTF-8 encoded (i.e. there must exist some mechanism for
>> the compiler to accept such physical source files; it doesn't need to be
>> the only supported format or even the default).
>> - The internal representation is a sequence of Unicode code points, but
>> the way it is represented or stored is not specified. This would still
>> allow implementations to store code points as \uxxxx if they so desired.
>> - The notion of universal character name is removed; the wording
>> would consistently refer to Unicode code points.
>> - \u and \U sequences are redefined as escape sequences for string
>> and character literals.
>> - Raw string literals would only require reverting line splicing.
>> - The basic execution character sets (narrow and wide) are redefined
>> such that they don't depend on the definition of the basic source character
>> set, but they remain unchanged.
>> - The notion of basic source character set is removed.
>> - The source character set is redefined as being the Unicode character set.
>> - The grammar of identifiers would be redefined in terms of
>> XID_Start + _ and XID_Continue, pending P1949 approval.
>>
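
As an illustration of the first three points, here is a rough sketch of a
phase 1 that accepts UTF-8 input and produces a sequence of code points.
decode_utf8 is a hypothetical helper, not taken from any compiler, and
std::vector<char32_t> is just one possible internal representation; the
proposal deliberately leaves that choice unspecified.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical phase 1 helper: decode UTF-8 bytes into Unicode code points,
// rejecting ill-formed input instead of guessing.
std::vector<char32_t> decode_utf8(const std::string& bytes) {
    // Smallest code point that legitimately needs a sequence of length 1..4.
    static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > bytes.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong encodings, surrogates and values past U+10FFFF.
        if (cp < min_for_len[len] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a Unicode scalar value");
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For example, the five bytes of "caf\xC3\xA9" decode to four code points, the
last being U+00E9. An implementation reading a legacy code page would do the
same job with a mapping table instead of this arithmetic, as long as the
result denotes the same sequence of abstract characters.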
>>
>> The intent with these changes is to limit modifications to the behavior
>> or implementation of existing implementations. There is, however, one
>> breaking behavior change:
>>
>> *Identifiers which contain \u or \U escape sequences would become
>> ill-formed, since with these new rules \u and \U can only appear in string
>> and character literals.*
>>
>> I suggest that either
>> - We make such identifiers ill-formed, or
>> - We make such identifiers deprecated.
>>
>> The reason is that this feature is not well motivated (such identifiers
>> exist for compatibility between files of different encodings which cannot
>> represent the same characters, but in practice they
>> can only be used on extern identifiers and identifiers declared in
>> modules or imported headers, as most implementations do not provide a
>> per-header encoding selection mechanism), and it
>> is hard to use properly (such identifiers are indeed hardly readable).
>>
>> The same result could be achieved with a reification operator such as the
>> one proposed by P1240, i.e.: [: "foo\u0300" :] = 42;
>>
>> The hope is that these changes would make it less confusing for anyone
>> involved to understand how lexing is performed.
>> I do expect this effort to be rather involved (and I am terrible at
>> wording).
>> What do you think?
>> Anyone willing to help over the next couple of years?
>>
>>
>> Cheers,
>>
>> Corentin
>>
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>>
>
>

Received on 2020-05-28 04:03:33