Date: Thu, 28 May 2020 10:49:40 +0200
On Thu, 28 May 2020 at 10:40, Alisdair Meredith via SG16 <
sg16_at_[hidden]> wrote:
> To be clear that I understand your intent:
> If I am working in platform A, and have a 3rd party API supplied
> by vendor B - if vendor B uses code-points that I cannot express
> directly in the code pages available in my development
> environment, then I will no longer have the escape hatch of using
> escape sequences to use/wrap that API, and can no longer use it?
>
Yes, that is the intent. Or you would be able to express it through a
mechanism such as reification.
>
> Or is the intent that my vendor must find an implementation defined
> way of describing every code point, rather than relying on the
> portable one defined in the standard?
>
Nope, that would be worse than the status quo.
>
> Or that they must support only code pages that can represent all
> valid unicode identifiers, no implementation defined extensions in
> phase 1 at all?
>
Nope, that would break a lot of existing code.
>
> AlisdairM
>
> On May 28, 2020, at 09:04, Corentin via SG16 <sg16_at_[hidden]>
> wrote:
>
> Hello.
> Following some Twitter discussions with Tom, Alisdair and Steve, I would
> like to propose that lexing should be redefined in terms of Unicode.
> This would be mostly a wording change with limited effect on
> implementations and existing code.
>
> Current state of affairs:
>
> - Any character not in the basic source character set is converted to a
> universal character name \uxxxx, whose values map 1-1 to Unicode code points
> - The execution character set is defined in terms of the basic source
> character set
> - \u and \U sequences can appear in identifiers and strings
> - \u and \U sequences are reverted in raw string literals.
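>
> For illustration, a rough sketch of the status quo (the é spelling below
> assumes a source encoding that can represent that character):
>
>   int caf\u00e9 = 0;            // a UCN may appear directly in an identifier
>   const char* s = "caf\u00e9";  // and in an ordinary string literal
>   const char* r = R"(café)";    // é is mapped to \u00e9 in phase 1, and that
>                                 // mapping is reverted inside the raw literal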
>
>
> Proposed, broad strokes
>
>
> - In phase 1, abstract physical characters are mapped 1-1 to a
> sequence of Unicode code points that represent these characters, such that
> the internal representation and the physical source represent the same
> sequence of abstract characters. This tightens which
> transformations implementers can perform in phase 1
> - Additionally in phase 1, we want to mandate that compilers support
> source files that are UTF-8 encoded (i.e. there must exist some mechanism for
> the compiler to accept such physical source files; it doesn't need to be
> the only supported format or even the default)
> - The internal representation is a sequence of Unicode code points, but
> the way it is represented or stored is not specified. This would still
> allow implementations to store code points as \uxxxx if they so desired.
> - The notion of universal character name is removed; the wording would
> consistently refer to Unicode code points
> - \u and \U sequences are redefined as escape sequences for string and
> character literals.
> - Raw string literals would only require reverting line splicing
> - The basic execution character sets (narrow and wide) are redefined
> such that they don't depend on the definition of the basic source character
> set, but they otherwise remain unchanged
> - The notion of basic source character set is removed
> - The source character set is redefined as being the Unicode character set
> - The grammar of identifiers would be redefined in terms of XID_Start +
> _ and XID_Continue, pending P1949 approval
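>
> As a rough sketch of how the proposed model would look (assuming a UTF-8
> encoded source file; none of this is intended as normative wording):
>
>   int café = 0;                 // identifiers are built directly from
>                                 // XID_Start + XID_Continue code points
>   const char* s = "caf\u00e9";  // \u00e9 is now purely an escape sequence,
>                                 // valid only in character and string literals
>   const char* r = R"(café)";    // internally already the code point U+00E9;
>                                 // only line splicing needs to be reverted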
>
>
> The intent of these changes is to limit modifications to the behavior or
> implementation of existing implementations. There is, however, one breaking
> behavior change:
>
> *Identifiers which contain \u or \U escape sequences would become
> ill-formed, since under these new rules \u and \U can only appear in string
> and character literals.*
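>
> For example (the direct é spelling assumes the source encoding can
> represent it):
>
>   int caf\u00e9 = 0;  // OK today; ill-formed (or deprecated) under this proposal
>   int café = 0;       // OK today and under this proposal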
>
> I suggest that either
> - We make such identifiers ill-formed, or
> - We make such identifiers deprecated.
>
> The reason is that this feature is not well-motivated (such identifiers
> exist for compatibility between files of different encodings which cannot
> represent the same characters, but in practice
> they can only be used for extern identifiers and identifiers declared in
> modules or imported headers, as most implementations do not provide a
> per-header encoding selection mechanism), and it
> is hard to use properly (such identifiers are indeed hardly readable).
>
> The same result could be achieved with a reification operator such as the
> one proposed by P1240, e.g.: [: "foo\u0300" :] = 42;
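>
> A purely hypothetical sketch, extrapolating from the splice syntax above
> (the exact P1240 syntax is not settled):
>
>   [: "foo\u0300" :] = 42;  // the \u escape is interpreted inside the string
>                            // literal, so the splice would designate the
>                            // identifier foo + U+0300 (combining grave accent)
>                            // without spelling it directly in the source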
>
> The hope is that these changes would make it less confusing for anyone
> involved how lexing is performed.
> I do expect this effort to be rather involved (and I am terrible at
> wording).
> What do you think?
> Is anyone willing to help over the next couple of years?
>
>
> Cheers,
>
> Corentin
>
>
>
>
>
>
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
Received on 2020-05-28 03:52:57