sg16: Re: [SG16] Redefining Lexing in terms of Unicode

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Thu, 28 May 2020 10:53:24 +0200

On Thu, 28 May 2020 at 10:20, Peter Brett via SG16 <sg16_at_[hidden]>
wrote:

> Hi Corentin,
>
>
>
> What is the benefit of complecting together these two issues?
>
>
>
> - Redefining lexing in terms of a nominal Unicode internal encoding
> and eliminating universal-character-name
> - Restricting where \u or \U sequences may appear
>
>
Supporting \u in identifiers is rather costly in terms of wording. We
could keep it if there is a strong motivation for it, but that would
somewhat weaken the motivation for some of the wording changes.
One of the issues is that these escape sequences are reverted in raw string
literals, which seems to require quite a bit of machinery.
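
To make the two cases concrete, here is a minimal sketch of what the current rules allow. The function names are made up for illustration, and I use char16_t literals only so that the lengths do not depend on the narrow execution encoding:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// A universal-character-name inside an identifier (U+00E9).
// Accepted today; this is the spelling the proposal would make
// ill-formed or deprecated.
inline int identifier_with_ucn() {
    int caf\u00e9 = 42;
    return caf\u00e9;
}

// In an ordinary literal, \u00e9 is processed and denotes one code point,
// which is a single UTF-16 code unit here.
inline std::size_t escaped_length() {
    return std::u16string(u"\u00e9").size();
}

// In a raw literal, the six source characters \ u 0 0 e 9 are kept as
// written, which is why any earlier conversion has to be reverted.
inline std::size_t raw_length() {
    return std::u16string(uR"(\u00e9)").size();
}
```

With the proposed rules, only `identifier_with_ucn` would be affected. The two literals would keep their meaning, but the raw one would no longer need the "convert, then revert" description.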

> Best regards,
>
>
>
> Peter
>
>
>
> *From:* SG16 <sg16-bounces_at_[hidden]> *On Behalf Of *Corentin via
> SG16
> *Sent:* 28 May 2020 09:04
> *To:* SG16 <sg16_at_[hidden]>
> *Cc:* Corentin <corentin.jabot_at_[hidden]>
> *Subject:* [SG16] Redefining Lexing in terms of Unicode
>
>
>
> Hello.
>
> Following some Twitter discussions with Tom, Alisdair and Steve, I would
> like to propose that lexing should be redefined in terms of Unicode.
>
> This would be mostly a wording change with limited effect on
> implementations and existing code.
>
>
>
> Current state of affairs:
>
>
>
> Any character not in the basic source character set is converted to a
> universal character name \uxxxx, whose values map 1-1 to Unicode code points
>
> The execution character set is defined in terms of the basic source
> character set
>
> \u and \U sequences can appear in identifiers and strings
>
> \u and \U sequences are reverted in raw string literals.
>
>
>
>
>
> Proposed, broad strokes
>
>
>
> - In phase 1, abstract physical characters are mapped 1-1 to a
> sequence of Unicode code points that represent these characters, such that
> the internal representation and the physical source represent the same
> sequence of abstract characters. This tightens what
> transformations implementers can do in phase 1.
> - Additionally in phase 1, we want to mandate that compilers support
> source files that are UTF-8-encoded (i.e. there must exist some mechanism for
> the compiler to accept such physical source files; it doesn't need to be
> the only supported format or even the default).
> - The internal representation is a sequence of Unicode code points, but
> the way it is represented or stored is not specified. This would still
> allow implementations to store code points as \uxxxx if they so desired.
> - The notion of universal character name is removed; the wording would
> consistently refer to Unicode code points.
> - \u and \U sequences are redefined as escape sequences for string and
> character literals.
> - Raw string literals would only require reverting line splicing.
> - The basic execution character sets (narrow and wide) are redefined
> such that they don't depend on the definition of the basic source character set
> - but they remain unchanged.
> - The notion of basic source character set is removed.
> - The source character set is redefined as being the Unicode character set.
> - The grammar of identifiers would be redefined in terms of XID_Start +
> _ and XID_Continue, pending P1949 approval.
>
>
>
> The intent with these changes is to limit changes to how existing
> implementations behave or are built. There is, however, one breaking
> behavior change:
>
>
>
> *Identifiers which contain \u or \U escape sequences would become
> ill-formed, since with these new rules \u and \U can only appear in string
> and character literals.*
>
>
>
> I suggest that we either
>
> - make such identifiers ill-formed, or
>
> - make such identifiers deprecated.
>
>
>
> The reason is that this feature is not well-motivated (such identifiers
> exist for compatibility between files of different encodings which cannot
> represent the same characters, but in practice they
> can only be used on extern identifiers and identifiers declared in modules
> or imported headers, as most implementations do not provide a per-header
> encoding selection mechanism), and it
> is hard to use properly (such identifiers are hardly readable).
>
>
>
> The same result could be achieved with a reification operator such as the one
> proposed by P1240, i.e.: [: "foo\u0300" :] = 42;
>
>
>
> The hope is that these changes would make how lexing is performed less
> confusing for anyone involved.
>
> I do expect this effort to be rather involved (and I am terrible at
> wording).
>
> What do you think?
>
> Anyone willing to help over the next couple of years?
>
>
>
>
>
> Cheers,
>
>
>
> Corentin
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
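
As a rough illustration of the phase 1 mapping in the proposal quoted above, here is a sketch of turning a UTF-8 encoded physical source into a sequence of Unicode code points. `decode_utf8` is a hypothetical helper, not taken from any compiler, and the proposal deliberately leaves the internal storage unspecified, so the `std::vector<char32_t>` is only one possible choice. It assumes C++17.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical helper: decode UTF-8 bytes into Unicode code points.
// Returns std::nullopt if the input is not well-formed UTF-8.
inline std::optional<std::vector<char32_t>> decode_utf8(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        // The lead byte determines the length of the encoded sequence.
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else return std::nullopt;  // stray continuation or invalid lead byte
        if (i + len > bytes.size()) return std::nullopt;  // truncated input
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong encodings, surrogates, and values past U+10FFFF.
        static constexpr char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_for_len[len] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF)) {
            return std::nullopt;
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For example, the five bytes of "caf\xC3\xA9" decode to four code points, the last being U+00E9. Rejecting malformed input rather than substituting a replacement character is a choice made for this sketch only; the proposal does not say what an implementation should do there.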

Received on 2020-05-28 03:56:41