Re: [SG16] Redefining Lexing in terms of Unicode

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Thu, 28 May 2020 09:48:33 -0400
On Thu, May 28, 2020 at 4:04 AM Corentin via SG16 <sg16_at_[hidden]> wrote:

> Hello.
> Following some Twitter discussions with Tom, Alisdair and Steve, I would
> like to propose that lexing should be redefined in terms of Unicode.
> This would be mostly a wording change with limited effect on
> implementations and existing code.
> Current state of affairs:
> Any character not in the basic source character set is converted to a
> universal character name \uxxxx, whose values map 1-1 to Unicode code points
> The execution character set is defined in terms of the basic source
> character set
> \u and \U sequences can appear in identifiers and strings
> \u and \U sequences are reverted in raw string literals.
> Proposed, broad strokes
> - In phase 1, abstract physical characters are mapped 1-1 to a
> sequence of Unicode code points that represent these characters, such that
> the internal representation and the physical source represent the same
> sequence of abstract characters. This tightens what
> transformations implementers can do in phase 1
> Please note that the trigraph removal paper was presented as allowing
continued recognition of trigraphs via the implementation-defined phase 1
remapping (although that was problematic for raw strings).
Please note that the understanding of what constitutes line breaks differs
between implementations and that, technically, whitespace after a backslash
and before a line break is significant. The details of the proposed change
will determine whether or not such significant whitespace becomes more of a problem.

> - Additionally in phase 1, we want to mandate that compilers support
> source files that are UTF-8 encoded (i.e. there must exist some mechanism for
> the compiler to accept such physical source files; it doesn't need to be
> the only supported format or even the default)
> - The internal representation is a sequence of Unicode code points, but
> the way it is represented or stored is not specified. This would still
> allow implementations to store code-points as \uxxxx if they so desired.
> Not directly in phase 1 if implementations have to deal with "\<what is a
Unicode character today>" as if the \\ would not appear.

> - The notion of universal character name is removed, the wording would
> consistently refer to Unicode code points
> - \u and \U sequences are redefined as escape sequences for string and
> character literals.
> - raw string literals would only require reverting line splicing
> - The basic execution character sets (narrow and wide) are redefined
> such that they don't depend on the definition of basic source character set
> - but they remain unchanged
> - The notion of basic source character set is removed
> Please address the display behaviour required for static_assert for ANSI
escape sequences, etc.
Please also address all the uses of this term in the library section.

> - Source character set is redefined as being the Unicode character set
> It seems like we're encouraging homoglyph issues. Do we expect open source
projects to maintain coding guidelines that restrict characters outside the
ASCII range?

> - The grammar of identifier would be redefined in terms of XID_Start +
> _ and XID_Continue, pending P1949 approval
> The intent with these changes is to limit modifications to the behavior or
> implementation of existing implementations; there is, however, one breaking
> behavior change:
> *Identifiers which contain \u or \U escape sequences would become
> ill-formed since with these new rules \u and \U can only appear in string
> and characters literals.*
> I suggest that either
> - We make such identifiers ill-formed, or
> - We make such identifiers deprecated.
> The reason is that this feature is not well-motivated (such identifiers
> exist for compatibility between files of different encodings which cannot
> represent the same characters, but in practice
> can only be used on extern identifiers and identifiers declared in modules
> or imported headers, as most implementations do not provide a per-header
> encoding selection mechanism), and
> they are hard to use properly (such identifiers are indeed hardly readable).
It seems that encoding the stuff outside the basic source character set as
UCNs in headers is exactly how one would avoid per-header encoding
selection given the practical reality.
The practical reality being that most encodings that are intermixed have
the same encoded value for most of the members of the basic source
character set.
Thus, we have the concept of a basic source character set (and we also have
digraphs and C's iso646.h).
Therefore, although not absolutely true, simply avoiding characters outside
the basic source character set (and those requiring digraphs, etc.) is
generally good enough for allowing headers to be included for compilation
with source specified (via command line, etc.) as being in different
encodings.

> The same result could be achieved with a reification operator such as
> proposed by P1240, ie: [: "foo\u0300" :] = 42;
This mitigation for the problem you identified is not guaranteed. Lacking
such mitigation, developers would be forced by libraries to switch to
Unicode source even if they do not wish to.
(Okay, there is such a thing as compiler extensions for __asm__ names, but
their usability is limited).

> The hope is that these changes would make it less confusing for anyone
> involved how lexing is performed.
> I do expect this effort to be rather involved (and I am terrible at
> wording).
> What do you think?
> Anyone willing to help over the next couple of years?
> Cheers,
> Corentin
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2020-05-28 08:52:48