Re: [SG16] Redefining Lexing in terms of Unicode

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 28 May 2020 16:10:08 +0200
On Thu, 28 May 2020 at 15:48, Hubert Tong <hubert.reinterpretcast_at_[hidden]>
wrote:

> On Thu, May 28, 2020 at 4:04 AM Corentin via SG16 <sg16_at_[hidden]>
> wrote:
>> Hello.
>> Following some Twitter discussions with Tom, Alisdair and Steve, I would
>> like to propose that lexing should be redefined in terms of Unicode.
>> This would be mostly a wording change with limited effect on
>> implementations and existing code.
>> Current state of affairs:
>> - Any character not in the basic source character set is converted to a
>> universal character name \uxxxx, whose values map 1-1 to Unicode code points
>> - The execution character set is defined in terms of the basic source
>> character set
>> - \u and \U sequences can appear in identifiers and strings
>> - \u and \U sequences are reverted in raw string literals.
>> Proposed, in broad strokes:
>> - In phase 1, abstract physical characters are mapped 1-1 to a
>> sequence of Unicode code points that represent these characters, such that
>> the internal representation and the physical source represent the same
>> sequence of abstract characters. This tightens what
>> transformations implementers can do in phase 1.
> Please note that the trigraph removal paper was presented as allowing
> continued recognition of trigraphs via the implementation-defined phase 1
> remapping (although that was problematic for raw strings).
> Please note that the understanding of what constitutes line breaks differs
> between implementations and that, technically, whitespace after a backslash
> and before a line break is significant. The details of the proposed change
> will determine whether or not such significant whitespace becomes more
> significant.

I sent a mail to core about this specific point.
I think that if we want implementations to be able to omit trailing
whitespace, it might be better to specify that in phase 2 rather than to
allow arbitrary remapping, such that phase 1 would only deal with encoding.
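
To make the phase 2 idea concrete, here is a minimal sketch of line splicing
that also accepts whitespace between the backslash and the line break. The
function name splice_lines is hypothetical, and the sketch assumes that phase 1
has already decoded the file and normalized every line break to '\n'; it is
not taken from any implementation or from proposed wording.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: splice physical lines as phase 2 does, but also treat
// "backslash, spaces/tabs, newline" as a line continuation.
// Assumption: line breaks were already normalized to '\n' in phase 1.
std::string splice_lines(std::string_view src) {
    std::string out;
    out.reserve(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '\\') {
            std::size_t j = i + 1;
            // Skip horizontal whitespace between the backslash and the line end.
            while (j < src.size() && (src[j] == ' ' || src[j] == '\t')) ++j;
            if (j < src.size() && src[j] == '\n') {
                i = j;  // drop the backslash, the whitespace, and the newline
                continue;
            }
        }
        out.push_back(src[i]);
    }
    return out;
}
```

With this rule the trailing whitespace is handled entirely in phase 2, so
phase 1 does not need any remapping latitude beyond decoding.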

>> - Additionally in phase 1, we want to mandate that compilers support
>> source files that are UTF-8-encoded (i.e. there must exist some mechanism for
>> the compiler to accept such physical source files; it doesn't need to be
>> the only supported format or even the default)
>> - The internal representation is a sequence of Unicode code points, but
>> the way it is represented or stored is not specified. This would still
>> allow implementations to store code points as \uxxxx if they so desired.
> Not directly in phase 1 if implementations have to deal with "\<what is a
> Unicode character today>" as if the \\ would not appear.
>> - The notion of universal character name is removed, the wording
>> would consistently refer to Unicode code points
>> - \u and \U sequences are redefined as escape sequences for string
>> and character literals.
>> - raw string literals would only require reverting line splitting
>> - The basic execution character sets (narrow and wide) are redefined
>> such that they don't depend on the definition of basic source character set
>> - but they remain unchanged
>> - The notion of basic source character set is removed
> Please address the display behaviour required for static_assert for ANSI
> escape sequences, etc.
> Please also address all the uses of this term in the library section.

There would be no change, although the basic source character set appears in
the library in a few places (it would be redefined as basic execution encoding).

>> - Source character set is redefined as being the Unicode character set
> It seems like we're encouraging homoglyph issues. Do we expect open
> source projects to maintain coding guidelines that restrict characters
> outside the ASCII range?

This change wouldn't modify the set of characters that can appear in a
source file.

>> - The grammar of identifier would be redefined in terms of
>> XID_Start + _ and XID_Continue, pending P1949 approval
>> The intent with these changes is to limit modifications to the behavior
>> or implementation of existing implementations; there is, however, one
>> breaking behavior change:
>> *Identifiers which contain \u or \U escape sequences would become
>> ill-formed since with these new rules \u and \U can only appear in string
>> and character literals.*
>> I suggest that either
>> - We make such identifiers ill-formed
>> - We make such identifiers deprecated.
>> The reason is that this feature is not well-motivated (such identifiers
>> exist for compatibility between files of different encodings which cannot
>> represent the same characters, but in practice they
>> can only be used on extern identifiers and identifiers declared in
>> modules or imported headers, as most implementations do not provide a
>> per-header encoding selection mechanism), and
>> is hard to use properly (such identifiers are indeed hardly readable).

> It seems that encoding the stuff outside the basic source character set as
> UCNs in headers is exactly how one would avoid per-header encoding
> selection given the practical reality.
> The practical reality being that most encodings that are intermixed have
> the same encoded value for most of the members of the basic source
> character set.
> Thus, we have the concept of a basic source character set (and we also
> have digraphs and C's iso646.h).
> Therefore, although not absolutely true, simply avoiding characters
> outside the basic source character set (and those requiring digraphs, etc.)
> is generally good enough for allowing headers to be included for
> compilation with source specified (via command line, etc.) as being in
> different encodings.

We could define a "portable subset". Interestingly, I don't think this is
currently the case?
As in, the current wording does not prevent a physical character set that
doesn't contain the letter "a", for example.
This change wouldn't modify the portability of headers.

>> The same result could be achieved with a reification operator such as
>> proposed by P1240, i.e.: [: "foo\u0300" :] = 42;
> This mitigation for the problem you identified is not guaranteed. Lacking
> such mitigation, developers would be forced by libraries to switch to
> Unicode source even if they do not wish to.
> (Okay, there is such a thing as compiler extensions for __asm__ names, but
> their usability is limited).

Agreed. We need to establish whether universal character names are actually
used in identifiers in production code.
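
One rough way to gather that data is to scan code bases for \u and \U
sequences that occur outside string literals, character literals, and
comments. The sketch below is only a heuristic under stated assumptions: the
function name count_ucns_outside_literals is hypothetical, and it does not
understand raw string literals, digit separators, or preprocessor directives,
so its hits would still need manual review.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical heuristic: count \u / \U sequences found outside string
// literals, character literals, and comments.
// Not handled: raw string literals, digit separators (1'000), line splices.
std::size_t count_ucns_outside_literals(std::string_view src) {
    enum class State { Code, String, Char, LineComment, BlockComment };
    State st = State::Code;
    std::size_t count = 0;
    for (std::size_t i = 0; i < src.size(); ++i) {
        const char c = src[i];
        const char next = i + 1 < src.size() ? src[i + 1] : '\0';
        switch (st) {
        case State::Code:
            if (c == '"') st = State::String;
            else if (c == '\'') st = State::Char;
            else if (c == '/' && next == '/') { st = State::LineComment; ++i; }
            else if (c == '/' && next == '*') { st = State::BlockComment; ++i; }
            else if (c == '\\' && (next == 'u' || next == 'U')) { ++count; ++i; }
            break;
        case State::String:
            if (c == '\\') ++i;  // skip the escaped character
            else if (c == '"') st = State::Code;
            break;
        case State::Char:
            if (c == '\\') ++i;  // skip the escaped character
            else if (c == '\'') st = State::Code;
            break;
        case State::LineComment:
            if (c == '\n') st = State::Code;
            break;
        case State::BlockComment:
            if (c == '*' && next == '/') { st = State::Code; ++i; }
            break;
        }
    }
    return count;
}
```

A non-zero count for a file suggests it spells an identifier with a universal
character name and would be affected by the proposed change.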

>> The hope is that these changes would make it less confusing for anyone
>> involved how lexing is performed.
>> I do expect this effort to be rather involved (and I am terrible at
>> wording).
>> What do you think?
>> Anyone willing to help over the next couple of years?
>> Cheers,
>> Corentin
Received on 2020-05-28 09:13:25