Date: Thu, 28 May 2020 10:04:23 +0200
Hello.
Following some Twitter discussions with Tom, Alisdair and Steve, I would
like to propose that lexing should be redefined in terms of Unicode.
This would be mostly a wording change with limited effect on
implementations and existing code.
Current state of affairs:
- Any character not in the basic source character set is converted to a
universal character name \uxxxx, whose values map 1-1 to Unicode code points
- The execution character set is defined in terms of the basic source
character set
- \u and \U sequences can appear in identifiers and strings
- \u and \U sequences are reverted in raw string literals.
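To illustrate, a small sketch of the current behavior (assuming a UTF-8
encoded source file and a literal encoding that can represent é; the names
are made up):

    const char* a = "caf\u00e9";  // \u00e9 designates U+00E9 (é)
    const char* b = "café";       // the é is converted to \u00e9 in phase 1,
                                  // so a and b spell the same string
    const char* r = R"(café)";    // in a raw string literal the phase-1
                                  // conversion is reverted, so r contains é
                                  // rather than the six characters \u00e9
    int caf\u00e9 = 0;            // UCNs may appear in identifiers today;
                                  // this declares the same identifier as café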
Proposed, in broad strokes:
- In phase 1, abstract physical characters are mapped 1-1 to a sequence
of Unicode code points that represent these characters, such that the
internal representation and the physical source represent the same sequence
of abstract characters. This tightens what transformations implementers can
do in phase 1.
- Additionally in phase 1, we want to mandate that compilers support
source files that are UTF-8 encoded (i.e. there must exist some mechanism
for the compiler to accept such physical source files; it doesn't need to
be the only supported format or even the default)
- The internal representation is a sequence of Unicode code points, but
the way it is represented or stored is not specified. This would still
allow implementations to store code points as \uxxxx if they so desired.
- The notion of universal character name is removed; the wording would
consistently refer to Unicode code points
- \u and \U sequences are redefined as escape sequences for string and
character literals.
- Raw string literals would only require reverting line splicing
- The basic execution character sets (narrow and wide) are redefined
such that they don't depend on the definition of the basic source character
set, but they otherwise remain unchanged
- The notion of basic source character set is removed
- The source character set is redefined as the Unicode character set
- The grammar of identifiers would be redefined in terms of XID_Start + _
and XID_Continue, pending P1949 approval (a sketch follows below)
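As a rough sketch of what the last point would mean for identifiers,
assuming P1949's UAX #31 based rules are adopted (the names are made up):

    int _total = 0;   // OK: _ is allowed in addition to XID_Start
    int café   = 0;   // OK: é has the XID_Start and XID_Continue properties
    int x2     = 0;   // OK: digits have XID_Continue
    // int 🙂 = 0;    // ill-formed: emoji have neither property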
The intent with these changes is to limit modifications to the behavior or
implementation of existing implementations; there is, however, one breaking
behavior change:
*Identifiers which contain \u or \U escape sequences would become
ill-formed, since under these new rules \u and \U can only appear in string
and character literals.*
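A minimal sketch of the difference (again, the names are only illustrative):

    const char* s = "caf\u00e9";  // unaffected: \u00e9 remains a valid escape
                                  // sequence in string and character literals
    int caf\u00e9 = 0;            // well-formed today (it names café), but
                                  // ill-formed under this proposal, as \u
                                  // could no longer appear in an identifier
    int été = 1;                  // characters written directly in the
                                  // source remain valid either way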
I suggest that either:
- We make such identifiers ill-formed, or
- We make such identifiers deprecated.
The reason is that this feature is not well-motivated (such identifiers
exist for compatibility between files of different encodings which cannot
represent the same characters, but in practice they can only be used for
extern identifiers and identifiers declared in modules or imported headers,
as most implementations do not provide a per-header encoding selection
mechanism), and it is hard to use properly (such identifiers are indeed
hardly readable).
The same result could be achieved with a reification operator such as the
one proposed by P1240, e.g.: [: "foo\u0300" :] = 42;
The hope is that these changes would make how lexing is performed less
confusing for anyone involved.
I do expect this effort to be rather involved (and I am terrible at
wording).
What do you think?
Anyone willing to help over the next couple of years?
Cheers,
Corentin