[SG16] "characters" during lexing

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Thu, 11 Mar 2021 08:35:32 +0100
Hi,

We have discussed the use of the term "character" during lexing,
and its replacement with "UCS scalar value".

Let's look at [lex.pptoken] p3.2:

— Otherwise, if the next three characters are <:: and the subsequent character is neither : nor >, the <
is treated as a preprocessing token by itself and not as the first character of the alternative token <:.

Under the "translation character set" formulation, there are two sub-approaches:
 (1) make unassigned UCS scalar values "characters" of some sort
 (2) don't claim those are "characters", but just some alien matter (yet still set elements)

Using (2), the text above becomes wrong:
The phrasing "if the next three characters are..." turns from descriptive to restrictive.

Consider a sequence
  LESS-THAN COLON <unassigned code point> COLON

Under (1), it's clear that the condition is not matched, since
LESS-THAN COLON <unassigned code point> does not match <:: .
Under (2), the next three "characters" in the sequence are
LESS-THAN COLON COLON, i.e. we ignore the unassigned code point,
and the condition is satisfied, which would be a change of meaning.
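
To make the difference concrete, here is a rough C++ sketch of the two
readings. It is not taken from any implementation. The function names and
the Unassigned predicate are hypothetical, and treating end of input as
"neither : nor >" is an assumption made only for this illustration.

```cpp
#include <functional>
#include <string>
#include <string_view>

// Hypothetical predicate: true if no named character is assigned to c.
using Unassigned = std::function<bool(char32_t)>;

// Reading (1): every element of the input counts as a "character".
// Returns true if the leading < is a preprocessing token by itself.
bool less_is_own_token_all_elements(std::u32string_view in) {
    if (in.size() < 3 || in.substr(0, 3) != std::u32string_view(U"<::"))
        return false;
    // Assumption: end of input counts as "neither : nor >".
    return in.size() == 3 || (in[3] != U':' && in[3] != U'>');
}

// Reading (2): unassigned values are not "characters", so they are
// skipped before the same test is applied to what remains.
bool less_is_own_token_skipping(std::u32string_view in,
                                const Unassigned& unassigned) {
    std::u32string filtered;
    for (char32_t c : in)
        if (!unassigned(c)) filtered.push_back(c);
    return less_is_own_token_all_elements(filtered);
}
```

For the sequence above, with a predicate that flags the third element as
unassigned, the first function reports no match and the second reports a
match, which is the change of meaning in question.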

I've thus reverted to the previous definition which makes
all elements of the translation character set "characters":

The translation character set consists of the following elements:
 - each character named by ISO/IEC 10646, as identified by its unique UCS scalar value, and
 - a distinct character for each UCS scalar value where no named character is assigned.


The same consideration applies when using "UCS scalar value" as the core term,
so it's mandatory that we massage the quoted text to keep its meaning, maybe to
something like this:

 - Otherwise, if the next three UCS scalar values are <:: and the subsequent UCS scalar value is neither : nor >, the <
is treated as a preprocessing token by itself and not as the first character of the alternative token <:.

(The last "character" here can stay, because it refers to < only.)

Jens

Received on 2021-03-11 01:35:38