Re: [SG16] Redefining Lexing in terms of Unicode

From: Steve Downey <sdowney_at_[hidden]>
Date: Thu, 28 May 2020 23:23:01 -0400
This is the main reason to fix the lexer/grammar — they no longer
correspond closely enough to the physical systems to be useful models.

The 'basic source character set' is an entirely abstract set of symbols,
corresponding to nothing in either actual source or actual compiled
translation units. The 'execution character sets' are similar. They are
symbols and may or may not be encodable into the resulting program. There
are good ways of discussing these problems today, but not in the language
of the standard as it is.

Whether '@' can be represented in a character literal of some kind is not
the same question as whether it is available to name an operator. Conflating
those questions isn't useful.
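
To make the two questions concrete, here is a minimal sketch, assuming a
C++17 or C++20 compiler whose execution character set can represent '@'
(true of the mainstream ASCII-based implementations). The names `at_sign`,
`address`, `find_at`, and `self_check` are hypothetical and exist only for
this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// '@' is not a member of the C++20 basic source character set, yet it is
// accepted inside character and string literals on typical implementations.
constexpr char at_sign = '@';
constexpr std::string_view address = "user@example.org";

// Hypothetical helper: index of the first '@' in text, or npos if absent.
constexpr std::size_t find_at(std::string_view text) {
    return text.find(at_sign);
}

// Question 1: representable in a literal? Yes, checked at compile time.
static_assert(find_at(address) == 4);

// Question 2: usable as an operator? No. There is no operator or punctuator
// spelled '@', so uncommenting the next line makes the program ill-formed.
// int broken = 1 @ 2;

// Runtime check of the "absent" case, for completeness.
inline void self_check() {
    assert(find_at("no such character") == std::string_view::npos);
}
```

The literal compiles while the commented-out expression would not, so being
representable in a literal says nothing about being usable in the grammar.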

The cost of understanding the C++ abstract lexer is large, and is of little
value outside of modifying it, as actual compilers have since "as-if"ed
around it. It's an ongoing cost. Technical debt.

To answer Jens's implied question from upstream:
"While the use of universal-character-names might appear
a bit baroque to Unicode-oriented people, it seems to work
nicely."

Unicode-oriented people are essentially everyone today. No one who works
with text today, including people who work on computer language text
outside the C and C++ standards, expects or will deal with anything else.
Discussing decoding actual source files into 'source character set' and
encoding into 'execution character set' values is a lossy translation
hindering actual discussion of what C++ implementations do.

On Thu, May 28, 2020 at 7:42 PM Corentin Jabot <corentinjabot_at_[hidden]>
wrote:

> On Fri, May 29, 2020, 01:03 Steve Downey via SG16 <sg16_at_[hidden]>
> wrote:
>> I don't think it would change the difficulty of adding new characters for
>> keywords and operators. It would be no harder than adding @ to the basic
>> source character set.
> +1.
> It is currently assumed, but not stated that the characters that are in
> the basic character set can be used in arbitrary grammar elements.
> This doesn't have to be the case.
> For example adding "@" to the basic character set would not mean we could
> use it as an operator.
> Nor apparently that it would have to be representable in the physical
> character set. But it would have to be representable in the execution
> character set...
> (At least what I propose would resolve that last point.)
>> On Thu, May 28, 2020, 18:40 Thiago Macieira via SG16 <
>> sg16_at_[hidden]> wrote:
>>> On Thursday, 28 May 2020 01:04:23 PDT Corentin via SG16 wrote:
>>> > - The notion of basic source character set is removed
>>> Does that mean @ can be used as a new operator?
>>> --
>>> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
>>> Software Architect - Intel System Software Products
>>> --
>>> SG16 mailing list
>>> SG16_at_[hidden]
>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2020-05-28 22:26:20