sg16: [SG16] Handling literals throughout the translation phases

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Thu, 17 Dec 2020 22:33:12 +0100

I'm working on a paper that switches C++ to a modified "model B" approach for
universal-character-names as described in the C99 Rationale v5.10, section 5.2.1.

There are some facts that are hard to reconcile in a nice model:

- Concatenation of string-literals might change the meaning of
numeric-escape-sequences, e.g. "\x5" "e" should not become "\x5e".

- In general, string-literals contain (Unicode) characters, but
a numeric-escape-sequence embodies a code unit (not a character).

- We can't translate some escape-sequences earlier and others
later: if the \\ in "\\x5e" is replaced by a backslash first, a
later pass would read the result as the code unit 0x5e, but the
four characters \x5e were actually intended.

- Not all string-literals should be transcoded to execution (literal)
encoding. For example, the argument to static_assert should not be
so treated.
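
To make the first and third items concrete, here is a small check
of the behavior that any model has to preserve. It is only a sketch:
the array names are made up for illustration, and the comment about
'^' assumes an ASCII-compatible ordinary literal encoding.

```cpp
// Each numeric-escape-sequence is delimited within its own string-literal,
// so concatenation must not extend it with digits from the next literal.
constexpr char concatenated[] = "\x5" "e";  // code unit 0x05, then 'e'
constexpr char single_escape[] = "\x5e";    // one code unit 0x5e ('^' in ASCII)
constexpr char four_chars[] = "\\x5e";      // the characters \ x 5 e

static_assert(sizeof(concatenated) == 3, "0x05, 'e', terminating null");
static_assert(concatenated[0] == 0x05, "escape ends at the closing quote");
static_assert(concatenated[1] == 'e', "'e' stays an ordinary character");

static_assert(sizeof(single_escape) == 2, "0x5e, terminating null");
static_assert(single_escape[0] == 0x5e, "single numeric escape");

static_assert(sizeof(four_chars) == 5, "four characters, terminating null");
static_assert(four_chars[0] == '\\', "escaped backslash is one character");
static_assert(four_chars[1] == 'x', "no numeric escape is formed");
```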

My current idea is to focus on the creation of the string literal
object; that's when transcoding to execution (literal) encoding
happens. All other uses of string-literals don't produce objects,
so aren't transcoded.

In order to be able to interpret escape-sequences in phase 5/6,
we need a "tunnel" for numeric-escape-sequences. One idea would
be to add "code unit characters" to the translation character set,
where each such character represents a code unit coming from a
numeric-escape-sequence. The sole purpose is to keep the
code units safe until we produce the initializer for the
string literal object.
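
A rough sketch of how such a tunnel might look, not proposed wording.
Everything here is hypothetical: the Element type, the function names,
the restriction to \\ and \x escapes, and the assumption that the
ordinary literal encoding is UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical element of the extended translation character set.
struct Element {
    bool is_code_unit;  // true: a "code unit character" from a numeric escape
    char32_t value;     // Unicode scalar value, or the raw code unit
};

// Value of a hexadecimal digit, or -1 if c is not one.
int hex_value(char32_t c) {
    if (c >= U'0' && c <= U'9') return static_cast<int>(c - U'0');
    if (c >= U'a' && c <= U'f') return static_cast<int>(c - U'a') + 10;
    if (c >= U'A' && c <= U'F') return static_cast<int>(c - U'A') + 10;
    return -1;
}

// Interpret the escapes of ONE literal body (the text between the quotes).
// Only \\ and \x are modeled; other escapes are outside this sketch.
std::vector<Element> interpret(std::u32string_view body) {
    std::vector<Element> out;
    for (std::size_t i = 0; i < body.size(); ++i) {
        if (body[i] != U'\\') {
            out.push_back({false, body[i]});
            continue;
        }
        ++i;  // skip the backslash
        if (i >= body.size()) break;  // ill-formed input, ignored here
        if (body[i] == U'x') {
            char32_t unit = 0;
            while (i + 1 < body.size() && hex_value(body[i + 1]) >= 0) {
                unit = unit * 16 + static_cast<char32_t>(hex_value(body[i + 1]));
                ++i;
            }
            out.push_back({true, unit});  // tunneled, not a character
        } else {
            out.push_back({false, body[i]});  // e.g. \\ yields a backslash
        }
    }
    return out;
}

// Concatenation is a plain append; code unit characters are already opaque.
std::vector<Element> concatenate(std::vector<Element> a,
                                 const std::vector<Element>& b) {
    a.insert(a.end(), b.begin(), b.end());
    return a;
}

// Append the UTF-8 encoding of the scalar value c.
void append_utf8(std::string& out, char32_t c) {
    if (c < 0x80) {
        out.push_back(static_cast<char>(c));
    } else if (c < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (c >> 6)));
        out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    } else if (c < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (c >> 12)));
        out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (c >> 18)));
        out.push_back(static_cast<char>(0x80 | ((c >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
}

// Produce the initializer for the string literal object: characters are
// transcoded, code unit characters pass through unchanged.
std::string make_initializer(const std::vector<Element>& elements) {
    std::string out;
    for (const Element& e : elements) {
        if (e.is_code_unit) out.push_back(static_cast<char>(e.value));
        else append_utf8(out, e.value);
    }
    return out;
}
```

In this sketch, the source text \xe9 ends up as the single code unit
0xE9, while the character U+00E9 is transcoded to the two UTF-8 code
units 0xC3 0xA9; transcoding happens only in make_initializer.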

The alternative would be to delay all interpretation of escape-
sequences to when we produce the initializer for the string
literal object, but that also means we need to delay string
literal concatenation until that time (see first item above).
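
To illustrate why the ordering matters in this alternative, here is a
toy model that keeps the raw spellings around. All names are
hypothetical, and only \\ and \x escapes are modeled.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Value of a hexadecimal digit, or -1 if c is not one.
int hex_digit(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Interpret \\ and \x escapes in a raw spelling (text between the quotes).
std::string interpret_raw(std::string_view raw) {
    std::string out;
    for (std::size_t i = 0; i < raw.size(); ++i) {
        if (raw[i] != '\\' || i + 1 == raw.size()) {
            out.push_back(raw[i]);
            continue;
        }
        ++i;  // skip the backslash
        if (raw[i] == 'x') {
            unsigned unit = 0;
            while (i + 1 < raw.size() && hex_digit(raw[i + 1]) >= 0) {
                ++i;
                unit = unit * 16 + static_cast<unsigned>(hex_digit(raw[i]));
            }
            out.push_back(static_cast<char>(unit));
        } else {
            out.push_back(raw[i]);  // e.g. \\ yields a backslash
        }
    }
    return out;
}

// Delayed model: interpret each literal on its own, then concatenate.
std::string interpret_then_join(const std::vector<std::string_view>& pieces) {
    std::string out;
    for (std::string_view p : pieces) out += interpret_raw(p);
    return out;
}

// Naive model: concatenate the spellings first, then interpret.
// This lets an escape absorb digits from the following literal.
std::string join_then_interpret(const std::vector<std::string_view>& pieces) {
    std::string joined;
    for (std::string_view p : pieces) joined.append(p.data(), p.size());
    return interpret_raw(joined);
}
```

For the spellings \x5 and e, interpret_then_join yields the two code
units 0x05 and 'e', whereas join_then_interpret yields the single code
unit 0x5e. So the literal boundaries have to survive until the
escape-sequences are interpreted.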

Ideas? Opinions?

Jens

Received on 2020-12-17 15:33:16