I'm working on a paper that switches C++ to a modified "model B" approach for
universal-character-names as described in the C99 Rationale v5.10, section 5.2.1.
I thought SG16 agreed a few meetings ago not to replace UCNs until phase 5. Did I completely misunderstand what SG16 agreed on?
There are some facts that are hard to reconcile in a nice model:
- Concatenation of string-literals might change the meaning of
numeric-escape-sequences, e.g. "\x5" "e" should not become "\x5e".
- In general, string-literals contain (Unicode) characters, but
a numeric-escape-sequence embodies a code unit (not a character).
- We can't translate some escape-sequences earlier and some
escape-sequences later, because "\\x5e" might turn into the
code unit 0x5e that way, but the four characters \x5e were intended.
- Not all string-literals should be transcoded to execution (literal)
encoding. For example, the argument to static_assert should not be
transcoded.
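
To make the first and third items concrete, here is a small sketch of what the current rules already require for the three spellings mentioned above. It assumes an ASCII-compatible ordinary literal encoding; the array names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Adjacent literals: \x5 is interpreted before concatenation, so the
// result is the code unit 0x05, the character 'e', and the terminator.
constexpr char concatenated[] = "\x5" "e";

// A single literal: \x5e is one numeric escape, i.e. one code unit 0x5e.
constexpr char single_escape[] = "\x5e";

// An escaped backslash followed by x5e: the four characters \ x 5 e.
constexpr char spelled_out[] = "\\x5e";

static_assert(sizeof(concatenated) == 3, "two elements plus terminator");
static_assert(concatenated[0] == 0x05 && concatenated[1] == 'e',
              "the escape must not absorb the 'e'");
static_assert(sizeof(single_escape) == 2, "one code unit plus terminator");
static_assert(single_escape[0] == 0x5e, "code unit 0x5e");
static_assert(sizeof(spelled_out) == 5, "four characters plus terminator");
static_assert(spelled_out[0] == '\\' && spelled_out[1] == 'x',
              "backslash and x stay ordinary characters");
```

Any model that moves escape processing around has to keep all three of these results unchanged.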
My current idea is to focus on the creation of the string literal
object; that's when transcoding to execution (literal) encoding
happens. All other uses of string-literals don't produce objects,
so aren't transcoded.
In order to be able to interpret escape-sequences in phase 5/6,
we need a "tunnel" for numeric-escape-sequences. One idea would
be to add "code unit characters" to the translation character set,
where each such character represents a code unit coming from a
numeric-escape-sequence. The sole purpose is to keep the
code units safe until we produce the initializer for the
string literal object.
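
A rough sketch of how such a tunnel might look, purely for illustration: the names Element, interpret, append_utf8 and make_initializer are hypothetical, only \\ and \x escapes are handled, and UTF-8 is assumed as the execution (literal) encoding. It is not proposed wording or any compiler's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical element of a string-literal after escape processing:
// either a Unicode character or a tunneled "code unit character".
struct Element {
    bool is_code_unit;    // true: value came from a numeric escape
    std::uint32_t value;  // code point, or raw code unit value
};

int hex_value(char32_t c) {
    if (c >= U'0' && c <= U'9') return static_cast<int>(c - U'0');
    if (c >= U'a' && c <= U'f') return static_cast<int>(c - U'a') + 10;
    if (c >= U'A' && c <= U'F') return static_cast<int>(c - U'A') + 10;
    return -1;
}

// Interpret the spelling of one literal body (without the quotes).
std::vector<Element> interpret(const std::u32string& spelling) {
    std::vector<Element> out;
    for (std::size_t i = 0; i < spelling.size(); ++i) {
        const char32_t c = spelling[i];
        if (c == U'\\' && i + 1 < spelling.size()) {
            const char32_t next = spelling[i + 1];
            if (next == U'\\') {  // simple escape: one backslash character
                out.push_back({false, U'\\'});
                ++i;
                continue;
            }
            if (next == U'x') {   // numeric escape: one tunneled code unit
                std::uint32_t unit = 0;
                std::size_t j = i + 2;
                while (j < spelling.size() && hex_value(spelling[j]) >= 0) {
                    unit = unit * 16 +
                           static_cast<std::uint32_t>(hex_value(spelling[j]));
                    ++j;
                }
                out.push_back({true, unit});
                i = j - 1;
                continue;
            }
        }
        out.push_back({false, static_cast<std::uint32_t>(c)});
    }
    return out;
}

// Stand-in for transcoding a character to the execution encoding (UTF-8).
void append_utf8(std::string& bytes, std::uint32_t cp) {
    if (cp < 0x80) {
        bytes.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        bytes.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        bytes.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        bytes.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        bytes.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        bytes.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        bytes.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        bytes.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        bytes.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        bytes.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Produce the initializer for the string literal object: characters are
// transcoded, tunneled code units pass through untouched.
std::string make_initializer(const std::vector<std::u32string>& literals) {
    std::string bytes;
    for (const std::u32string& spelling : literals) {
        for (const Element& e : interpret(spelling)) {
            if (e.is_code_unit) bytes.push_back(static_cast<char>(e.value));
            else append_utf8(bytes, e.value);
        }
    }
    return bytes;
}
```

With this sketch, the spelling \xE9 yields the single code unit 0xE9, while the character U+00E9 is transcoded to the two UTF-8 code units 0xC3 0xA9. That difference is exactly what the tunnel has to preserve.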
The alternative would be to delay all interpretation of escape-
sequences to when we produce the initializer for the string
literal object, but that also means we need to delay string
literal concatenation until that time (see first item above).
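
As a minimal sketch of why the ordering matters in the delayed model, assume each literal keeps its own spelling until the initializer is produced. The function names are hypothetical, the spellings are assumed to be ASCII, and only \x escapes are handled (so "\\x5e" is deliberately out of scope here).

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: interpret \x escapes in one spelling; every other
// character is copied unchanged.
std::string interpret_hex_escapes(const std::string& spelling) {
    std::string out;
    for (std::size_t i = 0; i < spelling.size(); ++i) {
        if (spelling[i] == '\\' && i + 1 < spelling.size() &&
            spelling[i + 1] == 'x') {
            unsigned unit = 0;
            std::size_t j = i + 2;
            while (j < spelling.size() &&
                   std::isxdigit(static_cast<unsigned char>(spelling[j]))) {
                const unsigned char d = static_cast<unsigned char>(spelling[j]);
                const int digit = std::isdigit(d) ? d - '0'
                                                  : std::tolower(d) - 'a' + 10;
                unit = unit * 16 + static_cast<unsigned>(digit);
                ++j;
            }
            out.push_back(static_cast<char>(unit));
            i = j - 1;  // continue after the last hex digit
        } else {
            out.push_back(spelling[i]);
        }
    }
    return out;
}

// Delayed model: interpret each literal on its own, then concatenate.
std::string interpret_then_concatenate(const std::vector<std::string>& spellings) {
    std::string out;
    for (const std::string& s : spellings) out += interpret_hex_escapes(s);
    return out;
}

// What has to be avoided: joining the spellings first lets an escape
// absorb characters from the following literal.
std::string concatenate_then_interpret(const std::vector<std::string>& spellings) {
    std::string joined;
    for (const std::string& s : spellings) joined += s;
    return interpret_hex_escapes(joined);
}
```

For the spellings \x5 and e, the first function yields the code unit 0x05 followed by 'e', whereas the second merges them into the single code unit 0x5e.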
Would that cause any issue? This would otherwise be my preferred solution!