On Thu, Dec 17, 2020 at 4:33 PM Jens Maurer via SG16 <sg16@lists.isocpp.org> wrote:

I'm working on a paper that switches C++ to a modified "model B" approach for
universal-character-names as described in the C99 Rationale v5.10, section 5.2.1.

There are some facts that are hard to reconcile in a nice model:

 - Concatenation of string-literals might change the meaning of
numeric-escape-sequences, e.g.  "\x5" "e"  should not become "\x5e".

 - In general, string-literals contain (Unicode) characters, but
a numeric-escape-sequence embodies a code unit (not a character).

 - We can't translate some escape-sequences earlier and some
escape-sequences later, because  "\\x5e"  might turn into the
code unit 0x5e that way, but the four characters \x5e were
actually intended.

 - Not all string-literals should be transcoded to execution (literal)
encoding.  For example, the argument to static_assert should not be
so treated.

I guess this is also the case for strings meant for compiler extensions (such as extended asm syntax).
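
For reference, the first and third points above can be checked with a few static_asserts. This is only a minimal sketch: it assumes an ASCII-compatible ordinary literal encoding and an 8-bit char, and the array names are made up for illustration.

```cpp
// Escape-sequences are interpreted per literal, before concatenation,
// so "\x5" "e" is the code unit 0x05 followed by 'e', not the code unit 0x5e.
constexpr char joined[] = "\x5" "e";
static_assert(sizeof(joined) == 3, "0x05, 'e', and the terminating null");
static_assert(joined[0] == 0x05 && joined[1] == 'e', "not merged into \\x5e");

// A single literal spelled "\x5e" is one code unit.
constexpr char single_unit[] = "\x5e";
static_assert(sizeof(single_unit) == 2, "0x5e and the terminating null");
static_assert(single_unit[0] == 0x5e, "one code unit");

// An escaped backslash followed by x5e is four characters, not a code unit.
constexpr char four_chars[] = "\\x5e";
static_assert(sizeof(four_chars) == 5, "four characters and the terminating null");
static_assert(four_chars[0] == '\\' && four_chars[1] == 'x' &&
              four_chars[2] == '5' && four_chars[3] == 'e',
              "backslash, x, 5, e");
```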
 


My current idea is to focus on the creation of the string literal
object; that's when transcoding to execution (literal) encoding
happens. All other uses of string-literals don't produce objects,
so aren't transcoded.

I'm not sure a string literal object is really used here:

extern "\x43" "++" {}

but various compilers accept the code.


In order to be able to interpret escape-sequences in phase 5/6,
we need a "tunnel" for numeric-escape-sequences.  One idea would
be to add "code unit characters" to the translation character set,
where each such character represents a code unit coming from a
numeric-escape-sequence.  The sole purpose is to keep the
code units safe until we produce the initializer for the
string literal object.
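
To make the "tunnel" idea concrete, here is a rough sketch of how an implementation might model it. Everything in it is hypothetical (the Element struct, interpret, encode, translate); it handles only \x escapes and simple escapes that map to themselves, assumes ASCII-only literal bodies, and assumes UTF-8 with 8-bit code units as the execution (literal) encoding. It is not meant as proposed wording or as any compiler's actual implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical element of the translation character set after escape
// interpretation: either a Unicode scalar value or a tunneled code unit.
struct Element {
    bool is_code_unit;    // true if it came from a numeric-escape-sequence
    std::uint32_t value;  // code unit value, or Unicode scalar value
};

// Returns the value of a hexadecimal digit, or -1 if c is not one.
int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Interprets the escape-sequences of ONE literal body (the text between
// the quotes). Assumption: ASCII input; only \xN... and self-mapping
// simple escapes such as \\ and \" are handled.
std::vector<Element> interpret(const std::string& body) {
    std::vector<Element> out;
    for (std::size_t i = 0; i < body.size(); ++i) {
        if (body[i] != '\\' || i + 1 >= body.size()) {
            out.push_back({false, static_cast<unsigned char>(body[i])});
            continue;
        }
        ++i;  // skip the backslash
        if (body[i] == 'x') {
            std::uint32_t v = 0;
            while (i + 1 < body.size() && hex_value(body[i + 1]) >= 0) {
                ++i;
                v = v * 16 + static_cast<std::uint32_t>(hex_value(body[i]));
            }
            out.push_back({true, v});  // tunneled, never transcoded
        } else {
            out.push_back({false, static_cast<unsigned char>(body[i])});
        }
    }
    return out;
}

// Produces the initializer bytes: characters are encoded as UTF-8,
// tunneled code units are copied through (truncated to 8 bits).
std::string encode(const std::vector<Element>& elems) {
    std::string out;
    for (const Element& e : elems) {
        const std::uint32_t c = e.value;
        if (e.is_code_unit || c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else if (c < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        } else if (c < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (c >> 12)));
            out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (c >> 18)));
            out.push_back(static_cast<char>(0x80 | ((c >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}

// Interprets each adjacent literal separately, concatenates the element
// sequences, and encodes only when the object's initializer is needed.
std::string translate(const std::vector<std::string>& bodies) {
    std::vector<Element> all;
    for (const std::string& b : bodies) {
        const std::vector<Element> part = interpret(b);
        all.insert(all.end(), part.begin(), part.end());
    }
    return encode(all);
}
```

With this model, translate({"\\x5", "e"}) yields the two code units 0x05 and 0x65, because the escape was already resolved to an element before the adjacent literal was appended, and a body spelled \\x5e stays four characters.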

The alternative would be to delay all interpretation of escape-
sequences to when we produce the initializer for the string
literal object, but that also means we need to delay string
literal concatenation until that time (see first item above).

Delaying string literal concatenation introduces knock-on effects:

int operator "" "" _hello(const char *);

So keeping code units safe until something actually needs the contents of the string sounds like a good direction.
 

Ideas? Opinions?

Jens
--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16