Date: Fri, 18 Dec 2020 13:53:35 +0100
On 18/12/2020 02.51, Hubert Tong wrote:
> On Thu, Dec 17, 2020 at 4:33 PM Jens Maurer via SG16 <sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
>
>
> I'm working on a paper that switches C++ to a modified "model B" approach for
> universal-character-names as described in the C99 Rationale v5.10, section 5.2.1.
>
> There are some facts that are hard to reconcile in a nice model:
>
> - Concatenation of string-literals might change the meaning of
> numeric-escape-sequences, e.g. "\x5" "e" should not become "\x5e".
>
> - In general, string-literals contain (Unicode) characters, but
> a numeric-escape-sequence embodies a code unit (not a character).
>
> - We can't translate some escape-sequences earlier and some
> escape-sequences later, because "\\x5e" might turn into the
> code unit 0x5e that way, but the four characters \x5e were
> actually intended.
>
> - Not all string-literals should be transcoded to execution (literal)
> encoding. For example, the argument to static_assert should not be
> so treated.
>
> I guess this is also the case for strings meant for compiler extensions (like extended asm syntax).
Yes, it also applies to an /asm-declaration/, whether extended or not.
> My current idea is to focus on the creation of the string literal
> object; that's when transcoding to execution (literal) encoding
> happens. All other uses of string-literals don't produce objects,
> so aren't transcoded.
>
> I'm not sure there's a real use of a string literal object here:
> extern "\x43" "++" {}
>
> but various compilers accept the code.
I don't think supporting numeric-escape-sequences in string-literals
that do not end up in the executable is reasonable,
and maybe we can be backward-incompatible here.
(We're making an assumption about the compiler-internal encoding
of "C" here, it seems, which is at least non-portable if my
compiler uses EBCDIC as its internal encoding.)
> In order to be able to interpret escape-sequences in phase 5/6,
> we need a "tunnel" for numeric-escape-sequences. One idea would
> be to add "code unit characters" to the translation character set,
> where each such character represents a code unit coming from a
> numeric-escape-sequence. The sole purpose is to keep the
> code units safe until we produce the initializer for the
> string literal object.
>
> The alternative would be to delay all interpretation of escape-
> sequences to when we produce the initializer for the string
> literal object, but that also means we need to delay string
> literal concatenation until that time (see first item above).
>
> Delaying string literal concatenation introduces knock-on effects:
> int operator "" "" _hello(const char *);
This can be worked around by having a string-literal-expression
(in phase 7) everywhere we nowadays have a string-literal.
Oh well.
> So, keeping code units safe until we need to know the contents of the string for some reason or another sounds like a good direction.
Jens
Received on 2020-12-18 06:53:41