sg16: [SG16] New core issue: [lex.phases]: The order of translation phases 5 and 6 contradicts the [lex.string] specification for concatenation of string literals

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 2 Jul 2020 17:22:26 -0400

This issue was originally reported in
https://lists.isocpp.org/core/2019/03/5770.php, but it looks like a core
issue was never recorded. This request obsoletes the previous one.

(Mike, could we please, pretty please with a cup of moonshine and a
plate of chocolate truffles, get a new public-facing publication of the
active issues list? At present, we can't rely on https://wg21.link
directing users to a paper with recently created issues or up-to-date
status information.)

[lex.phases] <http://eel.is/c++draft/lex.phases> states that string
literals are converted to the execution character set prior to being
concatenated:

> 5. Each basic source character set member in a /character-literal/ or a
> /string-literal/, as well as each escape sequence and
> /universal-character-name/ in a /character-literal/ or a non-raw
> string literal, is converted to the corresponding member of the
> execution character set ([lex.ccon] <http://eel.is/c++draft/lex.ccon>,
> [lex.string] <http://eel.is/c++draft/lex.string>); if there is no
> corresponding member, it is converted to an implementation-defined
> member other than the null (wide) character.
> 6. Adjacent string literal tokens are concatenated.

However, [lex.string]p11 <http://eel.is/c++draft/lex.string#11>
specifies that encoding conversion is dependent on the presence of
/encoding-prefix/es on the set of string literals that will be concatenated:

> 11 In translation phase 6 ([lex.phases]
> <http://eel.is/c++draft/lex.phases>), adjacent /string-literal/s are
> concatenated. If both /string-literal/s have the same
> /encoding-prefix/, the resulting concatenated /string-literal/ has
> that /encoding-prefix/. If one /string-literal/ has no
> /encoding-prefix/, it is treated as a /string-literal/ of the same
> /encoding-prefix/ as the other operand. If a UTF-8 string literal
> token is adjacent to a wide string literal token, the program is
> ill-formed. Any other concatenations are conditionally-supported with
> implementation-defined behavior. [/Note:/ This concatenation is an
> interpretation, not a conversion. Because the interpretation happens
> in translation phase 6 (after each character from a string-literal has
> been translated into a value from the appropriate character set), a
> /string-literal/'s initial rawness has no effect on the interpretation
> or well-formedness of the concatenation. — /end note/ ] Table 11 has
> some examples of valid concatenations.
>
> Table 11 <http://eel.is/c++draft/lex.string#tab:lex.string.concat>:
> String literal concatenations [tab:lex.string.concat]
>
> Source       Means
> u"a" u"b"    u"ab"
> u"a" "b"     u"ab"
> "a" u"b"     u"ab"
> U"a" U"b"    U"ab"
> U"a" "b"     U"ab"
> "a" U"b"     U"ab"
> L"a" L"b"    L"ab"
> L"a" "b"     L"ab"
> "a" L"b"     L"ab"
>
> Characters in concatenated strings are kept distinct.
>
> [/Example:/
>
> "\xA" "B"
>
> contains the two characters '\xA' and 'B' after concatenation (and not
> the single hexadecimal character '\xAB'). — /end example/ ]

The intent expressed by the prose and the note seems to be that string
literals with different /encoding-prefix/es be separately converted and
then joined together, such that the resulting string literal potentially
consists of code unit sequences corresponding to different character
encodings. However, that conflicts with the intent expressed by the
table that specifies that, for example, `u"a" "b"` means the same as
`u"ab"`.
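
The reading expressed by the table can be written down as compile-time
checks. The following is only a sketch: same_units is a helper name
invented here for illustration, and the checks restate the rows of
Table 11 rather than anything normative.

```cpp
#include <cstddef>

// Hypothetical helper (not from the standard or any library): compares
// two string literals code unit by code unit, including the terminator.
template <typename CharT, std::size_t N, std::size_t M>
constexpr bool same_units(const CharT (&a)[N], const CharT (&b)[M]) {
    if (N != M) return false;
    for (std::size_t i = 0; i < N; ++i)
        if (a[i] != b[i]) return false;
    return true;
}

// One check per row of Table 11: an unprefixed literal takes on the
// encoding-prefix of its neighbor.
static_assert(same_units(u"a" u"b", u"ab"), "row 1");
static_assert(same_units(u"a" "b", u"ab"), "row 2");
static_assert(same_units("a" u"b", u"ab"), "row 3");
static_assert(same_units(U"a" U"b", U"ab"), "row 4");
static_assert(same_units(U"a" "b", U"ab"), "row 5");
static_assert(same_units("a" U"b", U"ab"), "row 6");
static_assert(same_units(L"a" L"b", L"ab"), "row 7");
static_assert(same_units(L"a" "b", L"ab"), "row 8");
static_assert(same_units("a" L"b", L"ab"), "row 9");
```

Because the helper deduces a single CharT for both arguments, each
check also confirms that the concatenated literal has the character
type selected by the prefixed component.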

There is implementation divergence. GCC and Clang implement the intent
expressed in the table, but Visual C++ implements ... something else.

The difference is illustrated at https://msvc.godbolt.org/z/Dcrgda using
the code below:

> const char8_t* u8_1 = "" u8"\u0102";
> const char8_t* u8_2 = u8"" "\u0102";
> const char8_t* u8_3 = u8"" u8"\u0102";
>
> const char16_t* u16_1 = "" u"\u0102";
> const char16_t* u16_2 = u"" "\u0102";
> const char16_t* u16_3 = u"" u"\u0102";
>
> const char32_t* u32_1 = "" U"\u0102";
> const char32_t* u32_2 = U"" "\u0102";
> const char32_t* u32_3 = U"" U"\u0102";
>
> const wchar_t* w_1 = "" L"\u0102";
> const wchar_t* w_2 = L"" "\u0102";
> const wchar_t* w_3 = L"" L"\u0102";

GCC and Clang produce the same sequence of code units for all three
variants of each /encoding-prefix/, encoded according to that
/encoding-prefix/. Visual C++ produces the same code unit sequence for
the *_1 and *_3 variants, but not for the *_2 variants. For the latter,
the second string literal component is first encoded to the execution
character set (substituting a '?' character if the character is not
representable in that character set), and the resulting code unit
sequence is then re-encoded (interpreting it according to the current
locale the compiler is operating in) using the encoding indicated by the
/encoding-prefix/ of the other string literal component. The double
encoding behavior of the Visual C++ compiler, and in particular the use
of the wrong source encoding for the second conversion, is presumably
unintended.
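
The GCC and Clang behavior described above can be stated as
compile-time checks. This is a sketch under assumptions: first_unit and
unit_count are helper names invented here, only the char16_t and
char32_t cases are covered, and, going by the description above, the
checks on the *_2 form should not be expected to hold with Visual C++.

```cpp
#include <cstddef>

// Hypothetical helper: first code unit of a literal, widened so that
// every character type can be compared against the same number.
template <typename CharT, std::size_t N>
constexpr unsigned long first_unit(const CharT (&s)[N]) {
    return static_cast<unsigned long>(s[0]);
}

// Hypothetical helper: number of code units, excluding the terminator.
template <typename CharT, std::size_t N>
constexpr std::size_t unit_count(const CharT (&)[N]) {
    return N - 1;
}

// U+0102 is a single code unit with value 0x0102 in UTF-16 and UTF-32.
static_assert(first_unit("" u"\u0102") == 0x0102, "u16_1 form");
static_assert(first_unit(u"" "\u0102") == 0x0102, "u16_2 form");
static_assert(first_unit(u"" u"\u0102") == 0x0102, "u16_3 form");
static_assert(unit_count(u"" "\u0102") == 1, "u16_2 is one code unit");

static_assert(first_unit("" U"\u0102") == 0x0102, "u32_1 form");
static_assert(first_unit(U"" "\u0102") == 0x0102, "u32_2 form");
static_assert(first_unit(U"" U"\u0102") == 0x0102, "u32_3 form");
static_assert(unit_count(U"" "\u0102") == 1, "u32_2 is one code unit");
```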

In addition to the above, translation phase 5 states that conversion is
unconditionally to the execution character set, but that is obviously
incorrect for string literals that have an /encoding-prefix/ present.

Swapping translation phases 5 and 6 is not a viable resolution because
that would interfere with processing of escape sequences. `"\33" "3"` is
not intended to be the same as `"\333"` and separate processing of
string literal components is necessary in some cases involving
/hexadecimal-escape-sequence/. While the /octal-escape-sequence/ case
can be worked around with `"\0333"`, the string intended by `"\xab" "c"`
cannot be represented as a single string literal.
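
The escape sequence cases can be checked directly. The sketch below
uses two helper names invented for illustration (length and unit); the
literals are the ones discussed above.

```cpp
#include <cstddef>

// Hypothetical helper: code units in a narrow literal, excluding the
// terminating null.
template <std::size_t N>
constexpr std::size_t length(const char (&)[N]) {
    return N - 1;
}

// Hypothetical helper: code unit i as an unsigned value, so the result
// does not depend on whether plain char is signed.
template <std::size_t N>
constexpr unsigned unit(const char (&s)[N], std::size_t i) {
    return static_cast<unsigned char>(s[i]);
}

// "\33" "3" is two code units: octal 33, then the character '3'.
static_assert(length("\33" "3") == 2, "two code units");
static_assert(unit("\33" "3", 0) == 033u, "first is octal 33");
static_assert(unit("\33" "3", 1) == unit("3", 0), "second is '3'");

// "\333" is a single code unit with value octal 333.
static_assert(length("\333") == 1, "one code unit");
static_assert(unit("\333", 0) == 0333u, "value is octal 333");

// An octal escape takes at most three digits, so "\0333" is the
// single-literal spelling of "\33" "3".
static_assert(length("\0333") == 2, "two code units");
static_assert(unit("\0333", 0) == 033u, "first is octal 33");
static_assert(unit("\0333", 1) == unit("3", 0), "second is '3'");

// A hexadecimal escape has no digit limit, so the components must stay
// separate: written as "\xabc", the 'c' would join the escape.
static_assert(length("\xab" "c") == 2, "two code units");
static_assert(unit("\xab" "c", 0) == 0xabu, "first is 0xAB");
static_assert(unit("\xab" "c", 1) == unit("c", 0), "second is 'c'");
```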

Richard Smith suggested a possible resolution at
https://groups.google.com/a/isocpp.org/forum/#!msg/std-discussion/qYf6treuLmY/dljWwyawCwAJ:

> Swapping phase 5 and 6 is certainly wrong. See the example in
> [lex.string]p12:
>
> "[Example:
> "\xA" "B"
> contains the two characters ’\xA’ and ’B’ after concatenation (and not
> the single hexadecimal character
> ’\xAB’). — end example]
>
> If you concatenate and then interpret escape sequences, you
> misinterpret escape sequences.
>
> Instead, I think we should remove phases 5 and 6 entirely, parse one
> or more string-literal tokens as a string literal expression, and only
> perform the translation from the contents of the string literal tokens
> into characters in the execution character set as part of specifying
> the semantics of a string literal expression.
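
To make the shape of that suggestion concrete, here is a toy model of
evaluating a string literal expression. Everything in it is an
assumption made for illustration: the names Prefix, Token, Value,
combine, append_token and evaluate are invented here and come from no
proposal or compiler; only hexadecimal and octal escapes are modeled;
ordinary characters stand for their own value, so no real transcoding
takes place; and every mixed-prefix combination is rejected, including
the conditionally-supported ones.

```cpp
#include <cstddef>

enum class Prefix { none, u8, u, U, L };

struct Token {
    Prefix prefix;
    const char* body;  // spelling between the quotes, escapes unprocessed
};

struct Value {
    Prefix prefix = Prefix::none;
    bool ok = true;  // false: rejected by this sketch
    std::size_t size = 0;
    unsigned long units[32] = {};
};

// Fold one token's prefix into the prefix chosen so far.
constexpr bool combine(Prefix& acc, Prefix next) {
    if (next == Prefix::none || next == acc) return true;
    if (acc == Prefix::none) { acc = next; return true; }
    return false;  // two different prefixes
}

constexpr int hex_digit(char c) {
    return (c >= '0' && c <= '9') ? c - '0'
         : (c >= 'a' && c <= 'f') ? c - 'a' + 10
         : (c >= 'A' && c <= 'F') ? c - 'A' + 10
         : -1;
}

// Translate ONE token. Escapes are scanned inside the token only, so
// they can never absorb characters from the following token.
constexpr void append_token(Value& out, const char* s) {
    while (*s != '\0') {
        unsigned long unit = 0;
        if (s[0] == '\\' && s[1] == 'x') {
            s += 2;
            while (hex_digit(*s) >= 0)
                unit = unit * 16 + static_cast<unsigned long>(hex_digit(*s++));
        } else if (s[0] == '\\' && s[1] >= '0' && s[1] <= '7') {
            ++s;
            for (int n = 0; n < 3 && *s >= '0' && *s <= '7'; ++n)
                unit = unit * 8 + static_cast<unsigned long>(*s++ - '0');
        } else {
            unit = static_cast<unsigned char>(*s++);
        }
        if (out.size == 32) { out.ok = false; return; }
        out.units[out.size++] = unit;
    }
}

// The whole expression: pick the prefix from all tokens first, then
// translate each token separately and join the results.
template <std::size_t N>
constexpr Value evaluate(const Token (&tokens)[N]) {
    Value v{};
    for (std::size_t i = 0; i < N; ++i)
        if (!combine(v.prefix, tokens[i].prefix)) { v.ok = false; return v; }
    for (std::size_t i = 0; i < N; ++i) append_token(v, tokens[i].body);
    return v;
}

// Models "\xA" u"B": prefix u, two units, the escape stays distinct.
constexpr Token example[] = {{Prefix::none, "\\xA"}, {Prefix::u, "B"}};
constexpr Value result = evaluate(example);
static_assert(result.ok && result.prefix == Prefix::u, "prefix is u");
static_assert(result.size == 2 && result.units[0] == 0xA, "kept distinct");
```

The point of the ordering in evaluate is that the encoding is known
before any token is translated, and that translation still happens per
token, which is what neither ordering of phases 5 and 6 provides.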

Tom.

Received on 2020-07-02 16:25:45