This issue was originally reported in https://lists.isocpp.org/core/2019/03/5770.php, but it looks like a core issue was never recorded. This request obsoletes the previous one.
(Mike, could we please, pretty please with a cup of moonshine and
a plate of chocolate truffles, get a new public facing publication
of the active issues list? At present, we can't rely on https://wg21.link
directing users to a paper with recently created issues or up to
date status information)
[lex.phases] states that string literals are converted to the execution character set prior to being concatenated:
5. Each basic source character set member in a character-literal or a string-literal, as well as each escape sequence and universal-character-name in a character-literal or a non-raw string literal, is converted to the corresponding member of the execution character set ([lex.ccon], [lex.string]); if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.
6. Adjacent string literal tokens are concatenated.
However, [lex.string]p11 specifies that encoding conversion is dependent on the presence of encoding-prefixes on the set of string literals that will be concatenated:
11 In translation phase 6 ([lex.phases]), adjacent string-literals are concatenated. If both string-literals have the same encoding-prefix, the resulting concatenated string-literal has that encoding-prefix. If one string-literal has no encoding-prefix, it is treated as a string-literal of the same encoding-prefix as the other operand. If a UTF-8 string literal token is adjacent to a wide string literal token, the program is ill-formed. Any other concatenations are conditionally-supported with implementation-defined behavior. [Note: This concatenation is an interpretation, not a conversion. Because the interpretation happens in translation phase 6 (after each character from a string-literal has been translated into a value from the appropriate character set), a string-literal's initial rawness has no effect on the interpretation or well-formedness of the concatenation. — end note ] Table 11 has some examples of valid concatenations.
Table 11: String literal concatenations
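| Source | Means | Source | Means | Source | Means |
|---|---|---|---|---|---|
| `u"a" u"b"` | `u"ab"` | `U"a" U"b"` | `U"ab"` | `L"a" L"b"` | `L"ab"` |
| `u"a" "b"` | `u"ab"` | `U"a" "b"` | `U"ab"` | `L"a" "b"` | `L"ab"` |
| `"a" u"b"` | `u"ab"` | `"a" U"b"` | `U"ab"` | `"a" L"b"` | `L"ab"` |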
Characters in concatenated strings are kept distinct. [ Example:
"\xA" "B"
contains the two characters '\xA' and 'B' after concatenation (and not the single hexadecimal character '\xAB'). — end example ]
The intent expressed by the prose and the note seems to be that string literals with different encoding-prefixes be separately converted and then joined together, such that the resulting string literal potentially consists of code unit sequences corresponding to different character encodings. However, that conflicts with the intent expressed by the table that specifies that, for example, `u"a" "b"` means the same as `u"ab"`.
There is implementation divergence. GCC and Clang implement the intent expressed in the table, but Visual C++ does not do so consistently.
The difference is illustrated at https://msvc.godbolt.org/z/Dcrgda using the code below:
const char8_t* u8_1 = "" u8"\u0102";
const char8_t* u8_2 = u8"" "\u0102";
const char8_t* u8_3 = u8"" u8"\u0102";
const char16_t* u16_1 = "" u"\u0102";
const char16_t* u16_2 = u"" "\u0102";
const char16_t* u16_3 = u"" u"\u0102";
const char32_t* u32_1 = "" U"\u0102";
const char32_t* u32_2 = U"" "\u0102";
const char32_t* u32_3 = U"" U"\u0102";
const wchar_t* w_1 = "" L"\u0102";
const wchar_t* w_2 = L"" "\u0102";
const wchar_t* w_3 = L"" L"\u0102";
GCC and Clang produce the same sequence of code units for all three variants of a given encoding-prefix, encoded as that encoding-prefix specifies. Visual C++ produces the same code unit sequence for the `*_1` and `*_3` variants, but not for the `*_2` variants. For the latter, it first encodes the second string literal component in the execution character set (substituting a '?' character if the character is not representable in that character set), and then re-encodes the resulting code unit sequence (interpreting it according to the locale the compiler is running in) using the encoding indicated by the encoding-prefix of the other string literal component. The double encoding performed by Visual C++, and in particular its use of the wrong source encoding for the second conversion, is presumably unintended.
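The behavior attributed to GCC and Clang above can be checked at compile time. The following is a minimal sketch, not part of the original report: `same_code_units` is a hypothetical helper introduced only for illustration, and the assertions are expected to hold only on an implementation that follows the table's interpretation. Going by the description above, the mixed-prefix checks may fail with Visual C++.

```cpp
#include <cstddef>

// Hypothetical helper (not from the report): compares two string literals
// code unit by code unit, including the terminating null.
template <typename CharT, std::size_t N, std::size_t M>
constexpr bool same_code_units(const CharT (&a)[N], const CharT (&b)[M]) {
    if (N != M) return false;
    for (std::size_t i = 0; i < N; ++i)
        if (a[i] != b[i]) return false;
    return true;
}

// An unprefixed component adjacent to a prefixed one is treated as if it
// had the same encoding-prefix, so each pair should be identical.
static_assert(same_code_units("" u"\u0102", u"" u"\u0102"),
              "prefix on the second component only (the *_1 form)");
static_assert(same_code_units(u"" "\u0102", u"" u"\u0102"),
              "prefix on the first component only (the *_2 form)");
static_assert(same_code_units(U"" "\u0102", U"" U"\u0102"),
              "same check for char32_t");
static_assert(same_code_units(L"" "\u0102", L"" L"\u0102"),
              "same check for wchar_t");

// U+0102 is a single UTF-16 code unit, followed by the terminating null.
static_assert(sizeof(u"" "\u0102") == 2 * sizeof(char16_t),
              "one code unit plus the terminating null");
```

The `char8_t` variants are omitted so that the sketch also compiles in language modes earlier than C++20.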
In addition to the above, translation phase 5 states that conversion is unconditionally to the execution character set, but that is obviously incorrect for string literals that have an encoding-prefix present.
Swapping translation phases 5 and 6 is not a viable resolution
because that would interfere with processing of escape sequences.
`"\33" "3"` is not intended to be the same as `"\333"` and
separate processing of string literal components is necessary in
some cases involving hexadecimal-escape-sequence. While
the octal-escape-sequence case can be worked around with
`"\0333"`, the string intended by `"\xab" "c"` cannot be
represented as a single string literal.
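The escape sequence cases above can be sketched as compile-time checks. This is an illustration under the assumption that the implementation processes escape sequences in each string literal component before concatenation; `same_literal` is a hypothetical helper, not something from the standard or the original discussion.

```cpp
#include <cstddef>

// Hypothetical helper: compares two string literals element by element,
// including the terminating null.
template <typename CharT, std::size_t N, std::size_t M>
constexpr bool same_literal(const CharT (&a)[N], const CharT (&b)[M]) {
    if (N != M) return false;
    for (std::size_t i = 0; i < N; ++i)
        if (a[i] != b[i]) return false;
    return true;
}

// "\33" "3" is the character with octal value 33 followed by '3'.
static_assert(sizeof("\33" "3") == 3, "two characters plus the null");
// "\333" is a single character with octal value 333.
static_assert(sizeof("\333") == 2, "one character plus the null");
static_assert(!same_literal("\33" "3", "\333"),
              "concatenation must not merge the escape with the next digit");

// An octal-escape-sequence takes at most three digits, so "\0333" is an
// equivalent single-literal spelling.
static_assert(same_literal("\33" "3", "\0333"), "octal workaround");

// A hexadecimal-escape-sequence has no digit limit, so "\xabc" would be
// parsed as one escape; only the concatenated form yields 0xAB then 'c'.
static_assert(sizeof("\xab" "c") == 3, "two characters plus the null");
static_assert(static_cast<unsigned char>(("\xab" "c")[0]) == 0xab,
              "first character is 0xAB");
static_assert(("\xab" "c")[1] == 'c', "second character is 'c'");
```

If phases 5 and 6 were swapped, the first and last groups of assertions would no longer describe the resulting strings.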
Richard Smith suggested a possible resolution at https://groups.google.com/a/isocpp.org/forum/#!msg/std-discussion/qYf6treuLmY/dljWwyawCwAJ:
Swapping phase 5 and 6 is certainly wrong. See the example in [lex.string]p12:
[Example: "\xA" "B"
contains the two characters '\xA' and 'B' after concatenation (and not
the single hexadecimal character
'\xAB'). — end example]
If you concatenate and then interpret escape sequences, you
misinterpret escape sequences.
Instead, I think we should remove phases 5 and 6 entirely, parse one
or more string-literal tokens as a string literal expression, and only
perform the translation from the contents of the string literal tokens
into characters in the execution character set as part of specifying
the semantics of a string literal expression.