SG16

Subject: Re: [isocpp-core] New core issue: [lex.phases]: The order of translation phases 5 and 6 contradict the [lex.string] specification for concatenation of string literals
From: Tom Honermann (tom_at_[hidden])
Date: 2020-07-13 21:21:24


Mike, can you please confirm receipt of this issue request?

Tom.

On 7/2/20 5:22 PM, Tom Honermann via Core wrote:
>
> This issue was originally reported in
> https://lists.isocpp.org/core/2019/03/5770.php, but it looks like a
> core issue was never recorded.  This request obsoletes the previous one.
>
> (Mike, could we please, pretty please with a cup of moonshine and a
> plate of chocolate truffles, get a new public facing publication of
> the active issues list?  At present, we can't rely on
> https://wg21.link directing users to a paper with recently created
> issues or up-to-date status information.)
>
> [lex.phases] <http://eel.is/c++draft/lex.phases> states that string
> literals are converted to the execution character set prior to being
> concatenated:
>
>> 5. Each basic source character set member in a /character-literal/ or
>> a /string-literal/, as well as each escape sequence and
>> /universal-character-name/ in a /character-literal/ or a non-raw
>> string literal, is converted to the corresponding member of the
>> execution character set ([lex.ccon]
>> <http://eel.is/c++draft/lex.ccon>, [lex.string]
>> <http://eel.is/c++draft/lex.string>); if there is no corresponding
>> member, it is converted to an implementation-defined member other
>> than the null (wide) character.
>> 6. Adjacent string literal tokens are concatenated.
> However, [lex.string]p11 <http://eel.is/c++draft/lex.string#11>
> specifies that encoding conversion is dependent on the presence of
> /encoding-prefix/es on the set of string literals that will be
> concatenated:
>
>> 11 In translation phase 6 ([lex.phases]
>> <http://eel.is/c++draft/lex.phases>), adjacent /string-literal/s are
>> concatenated.  If both /string-literal/s have the same
>> /encoding-prefix/, the resulting concatenated /string-literal/ has
>> that /encoding-prefix/. If one /string-literal/ has no
>> /encoding-prefix/, it is treated as a /string-literal/ of the same
>> /encoding-prefix/ as the other operand.  If a UTF-8 string literal
>> token is adjacent to a wide string literal token, the program is
>> ill-formed.  Any other concatenations are conditionally-supported
>> with implementation-defined behavior.  [/Note:/ This concatenation is
>> an interpretation, not a conversion. Because the interpretation
>> happens in translation phase 6 (after each character from a
>> string-literal has been translated into a value from the appropriate
>> character set), a /string-literal/'s initial rawness has no effect on
>> the interpretation or well-formedness of the concatenation.  — /end
>> note/ ] Table 11 has some examples of valid concatenations.
>>
>> Table 11 <http://eel.is/c++draft/lex.string#tab:lex.string.concat>:
>> String literal concatenations [tab:lex.string.concat]
>>
>> Source       Means     Source       Means     Source       Means
>> u"a" u"b"    u"ab"     U"a" U"b"    U"ab"     L"a" L"b"    L"ab"
>> u"a" "b"     u"ab"     U"a" "b"     U"ab"     L"a" "b"     L"ab"
>> "a" u"b"     u"ab"     "a" U"b"     U"ab"     "a" L"b"     L"ab"
>>
>> Characters in concatenated strings are kept distinct.
>>
>> [/Example:/
>>
>>    "\xA" "B"
>>
>> contains the two characters '\xA' and 'B' after concatenation (and
>> not the single hexadecimal character '\xAB').  — /end example/ ]
>
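
As a rough illustration of the prefix rules in the quoted [lex.string]p11
wording, here is a minimal sketch. The function name `combined_prefix` and
the use of an empty string for "no encoding-prefix" are assumptions made
for this example; nothing like it appears in the standard or in any
compiler.

```cpp
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the encoding-prefix of the concatenation of
// two adjacent string literals, following the quoted [lex.string]p11 text.
// An empty string stands for "no encoding-prefix".
std::optional<std::string> combined_prefix(const std::string& a,
                                           const std::string& b) {
    if (a == b) return a;     // same prefix: the result keeps it
    if (a.empty()) return b;  // an unprefixed literal adopts the other prefix
    if (b.empty()) return a;
    if ((a == "u8" && b == "L") || (a == "L" && b == "u8"))
        throw std::invalid_argument(
            "ill-formed: UTF-8 literal adjacent to wide literal");
    // Any other mix is conditionally-supported with implementation-defined
    // behavior, so this sketch declines to name a result.
    return std::nullopt;
}
```

For example, `combined_prefix("u", "")` yields `"u"`, which matches the
`u"a" "b"` row of Table 11.
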
> The intent expressed by the prose and the note seems to be that string
> literals with different /encoding-prefix/es be separately converted
> and then joined together, such that the resulting string literal
> potentially consists of code unit sequences corresponding to different
> character encodings. However, that conflicts with the intent expressed
> by the table that specifies that, for example, `u"a" "b"` means the
> same as `u"ab"`.
>
> There is implementation divergence.  GCC and Clang implement the
> intent expressed in the table, but Visual C++ implements ... something
> else.
>
> The difference is illustrated at https://msvc.godbolt.org/z/Dcrgda
> using the code below:
>
>> const char8_t* u8_1 = "" u8"\u0102";
>> const char8_t* u8_2 = u8"" "\u0102";
>> const char8_t* u8_3 = u8"" u8"\u0102";
>>
>> const char16_t* u16_1 = "" u"\u0102";
>> const char16_t* u16_2 = u"" "\u0102";
>> const char16_t* u16_3 = u"" u"\u0102";
>>
>> const char32_t* u32_1 = "" U"\u0102";
>> const char32_t* u32_2 = U"" "\u0102";
>> const char32_t* u32_3 = U"" U"\u0102";
>>
>> const wchar_t* w_1 = "" L"\u0102";
>> const wchar_t* w_2 = L"" "\u0102";
>> const wchar_t* w_3 = L"" L"\u0102";
>
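
To inspect the results without reading assembly output, a small helper can
print the code units of each literal. This is only a sketch: the names
`dump_code_units` and `dump_all` are made up for this example, and the
char8_t cases are left out so that it also builds before C++20.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: prints every code unit of a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
void dump_code_units(const char* label, const CharT (&s)[N]) {
    std::printf("%s:", label);
    for (std::size_t i = 0; i + 1 < N; ++i)
        std::printf(" 0x%lX", static_cast<unsigned long>(s[i]));
    std::printf("\n");
}

// Prints the *_1, *_2 and *_3 variants for the u, U and L prefixes.
inline void dump_all() {
    dump_code_units("u16_1", "" u"\u0102");
    dump_code_units("u16_2", u"" "\u0102");
    dump_code_units("u16_3", u"" u"\u0102");
    dump_code_units("u32_1", "" U"\u0102");
    dump_code_units("u32_2", U"" "\u0102");
    dump_code_units("u32_3", U"" U"\u0102");
    dump_code_units("w_1", "" L"\u0102");
    dump_code_units("w_2", L"" "\u0102");
    dump_code_units("w_3", L"" L"\u0102");
}
```

Calling `dump_all()` under each compiler makes any difference between the
*_2 lines and the others directly visible.
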
> GCC and Clang produce the same code unit sequence for all three
> variants of each /encoding-prefix/, encoded according to that
> /encoding-prefix/.  Visual C++ produces the same code unit sequence
> for the *_1 and *_3 variants, but not for the *_2 variants.  For the
> latter, it first encodes the second string literal component in the
> execution character set (substituting a '?' character if the character
> is not representable in that character set), and then re-encodes that
> code unit sequence (interpreting it according to the current locale
> the compiler is operating in) using the encoding indicated by the
> /encoding-prefix/ of the other string literal component.  The double
> encoding behavior of the Visual C++ compiler, and in particular the
> use of the wrong source encoding for the second conversion, is
> presumably unintended.
>
> In addition to the above, translation phase 5 states that conversion
> is unconditionally to the execution character set, but that is
> obviously incorrect for string literals that have an /encoding-prefix/
> present.
>
> Swapping translation phases 5 and 6 is not a viable resolution because
> that would interfere with processing of escape sequences.  `"\33" "3"`
> is not intended to be the same as `"\333"` and separate processing of
> string literal components is necessary in some cases involving
> /hexadecimal-escape-sequence/. While the /octal-escape-sequence/ case
> can be worked around with `"\0333"`, the string intended by `"\xab"
> "c"` cannot be represented as a single string literal.
>
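
The escape sequence cases above can be checked at compile time. The
following is a small sketch; the array names are chosen only for this
example.

```cpp
// "\33" "3": ESC followed by '3' (two characters plus the terminator).
constexpr char octal_split[] = "\33" "3";
// Workaround from the text: an octal escape takes at most three digits,
// so "\033" ends before the final '3'.
constexpr char octal_single[] = "\0333";
// "\xab" "c": the hex escape ends at the token boundary, so 'c' stays
// a separate character.
constexpr char hex_split[] = "\xab" "c";

static_assert(sizeof(octal_split) == 3, "ESC, '3', terminator");
static_assert(octal_split[0] == '\33' && octal_split[1] == '3', "");
static_assert(sizeof(octal_single) == sizeof(octal_split), "");
static_assert(octal_single[0] == octal_split[0] &&
              octal_single[1] == octal_split[1], "");
static_assert(sizeof(hex_split) == 3, "'\\xab', 'c', terminator");
static_assert(hex_split[0] == '\xab' && hex_split[1] == 'c', "");
```

The single-token spelling of the hexadecimal case is deliberately absent:
a hexadecimal escape consumes every following hex digit, so the 'c' would
become part of the escape.
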
> Richard Smith suggested a possible resolution at
> https://groups.google.com/a/isocpp.org/forum/#!msg/std-discussion/qYf6treuLmY/dljWwyawCwAJ:
>
>> Swapping phase 5 and 6 is certainly wrong. See the example in
>> [lex.string]p12:
>>
>> "[Example:
>> "\xA" "B"
>> contains the two characters ’\xA’ and ’B’ after concatenation (and not
>> the single hexadecimal character
>> ’\xAB’). — end example]
>>
>> If you concatenate and then interpret escape sequences, you
>> misinterpret escape sequences.
>>
>> Instead, I think we should remove phases 5 and 6 entirely, parse one
>> or more string-literal tokens as a string literal expression, and only
>> perform the translation from the contents of the string literal tokens
>> into characters in the execution character set as part of specifying
>> the semantics of a string literal expression.
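
The ordering problem Richard describes can be shown with a toy model. This
is a hypothetical sketch, not how any compiler is implemented: the decoder
handles only ordinary characters and \x escapes, and all of the function
names are invented for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the value of a hex digit, or -1 if c is not a hex digit.
int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Toy decoder: appends the character values of one string-literal body
// (the spelling between the quotes) to `out`.
void decode_body(const std::string& body, std::vector<unsigned>& out) {
    for (std::size_t i = 0; i < body.size();) {
        if (body[i] == '\\' && i + 1 < body.size() && body[i + 1] == 'x') {
            i += 2;
            unsigned value = 0;
            // A hexadecimal escape consumes every following hex digit.
            while (i < body.size() && hex_value(body[i]) >= 0)
                value = value * 16 + static_cast<unsigned>(hex_value(body[i++]));
            out.push_back(value);
        } else {
            out.push_back(static_cast<unsigned char>(body[i++]));
        }
    }
}

// Decode each token on its own, then join the results.
std::vector<unsigned> decode_then_concatenate(
        const std::vector<std::string>& tokens) {
    std::vector<unsigned> out;
    for (const auto& t : tokens) decode_body(t, out);
    return out;
}

// Join the spellings first, then decode: the order that misreads escapes.
std::vector<unsigned> concatenate_then_decode(
        const std::vector<std::string>& tokens) {
    std::string joined;
    for (const auto& t : tokens) joined += t;
    std::vector<unsigned> out;
    decode_body(joined, out);
    return out;
}
```

With the token bodies `\xA` and `B`, the first function yields two values
(0xA and the value of 'B'), while the second yields the single value 0xAB.
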
>
> Tom.
>
>
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2020/07/9537.php



SG16 list run by sg16-owner@lists.isocpp.org