
SG16


Subject: Re: [isocpp-core] New core issue: [lex.phases]: The order of translation phases 5 and 6 contradict the [lex.string] specification for concatenation of string literals
From: Tom Honermann (tom_at_[hidden])
Date: 2020-08-10 21:40:39


Mike, can you please provide an update on this request?

Tom.

On 7/13/20 10:21 PM, Tom Honermann via Core wrote:
> Mike, can you please confirm receipt of this issue request?
>
> Tom.
>
> On 7/2/20 5:22 PM, Tom Honermann via Core wrote:
>>
>> This issue was originally reported in
>> https://lists.isocpp.org/core/2019/03/5770.php, but it looks like a
>> core issue was never recorded.  This request obsoletes the previous one.
>>
>> (Mike, could we please, pretty please with a cup of moonshine and a
>> plate of chocolate truffles, get a new public facing publication of
>> the active issues list?  At present, we can't rely on
>> https://wg21.link directing users to a paper with recently created
>> issues or up-to-date status information.)
>>
>> [lex.phases] <http://eel.is/c++draft/lex.phases> states that string
>> literals are converted to the execution character set prior to being
>> concatenated:
>>
>>> 5. Each basic source character set member in a /character-literal/ or
>>> a /string-literal/, as well as each escape sequence and
>>> /universal-character-name/ in a /character-literal/ or a non-raw
>>> string literal, is converted to the corresponding member of the
>>> execution character set ([lex.ccon]
>>> <http://eel.is/c++draft/lex.ccon>, [lex.string]
>>> <http://eel.is/c++draft/lex.string>); if there is no corresponding
>>> member, it is converted to an implementation-defined member other
>>> than the null (wide) character.
>>> 6. Adjacent string literal tokens are concatenated.
>> However, [lex.string]p11 <http://eel.is/c++draft/lex.string#11>
>> specifies that encoding conversion is dependent on the presence of
>> /encoding-prefix/es on the set of string literals that will be
>> concatenated:
>>
>>> 11 In translation phase 6 ([lex.phases]
>>> <http://eel.is/c++draft/lex.phases>), adjacent /string-literal/s are
>>> concatenated.  If both /string-literal/s have the same
>>> /encoding-prefix/, the resulting concatenated /string-literal/ has
>>> that /encoding-prefix/. If one /string-literal/ has no
>>> /encoding-prefix/, it is treated as a /string-literal/ of the same
>>> /encoding-prefix/ as the other operand.  If a UTF-8 string literal
>>> token is adjacent to a wide string literal token, the program is
>>> ill-formed.  Any other concatenations are conditionally-supported
>>> with implementation-defined behavior. [/Note:/ This concatenation is
>>> an interpretation, not a conversion.  Because the interpretation
>>> happens in translation phase 6 (after each character from a
>>> string-literal has been translated into a value from the appropriate
>>> character set), a /string-literal/'s initial rawness has no effect
>>> on the interpretation or well-formedness of the concatenation.  —
>>> /end note/ ]  Table 11 has some examples of valid concatenations.
>>>
>>> Table 11 <http://eel.is/c++draft/lex.string#tab:lex.string.concat>:
>>> String literal concatenations
>>> [tab:lex.string.concat]
>>>
>>> Source         Means
>>> u"a" u"b"      u"ab"
>>> u"a" "b"       u"ab"
>>> "a" u"b"       u"ab"
>>> U"a" U"b"      U"ab"
>>> U"a" "b"       U"ab"
>>> "a" U"b"       U"ab"
>>> L"a" L"b"      L"ab"
>>> L"a" "b"       L"ab"
>>> "a" L"b"       L"ab"
>>>
>>> Characters in concatenated strings are kept distinct.
>>>
>>> [/Example:/
>>>
>>>    "\xA" "B"
>>>
>>> contains the two characters '\xA' and 'B' after concatenation (and
>>> not the single hexadecimal character '\xAB').  — /end example/ ]
>>
>> The intent expressed by the prose and the note seems to be that
>> string literals with different /encoding-prefix/es be separately
>> converted and then joined together, such that the resulting string
>> literal potentially consists of code unit sequences corresponding to
>> different character encodings. However, that conflicts with the
>> intent expressed by the table that specifies that, for example, `u"a"
>> "b"` means the same as `u"ab"`.
>>
>> There is implementation divergence.  GCC and Clang implement the
>> intent expressed in the table, but Visual C++ implements ...
>> something else.
>>
>> The difference is illustrated at https://msvc.godbolt.org/z/Dcrgda
>> using the code below:
>>
>>> const char8_t* u8_1 = "" u8"\u0102";
>>> const char8_t* u8_2 = u8"" "\u0102";
>>> const char8_t* u8_3 = u8"" u8"\u0102";
>>>
>>> const char16_t* u16_1 = "" u"\u0102";
>>> const char16_t* u16_2 = u"" "\u0102";
>>> const char16_t* u16_3 = u"" u"\u0102";
>>>
>>> const char32_t* u32_1 = "" U"\u0102";
>>> const char32_t* u32_2 = U"" "\u0102";
>>> const char32_t* u32_3 = U"" U"\u0102";
>>>
>>> const wchar_t* w_1 = "" L"\u0102";
>>> const wchar_t* w_2 = L"" "\u0102";
>>> const wchar_t* w_3 = L"" L"\u0102";
>>
>> For each /encoding-prefix/, GCC and Clang produce the same code unit
>> sequence for all three variants, encoded according to that
>> /encoding-prefix/.  Visual C++ produces the same code unit sequence
>> for the *_1 and *_3 variants, but not for the *_2 variants.  For the
>> latter, it first encodes the second string literal component to the
>> execution character set (substituting a '?' character if the
>> character is not representable in that character set), and then
>> re-encodes that code unit sequence (interpreting it according to the
>> current locale the compiler is operating in) using the encoding
>> indicated by the /encoding-prefix/ of the other string literal
>> component.  The double encoding behavior of the Visual C++ compiler,
>> and in particular the use of the wrong source encoding for the second
>> conversion, are presumably unintended behaviors.
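
One way to check the divergence described above without reading assembly output is to compare the code units of the three spellings at run time. This is only a minimal sketch, assuming a C++11 compiler. The helper names `same_units`, `utf16_forms_agree` and `utf32_forms_agree` are hypothetical and do not come from the original report. It covers only the `u` and `U` prefixes so that it does not depend on `char8_t`.

```cpp
#include <string>

// Hypothetical helper: true when both null-terminated literals
// consist of the same sequence of code units.
template <typename CharT>
bool same_units(const CharT* a, const CharT* b) {
    return std::basic_string<CharT>(a) == std::basic_string<CharT>(b);
}

// Compares the *_1 and *_2 spellings against the *_3 spelling for UTF-16.
bool utf16_forms_agree() {
    const char16_t* u16_1 = "" u"\u0102";   // prefix on the second literal
    const char16_t* u16_2 = u"" "\u0102";   // prefix on the first literal
    const char16_t* u16_3 = u"" u"\u0102";  // prefix on both literals
    return same_units(u16_1, u16_3) && same_units(u16_2, u16_3);
}

// Same comparison for UTF-32.
bool utf32_forms_agree() {
    const char32_t* u32_1 = "" U"\u0102";
    const char32_t* u32_2 = U"" "\u0102";
    const char32_t* u32_3 = U"" U"\u0102";
    return same_units(u32_1, u32_3) && same_units(u32_2, u32_3);
}
```

Under the reading given by Table 11, both functions would be expected to return true. Going by the behavior reported above, an implementation that encodes the unprefixed component separately could return false for the `*_2` comparison.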
>>
>> In addition to the above, translation phase 5 states that conversion
>> is unconditionally to the execution character set, but that is
>> obviously incorrect for string literals that have an
>> /encoding-prefix/ present.
>>
>> Swapping translation phases 5 and 6 is not a viable resolution
>> because that would interfere with processing of escape sequences. 
>> `"\33" "3"` is not intended to be the same as `"\333"` and separate
>> processing of string literal components is necessary in some cases
>> involving /hexadecimal-escape-sequence/. While the
>> /octal-escape-sequence/ case can be worked around with `"\0333"`, the
>> string intended by `"\xab" "c"` cannot be represented as a single
>> string literal.
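
The escape sequence point above can be checked with `sizeof`, which counts the terminating null character of a string literal. The sketch below assumes a C++11 compiler with an 8-bit `char`. The constant names and the function `octal_workaround_matches` are made up for illustration.

```cpp
#include <cstddef>
#include <string>

// Escape sequences are interpreted per literal, before concatenation.
constexpr std::size_t separate  = sizeof("\33" "3");   // '\33', '3', null
constexpr std::size_t merged    = sizeof("\333");      // '\333', null
constexpr std::size_t padded    = sizeof("\0333");     // '\033', '3', null
constexpr std::size_t hex_split = sizeof("\xab" "c");  // '\xab', 'c', null

static_assert(separate == 3, "two characters plus the terminator");
static_assert(merged == 2, "one character plus the terminator");
static_assert(padded == 3, "an octal escape consumes at most three digits");
static_assert(hex_split == 3, "the hex escape ends at the literal boundary");

// The three-digit octal spelling yields the same characters as the
// concatenated form.
inline bool octal_workaround_matches() {
    return std::string("\33" "3") == std::string("\0333");
}
```

The hexadecimal case has no comparable digit limit: in a single literal, `\xabc` would be read as one escape sequence, which is why the split form matters there.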
>>
>> Richard Smith suggested a possible resolution at
>> https://groups.google.com/a/isocpp.org/forum/#!msg/std-discussion/qYf6treuLmY/dljWwyawCwAJ:
>>
>>> Swapping phase 5 and 6 is certainly wrong. See the example in
>>> [lex.string]p12:
>>>
>>> "[Example:
>>> "\xA" "B"
>>> contains the two characters ’\xA’ and ’B’ after concatenation (and not
>>> the single hexadecimal character
>>> ’\xAB’). — end example]
>>>
>>> If you concatenate and then interpret escape sequences, you
>>> misinterpret escape sequences.
>>>
>>> Instead, I think we should remove phases 5 and 6 entirely, parse one
>>> or more string-literal tokens as a string literal expression, and only
>>> perform the translation from the contents of the string literal tokens
>>> into characters in the execution character set as part of specifying
>>> the semantics of a string literal expression.
>>
>> Tom.
>>
>>
>> _______________________________________________
>> Core mailing list
>> Core_at_[hidden]
>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
>> Link to this post: http://lists.isocpp.org/core/2020/07/9537.php
>
>
>
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2020/07/9569.php


