SG16

Subject: Re: [isocpp-core] New core issue: [lex.phases]: The order of translation phases 5 and 6 contradict the [lex.string] specification for concatenation of string literals
From: Tom Honermann (tom_at_[hidden])
Date: 2020-08-23 23:01:39


Thanks, Mike, I see this has been filed as CWG #2455.  Much appreciated!

Tom.

On 8/11/20 8:28 AM, William M. (Mike) Miller via Core wrote:
> On Mon, Aug 10, 2020 at 10:40 PM Tom Honermann <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
>
> Mike, can you please provide an update on this request?
>
>
> I'm hoping to get a new revision of the issues list out in time for
> the CWG teleconference next Monday.
>
> Tom.
>
> On 7/13/20 10:21 PM, Tom Honermann via Core wrote:
>> Mike, can you please confirm receipt of this issue request?
>>
>> Tom.
>>
>> On 7/2/20 5:22 PM, Tom Honermann via Core wrote:
>>>
>>> This issue was originally reported in
>>> https://lists.isocpp.org/core/2019/03/5770.php, but it looks
>>> like a core issue was never recorded. This request obsoletes the
>>> previous one.
>>>
>>> (Mike, could we please, pretty please with a cup of moonshine
>>> and a plate of chocolate truffles, get a new public facing
>>> publication of the active issues list?  At present, we can't
>>> rely on https://wg21.link directing users to a paper with
>>> recently created issues or up to date status information)
>>>
>>> [lex.phases] <http://eel.is/c++draft/lex.phases> states that
>>> string literals are converted to the execution character set
>>> prior to being concatenated:
>>>
>>>> 5. Each basic source character set member in a
>>>> /character-literal/ or a /string-literal/, as well as each
>>>> escape sequence and /universal-character-name/ in a
>>>> /character-literal/ or a non-raw string literal, is converted
>>>> to the corresponding member of the execution character set
>>>> ([lex.ccon] <http://eel.is/c++draft/lex.ccon>, [lex.string]
>>>> <http://eel.is/c++draft/lex.string>); if there is no
>>>> corresponding member, it is converted to an
>>>> implementation-defined member other than the null (wide) character.
>>>> 6. Adjacent string literal tokens are concatenated.
>>> However, [lex.string]p11 <http://eel.is/c++draft/lex.string#11>
>>> specifies that encoding conversion is dependent on the presence
>>> of /encoding-prefix/es on the set of string literals that will
>>> be concatenated:
>>>
>>>> 11 In translation phase 6 ([lex.phases]
>>>> <http://eel.is/c++draft/lex.phases>), adjacent
>>>> /string-literal/s are concatenated. If both /string-literal/s
>>>> have the same /encoding-prefix/, the resulting concatenated
>>>> /string-literal/ has that /encoding-prefix/.  If one
>>>> /string-literal/ has no /encoding-prefix/, it is treated as a
>>>> /string-literal/ of the same /encoding-prefix/ as the other
>>>> operand.  If a UTF-8 string literal token is adjacent to a wide
>>>> string literal token, the program is ill-formed.  Any other
>>>> concatenations are conditionally-supported with
>>>> implementation-defined behavior.  [/Note:/ This concatenation
>>>> is an interpretation, not a conversion.  Because the
>>>> interpretation happens in translation phase 6 (after each
>>>> character from a string-literal has been translated into a
>>>> value from the appropriate character set), a /string-literal/'s
>>>> initial rawness has no effect on the interpretation or
>>>> well-formedness of the concatenation.  — /end note/ ]  Table 11
>>>> has some examples of valid concatenations.
>>>>
>>>> Table 11
>>>> <http://eel.is/c++draft/lex.string#tab:lex.string.concat>:
>>>> String literal concatenations
>>>> [tab:lex.string.concat]
>>>>
>>>> Source       Means
>>>> u"a" u"b"    u"ab"
>>>> u"a" "b"     u"ab"
>>>> "a" u"b"     u"ab"
>>>>
>>>> Source       Means
>>>> U"a" U"b"    U"ab"
>>>> U"a" "b"     U"ab"
>>>> "a" U"b"     U"ab"
>>>>
>>>> Source       Means
>>>> L"a" L"b"    L"ab"
>>>> L"a" "b"     L"ab"
>>>> "a" L"b"     L"ab"
>>>>
>>>> Characters in concatenated strings are kept distinct.
>>>>
>>>> [/Example:/
>>>>
>>>>    "\xA" "B"
>>>>
>>>> contains the two characters '\xA' and 'B' after concatenation
>>>> (and not the single hexadecimal character '\xAB').  — /end
>>>> example/ ]
>>>
>>> The intent expressed by the prose and the note seems to be that
>>> string literals with different /encoding-prefix/es be separately
>>> converted and then joined together, such that the resulting
>>> string literal potentially consists of code unit sequences
>>> corresponding to different character encodings.  However, that
>>> conflicts with the intent expressed by the table that specifies
>>> that, for example, `u"a" "b"` means the same as `u"ab"`.
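
The reading expressed by the table can be checked directly. The following is a minimal sketch, assuming a C++17 (or later) compiler; the helper name table11_u_rows_agree is hypothetical and does not come from the thread. It uses only basic source characters, so it does not touch the \u0102 divergence discussed later.

```cpp
#include <cassert>
#include <string_view>
#include <type_traits>

// Whichever operand carries the u prefix, the concatenated literal is an
// array of three char16_t: 'a', 'b', and the terminating null.
static_assert(std::is_same_v<decltype(u"a" "b"), const char16_t (&)[3]>,
              "prefix on the left operand only");
static_assert(std::is_same_v<decltype("a" u"b"), const char16_t (&)[3]>,
              "prefix on the right operand only");

// Hypothetical helper: true when the three spellings in the first group of
// Table 11 all yield the same code units as u"ab".
inline bool table11_u_rows_agree() {
    const std::u16string_view expected = u"ab";
    return std::u16string_view(u"a" u"b") == expected &&
           std::u16string_view(u"a" "b") == expected &&
           std::u16string_view("a" u"b") == expected;
}

// Convenience wrapper so the check can be invoked from any main().
inline void check_table11_u_rows() { assert(table11_u_rows_agree()); }
```

The same pattern should apply to the U and L groups of the table by swapping the prefix and the character type.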
>>>
>>> There is implementation divergence.  GCC and Clang implement the
>>> intent expressed in the table, but Visual C++ implements ...
>>> something else.
>>>
>>> The difference is illustrated at
>>> https://msvc.godbolt.org/z/Dcrgda using the code below:
>>>
>>>> const char8_t* u8_1 = "" u8"\u0102";
>>>> const char8_t* u8_2 = u8"" "\u0102";
>>>> const char8_t* u8_3 = u8"" u8"\u0102";
>>>>
>>>> const char16_t* u16_1 = "" u"\u0102";
>>>> const char16_t* u16_2 = u"" "\u0102";
>>>> const char16_t* u16_3 = u"" u"\u0102";
>>>>
>>>> const char32_t* u32_1 = "" U"\u0102";
>>>> const char32_t* u32_2 = U"" "\u0102";
>>>> const char32_t* u32_3 = U"" U"\u0102";
>>>>
>>>> const wchar_t* w_1 = "" L"\u0102";
>>>> const wchar_t* w_2 = L"" "\u0102";
>>>> const wchar_t* w_3 = L"" L"\u0102";
>>>
>>> GCC and Clang produce the same sequence of code units for each
>>> combination of /encoding-prefix/es, encoded according to the
>>> /encoding-prefix/ that is present. Visual C++ produces the same
>>> code unit sequence for the *_1 and *_3 variants, but not for the
>>> *_2 variants; for the latter, the second string literal component
>>> is first encoded to the execution character set (substituting a
>>> '?' character if the character is not representable in that
>>> character set), and that code unit sequence is then re-encoded
>>> (interpreting it according to the current locale the compiler is
>>> operating in) using the encoding indicated by the
>>> /encoding-prefix/ of the other string literal component.  The
>>> double encoding behavior of the Visual C++ compiler, and in
>>> particular the use of the wrong source encoding for the second
>>> conversion, are presumably unintended.
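
A way to observe the behavior described above without reading assembly is to compare the code units at run time. This is a minimal sketch for the char16_t and char32_t cases only (so it does not need char8_t support); the function names are hypothetical. Per the description above, the comparisons are expected to hold with GCC and Clang, and the *_2 comparison may fail with the Visual C++ behavior reported in the thread.

```cpp
#include <cassert>
#include <string>

// Hypothetical helpers mirroring the *_1, *_2 and *_3 variants of the test
// code: each returns the code units produced for U+0102.
inline std::u16string u16_prefix_second() { return "" u"\u0102"; }   // u16_1
inline std::u16string u16_prefix_first()  { return u"" "\u0102"; }   // u16_2
inline std::u16string u16_prefix_both()   { return u"" u"\u0102"; }  // u16_3

inline std::u32string u32_prefix_second() { return "" U"\u0102"; }   // u32_1
inline std::u32string u32_prefix_first()  { return U"" "\u0102"; }   // u32_2
inline std::u32string u32_prefix_both()   { return U"" U"\u0102"; }  // u32_3

// True when all three spellings yield the same code unit sequence.
inline bool u16_variants_agree() {
    return u16_prefix_second() == u16_prefix_both() &&
           u16_prefix_first() == u16_prefix_both();
}

inline bool u32_variants_agree() {
    return u32_prefix_second() == u32_prefix_both() &&
           u32_prefix_first() == u32_prefix_both();
}

// Convenience wrapper; assumes an implementation that follows Table 11.
inline void check_variants_agree() {
    assert(u16_variants_agree());
    assert(u32_variants_agree());
}
```

Printing the individual code units of u16_prefix_first() instead of asserting would show what a given compiler actually substituted.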
>>>
>>> In addition to the above, translation phase 5 states that
>>> conversion is unconditionally to the execution character set,
>>> but that is obviously incorrect for string literals that have an
>>> /encoding-prefix/ present.
>>>
>>> Swapping translation phases 5 and 6 is not a viable resolution
>>> because that would interfere with processing of escape
>>> sequences.  `"\33" "3"` is not intended to be the same as
>>> `"\333"` and separate processing of string literal components is
>>> necessary in some cases involving /hexadecimal-escape-sequence/.
>>> While the /octal-escape-sequence/ case can be worked around with
>>> `"\0333"`, the string intended by `"\xab" "c"` cannot be
>>> represented as a single string literal.
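
The escape sequence cases in the paragraph above can be sketched as follows, assuming a C++17 compiler. The helper names (view and the three predicates) are hypothetical; the literals are the ones from the text and from the [lex.string] example.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: view a narrow string literal without its terminating
// null, so that embedded values are compared by length rather than by strlen.
template <std::size_t N>
constexpr std::string_view view(const char (&s)[N]) {
    return std::string_view(s, N - 1);
}

// "\33" "3" is the octal escape \33 followed by the character '3'. An octal
// escape consumes at most three digits, so "\0333" spells the same string.
inline bool octal_workaround_matches() {
    return view("\33" "3") == view("\0333");
}

// "\333" is a single character, so it differs from the concatenation.
inline bool octal_merged_differs() {
    return view("\333").size() == 1 && view("\33" "3") != view("\333");
}

// "\xA" "B" keeps '\xA' and 'B' distinct; "\xAB" would be one character.
inline bool hex_kept_distinct() {
    constexpr std::string_view s = view("\xA" "B");
    return s.size() == 2 && s[0] == '\xA' && s[1] == 'B' &&
           view("\xAB").size() == 1;
}

// Convenience wrapper so the checks can be invoked from any main().
inline void check_escape_examples() {
    assert(octal_workaround_matches());
    assert(octal_merged_differs());
    assert(hex_kept_distinct());
}
```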
>>>
>>> Richard Smith suggested a possible resolution at
>>> https://groups.google.com/a/isocpp.org/forum/#!msg/std-discussion/qYf6treuLmY/dljWwyawCwAJ:
>>>
>>>> Swapping phase 5 and 6 is certainly wrong. See the example in
>>>> [lex.string]p12:
>>>>
>>>> "[Example:
>>>> "\xA" "B"
>>>> contains the two characters '\xA' and 'B' after concatenation
>>>> (and not the single hexadecimal character '\xAB'). — end example]"
>>>>
>>>> If you concatenate and then interpret escape sequences, you
>>>> misinterpret escape sequences.
>>>>
>>>> Instead, I think we should remove phases 5 and 6 entirely, parse
>>>> one or more string-literal tokens as a string literal expression,
>>>> and only perform the translation from the contents of the string
>>>> literal tokens into characters in the execution character set as
>>>> part of specifying the semantics of a string literal expression.
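
The ordering concern in the quoted message can be made concrete with a toy model. This is only an illustrative sketch, not proposed wording and not how any compiler is known to work: the functions decode_token and decode_concatenation are hypothetical, and only plain characters and \x escapes are handled. It interprets escape sequences one token at a time and joins the results afterwards, so an escape sequence can never extend into the next token.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode the body of ONE string-literal token (the text
// between the quotes) into a sequence of values. Only plain characters and
// \x escapes are modelled; a real lexer would also diagnose "\x" with no
// digits, which this sketch silently maps to 0.
inline std::u32string decode_token(const std::string& body) {
    std::u32string out;
    std::size_t i = 0;
    while (i < body.size()) {
        if (body[i] == '\\' && i + 1 < body.size() && body[i + 1] == 'x') {
            i += 2;
            char32_t value = 0;
            // A hexadecimal escape runs over all following hex digits, but
            // never past the end of its own token.
            while (i < body.size() &&
                   std::isxdigit(static_cast<unsigned char>(body[i]))) {
                const unsigned char c = static_cast<unsigned char>(body[i++]);
                const int digit = std::isdigit(c) ? c - '0'
                                                  : std::tolower(c) - 'a' + 10;
                value = value * 16 + static_cast<char32_t>(digit);
            }
            out.push_back(value);
        } else {
            out.push_back(static_cast<unsigned char>(body[i++]));
        }
    }
    return out;
}

// Decode every token separately, then join the results. Encoding to a target
// character type would happen after this step and is not modelled here.
inline std::u32string decode_concatenation(const std::vector<std::string>& tokens) {
    std::u32string out;
    for (const std::string& token : tokens) {
        out += decode_token(token);
    }
    return out;
}

// Convenience wrapper: the two tokens \xA and B stay two values, whereas the
// single token \xAB is one value.
inline void check_decode_model() {
    assert(decode_concatenation({"\\xA", "B"}).size() == 2);
    assert(decode_token("\\xAB").size() == 1);
}
```

Joining the token spellings first and decoding afterwards would turn the tokens \xA and B into the single value 0xAB, which is the misinterpretation the quoted message warns about.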
>>>
>>> Tom.
>>>
>>>
>>> _______________________________________________
>>> Core mailing list
>>> Core_at_[hidden] <mailto:Core_at_[hidden]>
>>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
>>> Link to this post: http://lists.isocpp.org/core/2020/07/9537.php
>>
>>
>>
>> _______________________________________________
>> Core mailing list
>> Core_at_[hidden] <mailto:Core_at_[hidden]>
>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
>> Link to this post: http://lists.isocpp.org/core/2020/07/9569.php
>
>
>
>
> --
> William M. (Mike) Miller | Edison Design Group
> william.m.miller_at_[hidden] <mailto:william.m.miller_at_[hidden]>
>
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2020/08/9644.php



SG16 list run by sg16-owner@lists.isocpp.org