sg16: Re: [SG16] [isocpp-core] New core issue: [lex.phases]: The order of translation phases 5 and 6 contradict the [lex.string] specification for concatenation of string literals

From: William M. (Mike) Miller <william.m.miller_at_[hidden]>
Date: Tue, 11 Aug 2020 08:28:53 -0400

On Mon, Aug 10, 2020 at 10:40 PM Tom Honermann <tom_at_[hidden]> wrote:

> Mike, can you please provide an update on this request?
>

I'm hoping to get a new revision of the issues list out in time for the
CWG teleconference next Monday.

> Tom.
>
> On 7/13/20 10:21 PM, Tom Honermann via Core wrote:
>
> Mike, can you please confirm receipt of this issue request?
>
> Tom.
>
> On 7/2/20 5:22 PM, Tom Honermann via Core wrote:
>
> This issue was originally reported in
> https://lists.isocpp.org/core/2019/03/5770.php, but it looks like a core
> issue was never recorded. This request obsoletes the previous one.
>
> (Mike, could we please, pretty please with a cup of moonshine and a plate
> of chocolate truffles, get a new public facing publication of the active
> issues list? At present, we can't rely on https://wg21.link directing
> users to a paper with recently created issues or up to date status
> information)
>
> [lex.phases] <http://eel.is/c++draft/lex.phases> states that string
> literals are converted to the execution character set prior to being
> concatenated:
>
> 5. Each basic source character set member in a *character-literal* or a
> *string-literal*, as well as each escape sequence and
> *universal-character-name* in a *character-literal* or a non-raw string
> literal, is converted to the corresponding member of the execution
> character set ([lex.ccon] <http://eel.is/c++draft/lex.ccon>, [lex.string]
> <http://eel.is/c++draft/lex.string>); if there is no corresponding
> member, it is converted to an implementation-defined member other than the
> null (wide) character.
>
> 6. Adjacent string literal tokens are concatenated.
>
> However, [lex.string]p11 <http://eel.is/c++draft/lex.string#11> specifies
> that encoding conversion is dependent on the presence of *encoding-prefix*es
> on the set of string literals that will be concatenated:
>
> 11 In translation phase 6 ([lex.phases]
> <http://eel.is/c++draft/lex.phases>), adjacent *string-literal*s are
> concatenated. If both *string-literal*s have the same *encoding-prefix*,
> the resulting concatenated *string-literal* has that *encoding-prefix*.
> If one *string-literal* has no *encoding-prefix*, it is treated as a
> *string-literal* of the same *encoding-prefix* as the other operand. If
> a UTF-8 string literal token is adjacent to a wide string literal token,
> the program is ill-formed. Any other concatenations are
> conditionally-supported with implementation-defined behavior. [*Note:*
> This concatenation is an interpretation, not a conversion. Because the
> interpretation happens in translation phase 6 (after each character from a
> string-literal has been translated into a value from the appropriate
> character set), a *string-literal*'s initial rawness has no effect on the
> interpretation or well-formedness of the concatenation. — *end note* ]
> Table 11 has some examples of valid concatenations.
>
> Table 11 <http://eel.is/c++draft/lex.string#tab:lex.string.concat>:
> String literal concatenations [tab:lex.string.concat]
>
>   Source       Means
>   u"a" u"b"    u"ab"
>   u"a" "b"     u"ab"
>   "a" u"b"     u"ab"
>   U"a" U"b"    U"ab"
>   U"a" "b"     U"ab"
>   "a" U"b"     U"ab"
>   L"a" L"b"    L"ab"
>   L"a" "b"     L"ab"
>   "a" L"b"     L"ab"
>
>
> Characters in concatenated strings are kept distinct.
>
> [*Example:*
>
> "\xA" "B"
>
> contains the two characters '\xA' and 'B' after concatenation (and not the
> single hexadecimal character '\xAB'). — *end example* ]
>
> The intent expressed by the prose and the note seems to be that string
> literals with different *encoding-prefix*es be separately converted and
> then joined together, such that the resulting string literal potentially
> consists of code unit sequences corresponding to different character
> encodings. However, that conflicts with the intent expressed by the table
> that specifies that, for example, `u"a" "b"` means the same as `u"ab"`.
>
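
A minimal sketch of the reading expressed by the table, added for
illustration and not part of the original report. It assumes a C++14 or
later compiler; `same_units` is a hypothetical helper introduced only for
this example. Each check compares the code units of a mixed-prefix
concatenation against the single prefixed literal the table says it means.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: compares two arrays of code units element by element,
// including the terminating null.
template <typename CharT, std::size_t N, std::size_t M>
constexpr bool same_units(const CharT (&a)[N], const CharT (&b)[M]) {
    if (N != M) return false;
    for (std::size_t i = 0; i < N; ++i)
        if (a[i] != b[i]) return false;
    return true;
}

// An unprefixed literal adjacent to a prefixed one takes on that prefix.
static_assert(same_units(u"a" "b", u"ab"), "u\"a\" \"b\" should mean u\"ab\"");
static_assert(same_units("a" u"b", u"ab"), "\"a\" u\"b\" should mean u\"ab\"");
static_assert(same_units(U"a" "b", U"ab"), "U\"a\" \"b\" should mean U\"ab\"");
static_assert(same_units("a" L"b", L"ab"), "\"a\" L\"b\" should mean L\"ab\"");
```

These checks only use characters from the basic source character set, so
they are not expected to expose the divergence described next; that
requires a character whose encoding differs between the execution
character set and the prefixed encoding.
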
> There is implementation divergence. GCC and Clang implement the intent
> expressed in the table, but Visual C++ implements ... something else.
>
> The difference is illustrated at https://msvc.godbolt.org/z/Dcrgda using
> the code below:
>
> const char8_t* u8_1 = "" u8"\u0102";
> const char8_t* u8_2 = u8"" "\u0102";
> const char8_t* u8_3 = u8"" u8"\u0102";
>
> const char16_t* u16_1 = "" u"\u0102";
> const char16_t* u16_2 = u"" "\u0102";
> const char16_t* u16_3 = u"" u"\u0102";
>
> const char32_t* u32_1 = "" U"\u0102";
> const char32_t* u32_2 = U"" "\u0102";
> const char32_t* u32_3 = U"" U"\u0102";
>
> const wchar_t* w_1 = "" L"\u0102";
> const wchar_t* w_2 = L"" "\u0102";
> const wchar_t* w_3 = L"" L"\u0102";
>
> For each *encoding-prefix*, GCC and Clang produce the same code unit
> sequence for all three variants, encoded according to that
> *encoding-prefix*. Visual C++ produces the same code unit sequence for
> the *_1 and *_3 variants, but not for the *_2 variants. In the latter,
> the second string literal component is first encoded to the execution
> character set (substituting a '?' character if the character is not
> representable in that character set), and the resulting code unit
> sequence is then re-encoded (interpreted according to the current locale
> the compiler is operating in) using the encoding indicated by the
> *encoding-prefix* of the other string literal component. The double
> encoding behavior of the Visual C++ compiler, and in particular the use
> of the wrong source encoding for the second conversion, is presumably
> unintended.
>
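
One way to observe the code units directly, sketched here for
illustration using only the char16_t declarations from the test case
above. `hex_units` is a hypothetical helper, not something from the
original report. Printing its result for each variant shows whether a
given compiler treats the three forms alike.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: renders each code unit of a null-terminated
// char16_t string as four hex digits followed by a space.
std::string hex_units(const char16_t* s) {
    std::string out;
    char buf[8];
    for (; *s != 0; ++s) {
        std::snprintf(buf, sizeof buf, "%04x ", static_cast<unsigned>(*s));
        out += buf;
    }
    return out;
}

// The char16_t variants from the test case in the report.
const char16_t* u16_1 = "" u"\u0102";
const char16_t* u16_2 = u"" "\u0102";   // the form on which compilers diverge
const char16_t* u16_3 = u"" u"\u0102";
```

Per the report, comparing `hex_units(u16_2)` against `hex_units(u16_1)`
is where Visual C++ would be expected to differ from GCC and Clang.
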
> In addition to the above, translation phase 5 states that conversion is
> unconditionally to the execution character set, but that is obviously
> incorrect for string literals that have an *encoding-prefix* present.
>
> Swapping translation phases 5 and 6 is not a viable resolution because
> that would interfere with processing of escape sequences. `"\33" "3"` is
> not intended to be the same as `"\333"` and separate processing of string
> literal components is necessary in some cases involving
> *hexadecimal-escape-sequence*. While the *octal-escape-sequence* case
> can be worked around with `"\0333"`, the string intended by `"\xab" "c"`
> cannot be represented as a single string literal.
>
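
The escape sequence cases above can be checked at compile time. This is
an illustrative sketch, not part of the original report; the variable
names are made up for the example. The sizes include the terminating
null character.

```cpp
#include <cassert>

// Escape sequences are interpreted per literal, before concatenation.
constexpr char two_literals[] = "\33" "3";  // '\33' followed by '3'
constexpr char one_escape[]   = "\333";     // a single octal escape
// An octal escape takes at most three digits, so this is '\033' then '3'.
constexpr char workaround[]   = "\0333";
// A hexadecimal escape has no digit limit, so "\xabc" would be one
// (out of range) escape; splitting the literal is the only spelling.
constexpr char hex_split[]    = "\xab" "c";

static_assert(sizeof(two_literals) == 3, "two characters plus null");
static_assert(sizeof(one_escape) == 2, "one character plus null");
static_assert(sizeof(workaround) == 3, "two characters plus null");
static_assert(workaround[0] == two_literals[0] &&
              workaround[1] == two_literals[1],
              "\"\\0333\" matches \"\\33\" \"3\"");
static_assert(sizeof(hex_split) == 3 && hex_split[1] == 'c',
              "'\\xab' followed by 'c'");
```
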
> Richard Smith suggested a possible resolution at
> https://groups.google.com/a/isocpp.org/forum/#!msg/std-discussion/qYf6treuLmY/dljWwyawCwAJ
> :
>
> Swapping phase 5 and 6 is certainly wrong. See the example in
> [lex.string]p12:
>
> "[Example:
> "\xA" "B"
> contains the two characters ’\xA’ and ’B’ after concatenation (and not
> the single hexadecimal character
> ’\xAB’). — end example]
>
> If you concatenate and then interpret escape sequences, you
> misinterpret escape sequences.
>
> Instead, I think we should remove phases 5 and 6 entirely, parse one
> or more string-literal tokens as a string literal expression, and only
> perform the translation from the contents of the string literal tokens
> into characters in the execution character set as part of specifying
> the semantics of a string literal expression.
>
> Tom.
>
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2020/07/9537.php
>
>
>
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2020/07/9569.php
>
>
>

-- 
William M. (Mike) Miller | Edison Design Group
william.m.miller_at_[hidden]

Received on 2020-08-11 07:32:29