sg16: Re: [SG16] Concatenating unicode string literals

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 9 Jul 2020 12:28:22 -0400

On 7/8/20 3:15 PM, Jens Maurer wrote:
> On 08/07/2020 13.09, Alisdair Meredith via SG16 wrote:
>> After taking another look over P2029 resolving a few core issues,
>> I am further concerned by [lex.string]p11, which states (among
>> other things) that concatenation of unicode string literals with
>> different encoding-prefixes is conditionally supported with
>> implementation-defined behavior. That seems a little too free for
>> my tastes.
>>
>> I can buy conditionally supported, although I see no harm in
>> requiring it for any combination of unicode encoding prefixes.
>> I am concerned about the implementation-defined behavior:
>> the end result should be the result of concatenating the
>> transcoded representation of each of the strings into a common
>> encoding, corresponding to one of the involved encoding
>> prefixes.
> That's not how it works. You first pick a common
> encoding-prefix for the concatenation (whatever it is),
> and then you encode the entire (concatenated) string
> using that encoding-prefix.

I think Jens' description matches the intent expressed in
[lex.string]p11 <http://eel.is/c++draft/lex.string#11>, though I have
long struggled with the intent of the note there:

> [ /Note:/ This concatenation is an interpretation, not a conversion.
Because the interpretation happens in translation phase 6 (after each
character from a string-literal has been translated into a value from
the appropriate character set), a string-literal's initial rawness has
no effect on the interpretation or well-formedness of the concatenation.
— /end note /]
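
To make the note concrete, here is a minimal sketch of the concatenation cases that are not in dispute. The variable names are made up for illustration. The commented-out line only shows the kind of mixed-prefix concatenation at issue in this thread; it is an assumed example, not one of the linked test cases.

```cpp
#include <string>

// An unprefixed literal takes on the encoding-prefix of its neighbour,
// so this is the same as writing u"abcd".
const std::u16string prefixed_plus_plain = u"ab" "cd";

// Rawness is a property of each literal, not of the concatenation:
// the raw part contributes a backslash and an 'n', the ordinary part
// contributes a single newline character.
const std::string raw_plus_plain = R"(\n)" "\n";

// Two different Unicode encoding-prefixes: conditionally supported with
// implementation-defined behavior per [lex.string]p11. Left commented out
// because the thread reports that compilers reject such cases.
// const auto mixed = u8"a" u"b";
```

Here `prefixed_plus_plain` holds four UTF-16 code units and `raw_plus_plain` holds three characters.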

However, Alisdair's description appears to match the Visual C++
implementation as previously discussed in
https://lists.isocpp.org/sg16/2020/07/1699.php and as exhibited at
https://msvc.godbolt.org/z/7KcMs5 (including for wide literals, and
including the same bug where the wrong encoding is used for the second
conversion).

In cases where character conversion is non-lossy through the various
encodings, the difference is unobservable.

I prefer the design that Jens described since it avoids the additional
conversions.
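
The two models can be sketched outside the compiler as follows. This is only an illustration under stated assumptions: source text is modelled as code points in a `std::u32string`, the narrow encoding is a hypothetical ASCII-only one that replaces anything else with `?`, and all function names are invented for this example. It does not reproduce any compiler's actual behavior.

```cpp
#include <string>

// Hypothetical lossy narrow encoding: ASCII only, anything else becomes '?'.
std::string to_narrow(const std::u32string& src) {
    std::string out;
    for (char32_t c : src) {
        out += (c < 0x80) ? static_cast<char>(c) : '?';
    }
    return out;
}

// Hypothetical conversion from the narrow encoding to the wide one.
std::u32string narrow_to_wide(const std::string& s) {
    std::u32string out;
    for (unsigned char c : s) {
        out += static_cast<char32_t>(c);
    }
    return out;
}

// Model described by Alisdair: encode each literal according to its own
// prefix, convert to the common encoding, then concatenate.
std::u32string convert_then_concat(const std::u32string& narrow_lit,
                                   const std::u32string& wide_lit) {
    return narrow_to_wide(to_narrow(narrow_lit)) + wide_lit;
}

// Model described by Jens: pick the common prefix first, then encode the
// whole concatenated string once (here the wide encoding is lossless).
std::u32string concat_then_encode(const std::u32string& narrow_lit,
                                  const std::u32string& wide_lit) {
    return narrow_lit + wide_lit;
}
```

For pure ASCII input both functions return the same string. For a narrow literal containing U+00E9, the first model yields `?` in that position while the second preserves U+00E9, which is the only situation in which the difference becomes observable.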

>
>> I am happy to defer to implementations to choose
>> between UTF8/16/32, or we could define a canonical preferred
>> ordering among those choices.
> Since all four well-known C++ implementations appear to
> produce an error for the test cases at
> https://compiler-explorer.com/z/4NDo-4
> I'm fine with specifying these as ill-formed.

I'm fine with that as well.

Jens, would you consider such a change as evolutionary given that we
don't know of any implementations (so far) that actually support these
concatenations? Would it be reasonable to take this issue straight to
core (with JF's blessing of course)? The only arguments I can see
against making this change are 1) it is not a great use of our time to
excise a weird conditionally-supported feature that is not implemented
anywhere, and 2) it adds drift from C.

JeanHeyd has already reached out to WG14 to ask for their input on
making these ill-formed.

>
> There is no (technical) need to support these cases,
> and nobody has written code like that (because
> no compiler accepts it), so let's nix it.
>
> From a procedural standpoint, P2029 produces enough
> churn in the general area that I'd like to see P2029
> hit the working draft before future papers in that
> area are processed.

Me too.

Tom.

Received on 2020-07-09 11:31:41