sg16: Re: [SG16] Concatenating unicode string literals

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 8 Jul 2020 12:07:45 -0400

On 7/8/20 8:45 AM, Corentin Jabot wrote:
>
>
> On Wed, 8 Jul 2020 at 13:56, Alisdair Meredith <alisdairm_at_[hidden]> wrote:
>
> Well, u8"" L"" is portably ill-formed, and should be diagnosed :)
>
> The corner cases that are conditionally supported with an
> implementation defined value are:
>
> u8"" u""
> u8"" U""
> u"" u8""
> u"" U""
> u"" L""
> U"" u8""
> U"" u""
> U"" L""
> L"" u""
> L"" U""
>
>
> Let's entertain that this should, for some reason, not be ill-formed.
> What should it do?
> Can you find a set of rules that people will not trip over?
> Is it like rock paper scissors but with u8, u and U instead?

I can sort-of kind-of see a use case for allowing one of u"" or U"" to
be concatenated with L"" as a conditionally-supported feature when the
wide execution encoding is a match (UTF-16 for u"" or UTF-32 for U"").
This might be useful in strange situations where string concatenation is
desired and one of the components is provided by a macro expansion. The
question then is what the type of the string literal is. The only model
I can see working there is to adopt the type from the first component
and ignore the encoding-prefix from the remaining ones.
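A minimal sketch of the macro scenario described above, assuming a C++17 compiler. The names `GREETING_SUFFIX` and `length_of` are hypothetical and introduced only for illustration. It checks only the combinations that are portably well-formed (same prefix, or a prefix on just one component). The mixed `u"" L""` form is left in a comment because, per this thread, implementations generally do not support it.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical macro standing in for a string component supplied elsewhere.
#define GREETING_SUFFIX u"world"

// Number of code units in a string literal, excluding the terminating null.
template <class CharT, std::size_t N>
constexpr std::size_t length_of(const CharT (&)[N]) { return N - 1; }

// Same encoding-prefix on both components: the result is an ordinary
// UTF-16 string literal, identical in type to the pre-joined spelling.
static_assert(std::is_same_v<decltype(u"hello " GREETING_SUFFIX),
                             decltype(u"hello world")>);

// A prefix on only one component applies to the whole concatenation,
// whether it appears first or last.
static_assert(std::is_same_v<decltype(u"hello " "world"),
                             decltype(u"hello world")>);
static_assert(std::is_same_v<decltype("hello " u"world"),
                             decltype(u"hello world")>);

// The case under discussion, commented out because it is not portable:
// #define WIDE_SUFFIX L"world"
// auto& mixed = u"hello " WIDE_SUFFIX;  // type of the result is the open question
```

Under the "adopt the type from the first component" model, the commented-out line would have the same type as `u"hello world"`; that is a description of the proposed model, not of any existing compiler's behavior.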

I would not be heartbroken over breaking code that does this.

>
> I understand that we can make _anything_ well-formed or
> conditionally supported.
> We should ask ourselves why?
> Is there a reason to allow that, which surpasses the cost of the added
> complexity?
> We can ( and I believe we should) make that ill-formed.
> (Implementations sensibly do not support these things anyway)

Can you summarize what implementations do today? I haven't researched.

>
>
> I would like all of the combinations that do not involve L"" to have a
> well-defined value, whether or not we specify the encoding.
>
Well-defined or perhaps ill-formed?

This is another case where I think the feature makes little or no sense,
but, unless shown otherwise, doesn't cause problems in practice and
should therefore be treated as a low priority issue relative to other
things we could be working on.
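To make concrete why the choice of encoding matters for the u8/u/U combinations, here is a small sketch, assuming a C++17 compiler. The helper name `code_units` is hypothetical. The same character occupies a different number of code units in each Unicode encoding, so the value of a mixed concatenation depends on which prefix is taken to win.

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// U+00E9, written as a universal-character-name so the example does not
// depend on the source file encoding.
static_assert(code_units(u8"\u00E9") == 2);  // UTF-8: two code units
static_assert(code_units(u"\u00E9") == 1);   // UTF-16: one code unit
static_assert(code_units(U"\u00E9") == 1);   // UTF-32: one code unit

// U+1F600, outside the Basic Multilingual Plane.
static_assert(code_units(u8"\U0001F600") == 4);  // UTF-8: four code units
static_assert(code_units(u"\U0001F600") == 2);   // UTF-16: surrogate pair
static_assert(code_units(U"\U0001F600") == 1);   // UTF-32: one code unit
```

A hypothetical `u8"\U0001F600" U"\U0001F600"` would therefore hold 8, 4, or 2 code units depending on whether the result is treated as UTF-8, UTF-16, or UTF-32.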

>
> I do not care to further constrain the L"" forms at this point.
>
+1.

Tom.

>
> AlisdairM
>
>
>> On Jul 8, 2020, at 12:35, Corentin Jabot <corentinjabot_at_[hidden]> wrote:
>>
>>
>>
>> On Wed, 8 Jul 2020 at 13:09, Alisdair Meredith via SG16
>> <sg16_at_[hidden]> wrote:
>>
>> After taking another look over P2029 resolving a few core issues,
>> I am further concerned by [lex.string]p11, which states (among
>> other things) that concatenation of unicode string literals with
>> different encoding-prefixes is conditionally supported with
>> implementation-defined behavior. That seems a little too free for
>> my tastes.
>>
>>
>> +1
>>
>> I can buy conditionally supported, although see no harm in
>> requiring it for any combination of unicode encoding prefixes.
>> I am concerned about the implementation-defined behavior:
>> the end result should be the result of concatenating the
>> transcoded representation of each of the strings into a common
>> encoding, corresponding to one of the involved encoding
>> prefixes. I am happy to defer to implementations to choose
>> between UTF-8/16/32, or we could define a canonical preferred
>> ordering among those choices.
>>
>> Does this seem worth calling out (yet another SG16 paper) or
>> better left alone, as we already have way too much busy work
>> on this group's plate, and implementations will most likely do the
>> right thing anyway?
>>
>>
>> It's called out in P2178
>>
>> From a user perspective, we allow too many things for any mental
>> model to work.
>>
>> u8"" "" and "" u8"" are both utf-8 strings
>>
>> What is
>>
>> u8"" L""
>>
>> supposed to be? UTF-8? Wide? How can I tell? And it's not
>> portable?
>>
>> There are only 2 models that make sense imo
>> Either
>> - Only the first string can have a prefix
>> - Only one of the strings can have a prefix
>>
>> Note that, despite the terribly misleading wording, the strings
>> that are concatenated do *not* have an associated encoding before
>> concatenation;
>> the compiler chooses a prefix after concatenation (this is yet
>> another thing needing fixing, I think Tom started a thread a few
>> days back).
>>
>> u8"" L"" is not utf-8 + wide (what would that mean?) but a
>> string interpreted as either utf-8 or wide depending on the whims
>> of the implementation.
>>
>> I think it needs fixing indeed :)
>>
>>
>> (I am not overly concerned about specifying concatenation for
>> narrow/wide string literals with unicode string literals, which
>> can remain conditionally supported with implementation-defined
>> values.)
>>
>> AlisdairM
>>
>

Received on 2020-07-08 11:11:02