SG16

Subject: Re: Concatenating unicode string literals
From: Tom Honermann (tom_at_[hidden])
Date: 2020-07-08 11:54:47


On 7/8/20 12:22 PM, Corentin Jabot via SG16 wrote:
>
>
> On Wed, 8 Jul 2020 at 18:07, Tom Honermann <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
>
> On 7/8/20 8:45 AM, Corentin Jabot wrote:
>>
>>
>> On Wed, 8 Jul 2020 at 13:56, Alisdair Meredith <alisdairm_at_[hidden]
>> <mailto:alisdairm_at_[hidden]>> wrote:
>>
>> Well, u8"" L"" is portably ill-formed, and should be
>> diagnosed :)
>>
>> The corner cases that are conditionally supported with an
>> implementation defined value are:
>>
>>   u8"" u""
>>   u8"" U""
>>   u"" u8""
>>   u"" U""
>>   u"" L""
>>   U"" u8""
>>   U"" u""
>>   U"" L""
>>   L"" u""
>>   L"" U""
>>
>>
>> Let's entertain that this should, for some reason, not be ill-formed.
>> What should it do?
>> Can you find a set of rules that people will not trip over?
>> Is it like rock paper scissors but with u8, u and U instead?
>
> I can sort-of kind-of see a use case for allowing one of u"" or
> U"" to be concatenated with L"" as a conditionally-supported
> feature when the wide execution encoding is a match (UTF-16 for
> u"" or UTF-32 for U""). This might be useful in strange situations
> where string concatenation is desired and one of the components is
> provided by a macro expansion.  The question then is what the type
> of the string literal is.  The only model I can see working there
> is to adopt the type from the first component and ignore the
> encoding-prefix from the remaining ones.
>
> I would not be heartbroken over breaking code that does this.
>
>>
>> I understand that we can make _anything_ well-formed or
>> conditionally supported.
>> We should ask ourselves why?
>> Is there a reason to allow that, which surpasses the cost of the
>> added complexity?
>> We can ( and I believe we should) make that ill-formed.
>> (Implementations sensibly do not support these things anyway)
>
> Can you summarize what implementations do today?  I haven't
> researched.
>
> They do not support the combinations not mandated by the standard

Which implementations did you check?  Clang, gcc, MSVC, and icc?

Tom.
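
One way to answer the question above for a given compiler is a small probe file. The following is a minimal sketch, not something posted in this thread: the helper names concat_utf16 and self_check are hypothetical, and the commented-out lines are the mixed-prefix cases discussed here, which may or may not compile depending on the implementation. The static_asserts cover only the combinations the standard requires (one prefix, or the same prefix on both literals).

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Only one literal carries an encoding-prefix: the concatenated literal
// takes that prefix, whichever position the prefix appears in.
static_assert(std::is_same<decltype(u"a" "b"), const char16_t (&)[3]>::value,
              "prefix on the first literal applies to the whole string");
static_assert(std::is_same<decltype("a" u"b"), const char16_t (&)[3]>::value,
              "prefix on the second literal applies to the whole string");

// Same prefix on both literals: always well-formed.
static_assert(std::is_same<decltype(U"a" U"b"), const char32_t (&)[3]>::value,
              "matching prefixes keep their type");
static_assert(std::is_same<decltype(L"a" "b"), const wchar_t (&)[3]>::value,
              "wide prefix behaves the same way");

// Hypothetical helper: returns the value of a required concatenation.
inline std::u16string concat_utf16() { return u"a" "b"; }

// Hypothetical helper: runtime check of the concatenated value.
inline void self_check() { assert(concat_utf16() == u"ab"); }

// Mixed prefixes, the subject of this thread. Uncomment one line at a time
// to see whether a particular implementation accepts it, and with what type.
// auto& m1 = u8"a" u"b";
// auto& m2 = u"a" U"b";
// auto& m3 = u"a" L"b";
// auto& m4 = u8"a" L"b";   // the case described above as ill-formed
```

Compiling the file unchanged exercises only the required behavior; each uncommented line then shows what that one compiler does with a mixed pair.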

>>
>>
>> I would like all of the combinations that do not involve L""
>> to have a
>> well defined value, whether or not we specify the encoding.
>>
> Well-defined or perhaps ill-formed?
>
> This is another case where I think the feature makes little or no
> sense, but, unless shown otherwise, doesn't cause problems in
> practice and should therefore be treated as a low priority issue
> relative to other things we could be working on.
>
>>
>> I do not care to further constrain the L"" forms at this point.
>>
> +1.
>
> Tom.
>
>>
>> AlisdairM
>>
>>
>>> On Jul 8, 2020, at 12:35, Corentin Jabot
>>> <corentinjabot_at_[hidden] <mailto:corentinjabot_at_[hidden]>>
>>> wrote:
>>>
>>>
>>>
>>> On Wed, 8 Jul 2020 at 13:09, Alisdair Meredith via SG16
>>> <sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
>>>
>>> After taking another look over P2029 resolving a few
>>> core issues,
>>> I am further concerned by [lex.string]p11, which states
>>> (among
>>> other things) that concatenation of unicode string
>>> literals with
>>> different encoding-prefixes is conditionally supported with
>>> implementation-defined behavior. That seems a little too
>>> free for
>>> my tastes.
>>>
>>>
>>> +1
>>>
>>> I can buy conditionally supported, although see no harm in
>>> requiring it for any combination of unicode encoding
>>> prefixes.
>>> I am concerned about the implementation-defined behavior:
>>> the end result should be the result of concatenating the
>>> transcoded representation of each of the strings into a
>>> common
>>> encoding, corresponding to one of the involved encoding
>>> prefixes.  I am happy to defer to implementations to choose
>>> between UTF8/16/32, or we could define a canonical preferred
>>> ordering among those choices.
>>>
>>> Does this seem worth calling out (yet another SG16 paper) or
>>> better left alone, as we already have way too much busy work
>>> on this group's plate, and implementations will most
>>> likely do the
>>> right thing anyway?
>>>
>>>
>>> It's called out in P2178
>>>
>>> From a user perspective, we allow too many things for any
>>> mental model to work.
>>>
>>> u8"" "" and "" u8"" are both utf-8 strings
>>>
>>> What is
>>>
>>> u8"" L""
>>>
>>> supposed to be? UTF-8? Wide? How can I tell? And it's
>>> not portable?
>>>
>>> There are only 2 models that make sense imo
>>> Either
>>> - Only the first string can have a prefix
>>> - Only one of the strings can have a prefix
>>>
>>> Note that, despite the terribly misleading wording, the
>>> strings that are concatenated do *not* have an associated
>>> encoding before concatenation;
>>> the compiler chooses a prefix after concatenation (this is
>>> yet another thing needing fixing; I think Tom started a
>>> thread a few days ago).
>>>
>>> u8"" L""  is not utf-8 + wide (what would that mean?) but a
>>> string interpreted as either utf-8 or wide depending on
>>> the whims of the implementation.
>>>
>>> I think it needs fixing indeed :)
>>>
>>>
>>> (I am not overly concerned about specifying
>>> concatenation for
>>> narrow/wide string literals with unicode string
>>> literals, which
>>> can remain conditionally supported with
>>> implementation-defined
>>> values.)
>>>
>>> AlisdairM
>>> --
>>> SG16 mailing list
>>> SG16_at_[hidden] <mailto:SG16_at_[hidden]>
>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>
>>
>
>



SG16 list run by sg16-owner@lists.isocpp.org