SG16

Subject: Re: Concatenating unicode string literals
From: Corentin Jabot (corentinjabot_at_[hidden])
Date: 2020-07-08 12:50:37


On Wed, 8 Jul 2020 at 18:54, Tom Honermann <tom_at_[hidden]> wrote:

> On 7/8/20 12:22 PM, Corentin Jabot via SG16 wrote:
>
>
>
> On Wed, 8 Jul 2020 at 18:07, Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 7/8/20 8:45 AM, Corentin Jabot wrote:
>>
>>
>>
>> On Wed, 8 Jul 2020 at 13:56, Alisdair Meredith <alisdairm_at_[hidden]> wrote:
>>
>>> Well, u8"" L"" is portably ill-formed, and should be diagnosed :)
>>>
>>> The corner cases that are conditionally supported with an
>>> implementation defined value are:
>>>
>>> u8"" u""
>>> u8"" U""
>>> u"" u8""
>>> u"" U""
>>> u"" L""
>>> U"" u8""
>>> U"" u""
>>> U"" L""
>>> L"" u""
>>> L"" U""
>>>
>>
>> Let's entertain that this should, for some reason, not be ill-formed.
>> What should it do?
>> Can you find a set of rules that people will not trip over?
>> Is it like rock paper scissors but with u8, u and U instead?
>>
>> I can sort-of kind-of see a use case for allowing one of u"" or U"" to be
>> concatenated with L"" as a conditionally-supported feature when the wide
>> execution encoding is a match (UTF-16 for u"" or UTF-32 for U""). This
>> might be useful in strange situations where string concatenation is desired
>> and one of the components is provided by a macro expansion. The question
>> then is what the type of the string literal is. The only model I can see
>> working there is to adopt the type from the first component and ignore the
>> encoding-prefix from the remaining ones.
>>
>> I would not be heartbroken over breaking code that does this.
>>
>>
>> I understand that we can make _anything_ well-formed or
>> conditionally supported.
>> We should ask ourselves why?
>> Is there a reason to allow that, which surpasses the cost of the added
>> complexity?
>> We can (and I believe we should) make that ill-formed.
>> (Implementations sensibly do not support these things anyway)
>>
>> Can you summarize what implementations do today? I haven't researched.
>>
> They do not support the combinations not mandated by the standard
>
> Which implementations did you check? Clang, gcc, MSVC, and icc?
>

yes https://compiler-explorer.com/z/4NDo-4
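
A minimal sketch of what that means in practice (an illustration, not the
contents of the link above). It assumes a C++11 or later compiler;
`literal_length` is a hypothetical helper introduced only for this example.
The concatenations the standard mandates take the single encoding-prefix that
is present, while the mixed-prefix forms are left commented out because,
as reported in this thread, the implementations checked do not accept them.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <class CharT, std::size_t N>
constexpr std::size_t literal_length(const CharT (&)[N]) {
    return N - 1;
}

// One prefixed literal next to an unprefixed one: the result has the
// same type as a single literal written with that prefix.
static_assert(std::is_same<decltype(u8"a" "b"), decltype(u8"ab")>::value,
              "u8 followed by unprefixed behaves like u8");
static_assert(std::is_same<decltype("a" u8"b"), decltype(u8"ab")>::value,
              "unprefixed followed by u8 behaves like u8");
static_assert(std::is_same<decltype(L"a" "b"), decltype(L"ab")>::value,
              "L followed by unprefixed behaves like L");
static_assert(std::is_same<decltype(u"a" "b"), decltype(u"ab")>::value,
              "u followed by unprefixed behaves like u");
static_assert(std::is_same<decltype(U"a" "b"), decltype(U"ab")>::value,
              "U followed by unprefixed behaves like U");

// Mixed encoding-prefixes: conditionally supported at best, and per this
// thread not accepted by the implementations that were checked.
// auto m1 = u8"a" L"b";
// auto m2 = u"a" U"b";
// auto m3 = L"a" u"b";
```

Comparing against `decltype(u8"ab")` rather than naming a character type keeps
the checks valid whether u8 literals are `char` or `char8_t` based.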

> Tom.
>
>
>
>>
>>
>>> I would like all of the combinations that do not involve L"" to have a
>>> well defined value, whether or not we specify the encoding.
>>>
>> Well-defined or perhaps ill-formed?
>>
>> This is another case where I think the feature makes little or no sense,
>> but, unless shown otherwise, doesn't cause problems in practice and should
>> therefore be treated as a low priority issue relative to other things we
>> could be working on.
>>
>>
>>> I do not care to further constrain the L"" forms at this point.
>>>
>> +1.
>>
>> Tom.
>>
>>
>>> AlisdairM
>>>
>>>
>>> On Jul 8, 2020, at 12:35, Corentin Jabot <corentinjabot_at_[hidden]>
>>> wrote:
>>>
>>>
>>>
>>> On Wed, 8 Jul 2020 at 13:09, Alisdair Meredith via SG16 <
>>> sg16_at_[hidden]> wrote:
>>>
>>>> After taking another look over P2029 resolving a few core issues,
>>>> I am further concerned by [lex.string]p11, which states (among
>>>> other things) that concatenation of unicode string literals with
>>>> different encoding-prefixes is conditionally supported with
>>>> implementation-defined behavior. That seems a little too free for
>>>> my tastes.
>>>>
>>>
>>> +1
>>>
>>>
>>>> I can buy conditionally supported, although see no harm in
>>>> requiring it for any combination of unicode encoding prefixes.
>>>> I am concerned about the implementation-defined behavior:
>>>> the end result should be the result of concatenating the
>>>> transcoded representation of each of the strings into a common
>>>> encoding, corresponding to one of the involved encoding
>>>> prefixes. I am happy to defer to implementations to choose
>>>> between UTF8/16/32, or we could define a canonical preferred
>>>> ordering among those choices.
>>>>
>>>> Does this seem worth calling out (yet another SG16 paper) or
>>>> better left alone, as we already have way too much busy work
>>>> on this group's plate, and implementations will most likely do the
>>>> right thing anyway?
>>>>
>>>
>>> It's called out in P2178
>>>
>>> From a user perspective, we allow too many things for any mental model to
>>> work.
>>>
>>> u8"" "" and "" u8"" are both utf-8 strings
>>>
>>> What is
>>>
>>> u8"" L""
>>>
>>> supposed to be? UTF-8? Wide? How can I tell? And it's not portable?
>>>
>>> There are only 2 models that make sense imo
>>> Either
>>> - Only the first string can have a prefix
>>> - Only one of the strings can have a prefix
>>>
>>> Note that, despite the terribly misleading wording, the strings that are
>>> concatenated do *not* have an associated encoding before concatenation;
>>> the compiler chooses a prefix after concatenation (this is yet another
>>> thing needing fixing; I think Tom started a thread a few days back).
>>>
>>> u8"" L"" is not utf-8 + wide (what would that mean?) but a string
>>> interpreted as either utf-8 or wide depending on the whims of the
>>> implementation.
>>>
>>> I think it needs fixing indeed :)
>>>
>>>
>>>>
>>>> (I am not overly concerned about specifying concatenation for
>>>> narrow/wide string literals with unicode string literals, which
>>>> can remain conditionally supported with implementation-defined
>>>> values.)
>>>>
>>>> AlisdairM
>>>> --
>>>> SG16 mailing list
>>>> SG16_at_[hidden]
>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>
>>>
>>>
>>
>
>


