SG16

Subject: Re: Concatenating unicode string literals
From: Corentin Jabot (corentinjabot_at_[hidden])
Date: 2020-07-08 11:22:53


On Wed, 8 Jul 2020 at 18:07, Tom Honermann <tom_at_[hidden]> wrote:

> On 7/8/20 8:45 AM, Corentin Jabot wrote:
>
>
>
> On Wed, 8 Jul 2020 at 13:56, Alisdair Meredith <alisdairm_at_[hidden]> wrote:
>
>> Well, u8"" L"" is portably ill-formed, and should be diagnosed :)
>>
>> The corner cases that are conditionally supported with an
>> implementation defined value are:
>>
>> u8"" u""
>> u8"" U""
>> u"" u8""
>> u"" U""
>> u"" L""
>> U"" u8""
>> U"" u""
>> U"" L""
>> L"" u""
>> L"" U""
>>
>
> Let's entertain that this should, for some reason, not be ill-formed.
> What should it do?
> Can you find a set of rules that people will not trip over?
> Is it like rock paper scissors but with u8, u and U instead?
>
> I can sort-of kind-of see a use case for allowing one of u"" or U"" to be
> concatenated with L"" as a conditionally-supported feature when the wide
> execution encoding is a match (UTF-16 for u"" or UTF-32 for U""). This
> might be useful in strange situations where string concatenation is desired
> and one of the components is provided by a macro expansion. The question
> then is what the type of the string literal is. The only model I can see
> working there is to adopt the type from the first component and ignore the
> encoding-prefix from the remaining ones.
>
> I would not be heart broken over breaking code that does this.
>
>
> I understand that we can make _anything_ well-formed or
> conditionally supported.
> We should ask ourselves why?
> Is there a reason to allow that, which surpasses the cost of the added
> complexity?
> We can (and I believe we should) make that ill-formed.
> (Implementations sensibly do not support these things anyway)
>
> Can you summarize what implementations do today? I haven't researched.
>
They do not support the combinations that the standard does not mandate.
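
For concreteness, here is a minimal sketch of the combinations that are mandated, under the [lex.string] wording discussed in this thread: identical prefixes, or one prefixed literal next to an unprefixed one (in either order). The helper names code_units and demo are made up for this illustration and are not part of any proposal; the commented-out lines are the cases in dispute and are expected to be rejected by compilers that do not support them.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// Same prefix on both sides: the result keeps that prefix.
static_assert(std::is_same<decltype(u"a" u"b"), decltype(u"ab")>::value,
              "u + u has the same type as a single u literal");

// Exactly one side has a prefix: the result takes that prefix,
// whichever side it appears on.
static_assert(std::is_same<decltype(u8"a" "b"), decltype(u8"ab")>::value,
              "u8 + unprefixed behaves like u8");
static_assert(std::is_same<decltype("a" u8"b"), decltype(u8"ab")>::value,
              "unprefixed + u8 behaves like u8");
static_assert(std::is_same<decltype(L"a" "b"), decltype(L"ab")>::value,
              "L + unprefixed behaves like L");
static_assert(std::is_same<decltype("a" U"b"), decltype(U"ab")>::value,
              "unprefixed + U behaves like U");

// The disputed cases, left commented out on purpose:
// auto x = u8"a" L"b";  // ill-formed
// auto y = u"a" U"b";   // conditionally-supported, implementation-defined

// Runtime spot check that can be called from any main().
inline void demo() {
    assert(code_units(u8"a" "b") == 2);
    assert(code_units("a" U"b") == 2);
}
```

Comparing against decltype of a single literal, rather than naming char or char8_t directly, keeps the sketch independent of whether u8"" yields char (C++17) or char8_t (C++20).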

>
>
>> I would like all of the combinations that do not involve L"" to have a
>> well defined value, whether or not we specify the encoding.
>>
> Well-defined or perhaps ill-formed?
>
> This is another case where I think the feature makes little or no sense,
> but, unless shown otherwise, doesn't cause problems in practice and should
> therefore be treated as a low priority issue relative to other things we
> could be working on.
>
>
>> I do not care to further constrain the L"" forms at this point.
>>
> +1.
>
> Tom.
>
>
>> AlisdairM
>>
>>
>> On Jul 8, 2020, at 12:35, Corentin Jabot <corentinjabot_at_[hidden]> wrote:
>>
>>
>>
>> On Wed, 8 Jul 2020 at 13:09, Alisdair Meredith via SG16 <
>> sg16_at_[hidden]> wrote:
>>
>>> After taking another look over P2029 resolving a few core issues,
>>> I am further concerned by [lex.string]p11, which states (among
>>> other things) that concatenation of unicode string literals with
>>> different encoding-prefixes is conditionally supported with
>>> implementation-defined behavior. That seems a little too free for
>>> my tastes.
>>>
>>
>> +1
>>
>>
>>> I can buy conditionally supported, although see no harm in
>>> requiring it for any combination of unicode encoding prefixes.
>>> I am concerned about the implementation-defined behavior:
>>> the end result should be the result of concatenating the
>>> transcoded representation of each of the strings into a common
>>> encoding, corresponding to one of the involved encoding
>>> prefixes. I am happy to defer to implementations to choose
>>> between UTF-8/16/32, or we could define a canonical preferred
>>> ordering among those choices.
>>>
>>> Does this seem worth calling out (yet another SG16 paper) or
>>> better left alone, as we already have way too much busy work
>>> on this group's plate, and implementations will most likely do the
>>> right thing anyway?
>>>
>>
>> It's called out in P2178
>>
>> From a user perspective, we allow too many things for any mental model to
>> work.
>>
>> u8"" "" and "" u8"" are both utf-8 strings
>>
>> What is
>>
>> u8"" L""
>>
>> supposed to be? UTF-8? Wide? How can I tell? And it's not portable.
>>
>> There are only 2 models that make sense imo
>> Either
>> - Only the first string can have a prefix
>> - Only one of the strings can have a prefix
>>
>> Note that, despite the terribly misleading wording, the strings that are
>> concatenated do *not* have an associated encoding before concatenation;
>> the compiler chooses a prefix after concatenation (this is yet another
>> thing needing fixing; I think Tom started a thread a few days ago).
>>
>> u8"" L"" is not utf-8 + wide (what would that mean?) but a string
>> interpreted as either utf-8 or wide depending on the whims of the
>> implementation.
>>
>> I think it needs fixing indeed :)
>>
>>
>>>
>>> (I am not overly concerned about specifying concatenation for
>>> narrow/wide string literals with unicode string literals, which
>>> can remain conditionally supported with implementation-defined
>>> values.)
>>>
>>> AlisdairM
>>> --
>>> SG16 mailing list
>>> SG16_at_[hidden]
>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>>
>>
>



SG16 list run by sg16-owner@lists.isocpp.org