SG16

Subject: Re: Concatenating unicode string literals
From: Corentin Jabot (corentinjabot_at_[hidden])
Date: 2020-07-08 07:45:27


On Wed, 8 Jul 2020 at 13:56, Alisdair Meredith <alisdairm_at_[hidden]> wrote:

> Well, u8"" L"" is portably ill-formed, and should be diagnosed :)
>
> The corner cases that are conditionally supported with an
> implementation defined value are:
>
> u8"" u""
> u8"" U""
> u"" u8""
> u"" U""
> u"" L""
> U"" u8""
> U"" u""
> U"" L""
> L"" u""
> L"" U""
>

Let's entertain the idea that this should, for some reason, not be ill-formed.
What should it do?
Can you find a set of rules that people will not trip over?
Is it like rock paper scissors, but with u8, u and U instead?

I understand that we can make _anything_ well-formed or
conditionally supported.
We should ask ourselves why.
Is there a reason to allow that which outweighs the cost of the added
complexity?
We can (and I believe we should) make that ill-formed.
(Implementations sensibly do not support these things anyway.)
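
To make the contrast concrete, here is a minimal sketch of the cases that already have one obvious meaning: both literals share a prefix, or only one of them has a prefix. The helper `literal_length` is a hypothetical name introduced only for this illustration, and the mixed-prefix forms are left in a comment because, as noted above, implementations generally do not accept them.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical helper (not from the thread): number of code units in a
// string literal, excluding the terminating null.
template <typename T, std::size_t N>
constexpr std::size_t literal_length(const T (&)[N]) { return N - 1; }

// Same prefix on both sides: the result keeps that prefix.
static_assert(std::is_same_v<decltype(u8"a" u8"b"), decltype(u8"ab")>);

// Only one side has a prefix: the unprefixed literal adopts it,
// regardless of which side it appears on.
static_assert(std::is_same_v<decltype(u8"a" "b"), decltype(u8"ab")>);
static_assert(std::is_same_v<decltype("a" u8"b"), decltype(u8"ab")>);
static_assert(std::is_same_v<decltype(u"a" "b"), decltype(u"ab")>);
static_assert(std::is_same_v<decltype("a" U"b"), decltype(U"ab")>);
static_assert(std::is_same_v<decltype(L"a" "b"), decltype(L"ab")>);

// Mixed prefixes are deliberately not compiled here:
//   u8"a" L"b"   // ill-formed
//   u"a"  U"b"   // conditionally supported, implementation-defined
```

Every assertion above has exactly one plausible answer, which is the property the mixed-prefix combinations lack.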

> I would like all of the combinations that do not involve L"" to have a
> well defined value, whether or not we specify the encoding.
>
> I do not care to further constrain the L"" forms at this point.
>
> AlisdairM
>
>
> On Jul 8, 2020, at 12:35, Corentin Jabot <corentinjabot_at_[hidden]> wrote:
>
>
>
> On Wed, 8 Jul 2020 at 13:09, Alisdair Meredith via SG16 <
> sg16_at_[hidden]> wrote:
>
>> After taking another look over P2029 resolving a few core issues,
>> I am further concerned by [lex.string]p11, which states (among
>> other things) that concatenation of unicode string literals with
>> different encoding-prefixes is conditionally supported with
> >> implementation-defined behavior. That seems a little too free for
>> my tastes.
>>
>
> +1
>
>
>> I can buy conditionally supported, although see no harm in
>> requiring it for any combination of unicode encoding prefixes.
>> I am concerned about the implementation-defined behavior:
>> the end result should be the result of concatenating the
>> transcoded representation of each of the strings into a common
>> encoding, corresponding to one of the involved encoding
>> prefixes. I am happy to defer to implementations to choose
> >> between UTF8/16/32, or we could define a canonical preferred
>> ordering among those choices.
>>
>> Does this seem worth calling out (yet another SG16 paper) or
>> better left alone, as we already have way too much busy work
> >> on this group's plate, and implementations will most likely do the
>> right thing anyway?
>>
>
> It's called out in P2178
>
> From a user perspective, we allow too many things for any mental model to work.
>
> u8"" "" and "" u8"" are both utf-8 strings
>
> What is
>
> u8"" L""
>
> supposed to be? UTF-8? Wide? How can I tell? And it's not portable?
>
> There are only 2 models that make sense imo
> Either
> - Only the first string can have a prefix
> - Only one of the strings can have a prefix
>
> Note that, despite the terribly misleading wording, the strings that are
> concatenated do *not* have an associated encoding before concatenation;
> the compiler chooses a prefix after concatenation (this is yet another thing
> needing fixing; I think Tom started a thread a few days back)
>
> u8"" L"" is not utf-8 + wide (what would that mean?) but a string
> interpreted as either utf-8 or wide depending on the whims of the
> implementation.
>
> I think it needs fixing indeed :)
>
>
>>
>> (I am not overly concerned about specifying concatenation for
>> narrow/wide string literals with unicode string literals, which
>> can remain conditionally supported with implementation-defined
>> values.)
>>
>> AlisdairM
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
>
>
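
The behaviour Alisdair proposes in the quoted message (transcode each literal into a common encoding, then concatenate) can be approximated at run time. The following is only a sketch under stated assumptions: `utf8_to_utf32` and `concat_as_utf32` are hypothetical helpers, the input is assumed to be well-formed UTF-8, and UTF-32 is picked arbitrarily as the common encoding. It is not how any compiler implements literal concatenation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
// Assumes well-formed input; no validation or error reporting is done.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        // Number of continuation bytes implied by the lead byte.
        const std::size_t extra =
            lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        // Keep only the payload bits of the lead byte.
        char32_t cp = lead;
        if (extra > 0) cp = lead & (0x3F >> extra);
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k <= extra && i + k < in.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Hypothetical helper: transcode the UTF-8 part, then concatenate in the
// chosen common encoding (UTF-32 here, purely as an assumption).
std::u32string concat_as_utf32(const std::string& utf8_part,
                               const std::u32string& utf32_part) {
    return utf8_to_utf32(utf8_part) + utf32_part;
}
```

For example, `concat_as_utf32("\xC3\xA9", U"!")` yields the two code points U+00E9 and U+0021. Choosing UTF-8 or UTF-16 as the target instead would be equally valid under the proposal, which is exactly the ordering question the thread leaves open.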



SG16 list run by sg16-owner@lists.isocpp.org