Re: [SG16] Concatenating unicode string literals

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Wed, 8 Jul 2020 14:45:27 +0200
On Wed, 8 Jul 2020 at 13:56, Alisdair Meredith <alisdairm_at_[hidden]> wrote:

> Well, u8"" L"" is portably ill-formed, and should be diagnosed :)
> The corner cases that are conditionally supported with an
> implementation-defined value are:
> u8"" u""
> u8"" U""
> u"" u8""
> u"" U""
> u"" L""
> U"" u8""
> U"" u""
> U"" L""
> L"" u""
> L"" U""

Let's entertain that this should, for some reason, not be ill-formed.
What should it do?
Can you find a set of rules that people will not trip over?
Is it like rock paper scissors but with u8, u and U instead?

I understand that we can make _anything_ well-formed or
conditionally supported.
We should ask ourselves why?
Is there a reason to allow that, which surpasses the cost of the added
complexity?
We can (and I believe we should) make that ill-formed.
(Implementations sensibly do not support these things anyway)
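
As a rough illustration of the combinations nobody disputes, here is a minimal sketch in which exactly one literal carries an encoding-prefix and the concatenated result takes that prefix. The helpers element_type_of, length_of and check_concatenation are hypothetical names made up for this example. The mixed-prefix case is left commented out because its behaviour is exactly what is in question here.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical helper: yields the element type of a string literal.
// Declared only; it is used exclusively inside decltype.
template <class T, std::size_t N>
T element_type_of(const T (&)[N]);

// Hypothetical helper: number of code units, excluding the terminator.
template <class T, std::size_t N>
constexpr std::size_t length_of(const T (&)[N]) { return N - 1; }

// char before C++20, char8_t from C++20 on; deduced so both modes compile.
using Utf8Elem = decltype(element_type_of(u8""));

// One prefixed literal plus one unprefixed literal: the prefix wins,
// whichever side it appears on.
static_assert(std::is_same<decltype(element_type_of(u8"a" "b")), Utf8Elem>::value, "");
static_assert(std::is_same<decltype(element_type_of("a" u8"b")), Utf8Elem>::value, "");
static_assert(std::is_same<decltype(element_type_of(u"a" "b")), char16_t>::value, "");
static_assert(std::is_same<decltype(element_type_of("a" U"b")), char32_t>::value, "");
static_assert(std::is_same<decltype(element_type_of(L"a" "b")), wchar_t>::value, "");

// Mixed prefixes: conditionally supported with an implementation-defined
// value under the wording discussed in this thread, so not compiled here.
// auto mixed = u8"a" u"b";

// Runtime spot check that concatenation joins the code units as expected.
inline void check_concatenation() {
    assert(length_of(u8"ab" "cd") == 4);
    assert(length_of("ab" u"cd") == 4);
}
```

Both "prefix on the first string only" and "prefix on exactly one string" accept the first static_assert; only the latter accepts the second.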

> I would like all of the combinations that do not involve L"" to have a
> well defined value, whether or not we specify the encoding.
> I do not care to further constrain the L"" forms at this point.
> AlisdairM
> On Jul 8, 2020, at 12:35, Corentin Jabot <corentinjabot_at_[hidden]> wrote:
> On Wed, 8 Jul 2020 at 13:09, Alisdair Meredith via SG16 <
> sg16_at_[hidden]> wrote:
>> After taking another look over P2029 resolving a few core issues,
>> I am further concerned by [lex.string]p11, which states (among
>> other things) that concatenation of unicode string literals with
>> different encoding-prefixes is conditionally supported with
>> implementation-defined behavior. That seems a little too free for
>> my tastes.
> +1
>> I can buy conditionally supported, although see no harm in
>> requiring it for any combination of unicode encoding prefixes.
>> I am concerned about the implementation-defined behavior:
>> the end result should be the result of concatenating the
>> transcoded representation of each of the strings into a common
>> encoding, corresponding to one of the involved encoding
>> prefixes. I am happy to defer to implementations to choose
>> between UTF8/16/32, or we could define a canonical preferred
>> ordering among those choices.
>> Does this seem worth calling out (yet another SG16 paper) or
>> better left alone, as we already have way too much busy work
>> on this group's plate, and implementations will most likely do the
>> right thing anyway?
> It's called out in P2178
> From a user perspective, we allow too many things for any mental model to work.
> u8"" "" and "" u8"" are both UTF-8 strings.
> What is
> u8"" L""
> supposed to be? UTF-8? Wide? How can I tell? And it's not portable?
> There are only 2 models that make sense imo
> Either
> - Only the first string can have a prefix
> - Only one of the strings can have a prefix
> Note that, despite the terribly misleading wording, the strings that are
> concatenated do *not* have an associated encoding before concatenation;
> the compiler chooses a prefix after concatenation (this is yet another thing
> needing fixing; I think Tom started a thread a few days back).
> u8"" L"" is not utf-8 + wide (what would that mean?) but a string
> interpreted as either utf-8 or wide depending on the whims of the
> implementation.
> I think it needs fixing indeed :)
>> (I am not overly concerned about specifying concatenation for
>> narrow/wide string literals with unicode string literals, which
>> can remain conditionally supported with implementation-defined
>> values.)
>> AlisdairM
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2020-07-08 07:48:53