Subject: Re: Concatenating unicode string literals
From: Tom Honermann (tom_at_[hidden])
Date: 2020-07-08 11:54:47
On 7/8/20 12:22 PM, Corentin Jabot via SG16 wrote:
> On Wed, 8 Jul 2020 at 18:07, Tom Honermann <tom_at_[hidden]> wrote:
> On 7/8/20 8:45 AM, Corentin Jabot wrote:
>> On Wed, 8 Jul 2020 at 13:56, Alisdair Meredith <alisdairm_at_[hidden]> wrote:
>> Well, u8"" L"" is portably ill-formed, and should be
>> diagnosed :)
>> The corner cases that are conditionally supported with an
>> implementation defined value are:
>>  u8"" u""
>>  u8"" U""
>>  u"" u8""
>>  u"" U""
>>  u"" L""
>>  U"" u8""
>>  U"" u""
>>  U"" L""
>>  L"" u""
>>  L"" U""
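As a rough illustration of the list above (a sketch, not standard wording): the concatenations below use matching prefixes and are well-formed everywhere, while a few of the mixed-prefix forms are left in comments because, under the wording being discussed, a conforming compiler may reject them. The helper name `code_units` is made up for this example, and C++17 or later is assumed.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical helper: number of code units in a literal, excluding the
// terminating null.
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// Matching prefixes: well-formed, and the result has the same type as the
// equivalent single literal.
static_assert(std::is_same_v<decltype(u8"ab" u8"cd"), decltype(u8"abcd")>);
static_assert(std::is_same_v<decltype(u"ab" u"cd"), decltype(u"abcd")>);
static_assert(std::is_same_v<decltype(U"ab" U"cd"), decltype(U"abcd")>);
static_assert(std::is_same_v<decltype(L"ab" L"cd"), decltype(L"abcd")>);
static_assert(code_units(u"ab" u"cd") == 4);

// Mixed prefixes from the list above: conditionally-supported with an
// implementation-defined result, so they are kept out of the build here.
// constexpr auto& a = u8"ab" u"cd";
// constexpr auto& b = u"ab" U"cd";
// constexpr auto& c = U"ab" L"cd";
```

Uncommenting any of the last three lines is a quick way to see what a given compiler does with them.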
>> Let's entertain that this should, for some reason, not be ill-formed.
>> What should it do?
>> Can you find a set of rules that people will not trip over?
>> Is it like rock paper scissors, but with u8, u and U instead?
> I can sort-of kind-of see a use case for allowing one of u"" or
> U"" to be concatenated with L"" as a conditionally-supported
> feature when the wide execution encoding is a match (UTF-16 for
> u"" or UTF-32 for U""). This might be useful in strange situations
> where string concatenation is desired and one of the components is
> provided by a macro expansion. The question then is what the type
> of the string literal is. The only model I can see working there
> is to adopt the type from the first component and ignore the
> encoding-prefix from the remaining ones.
> I would not be heartbroken over breaking code that does this.
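A minimal sketch of the macro scenario, assuming the macro can be left unprefixed: code that wants a particular encoding can then supply the prefix itself and stay within the well-formed combinations. `PRODUCT_NAME`, `PASTE_PREFIX`, `wide_banner` and `utf16_name` are hypothetical names invented for this example.

```cpp
// Hypothetical macro that some other header provides. It is deliberately
// unprefixed so that it can combine with any encoding-prefix.
#define PRODUCT_NAME "sg16"

// Two-step paste so that the macro argument is expanded before ## applies.
#define PASTE_PREFIX_(prefix, literal) prefix##literal
#define PASTE_PREFIX(prefix, literal) PASTE_PREFIX_(prefix, literal)

// One prefixed literal plus an unprefixed one: well-formed, result is wide.
constexpr wchar_t wide_banner[] = L"hello " PRODUCT_NAME;

// Pasting a prefix onto the macro's literal yields u"sg16".
constexpr char16_t utf16_name[] = PASTE_PREFIX(u, PRODUCT_NAME);

static_assert(sizeof(wide_banner) / sizeof(wchar_t) == 11);  // 10 + null
static_assert(sizeof(utf16_name) / sizeof(char16_t) == 5);   // 4 + null

// If PRODUCT_NAME were instead defined as u"sg16", then
//   L"hello " PRODUCT_NAME
// would be a mixed-prefix concatenation. Under the "first prefix wins" model
// described above it would presumably be treated like L"hello sg16", but
// that is a reading of the proposal, not something compilers promise today.
```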
>> I understand that we can make _anything_ well-formed or
>> conditionally supported.
>> We should ask ourselves why?
>> Is there a reason to allow that, which surpasses the cost of the
>> added complexity?
>> We can (and I believe we should) make that ill-formed.
>> (Implementations sensibly do not support these things anyway)
> Can you summarize what implementations do today? I haven't
> They do not support the combinations not mandated by the standard
Which implementations did you check? Clang, gcc, MSVC, and icc?
>> I would like all of the combinations that do not involve L""
>> to have a
>> well defined value, whether or not we specify the encoding.
> Well-defined or perhaps ill-formed?
> This is another case where I think the feature makes little or no
> sense, but, unless shown otherwise, doesn't cause problems in
> practice and should therefore be treated as a low priority issue
> relative to other things we could be working on.
>> I do not care to further constrain the L"" forms at this point.
>>> On Jul 8, 2020, at 12:35, Corentin Jabot
>>> <corentinjabot_at_[hidden]>
>>> On Wed, 8 Jul 2020 at 13:09, Alisdair Meredith via SG16
>>> <sg16_at_[hidden]> wrote:
>>> After taking another look over P2029 resolving a few
>>> core issues,
>>> I am further concerned by [lex.string]p11, which states (among
>>> other things) that concatenation of unicode string
>>> literals with
>>> different encoding-prefixes is conditionally supported with
>>> implementation-defined behavior. That seems a little too
>>> free for
>>> my tastes.
>>> I can buy conditionally supported, although see no harm in
>>> requiring it for any combination of unicode encoding
>>> I am concerned about the implementation-defined behavior:
>>> the end result should be the result of concatenating the
>>> transcoded representation of each of the strings into a
>>> encoding, corresponding to one of the involved encoding
>>> prefixes. I am happy to defer to implementations to choose
>>> between UTF8/16/32, or we could define a canonical preferred
>>> ordering among those choices.
>>> Does this seem worth calling out (yet another SG16 paper) or
>>> better left alone, as we already have way too much busy work
>>> on this group's plate, and implementations will most
>>> likely do the
>>> right thing anyway?
>>> It's called out in P2178
>>> from a user perspective, we allow too many things for any
>>> mental model to work
>>> u8"" "" and "" u8"" are both utf-8 strings
>>> u8"" L""
>>> is supposed to be? utf-8? wide? How can I tell? And it's
>>> not portable?
>>> There are only 2 models that make sense imo
>>> - Only the first string can have a prefix
>>> - Only one of the strings can have a prefix
>>> Note that, despite the terribly misleading wording the
>>> strings that are concatenated do *not* have an associated
>>> encoding before concatenation,
>>> the compiler chooses a prefix after concatenation (this is
>>> yet another thing needing fixing, I think Tom started a thread a
>>> few days back)
>>> u8"" L"" is not utf-8 + wide (what would that mean?) but a
>>> string interpreted as either utf-8 or wide depending on
>>> the whims of the implementation.
>>> I think it needs fixing indeed :)
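To make the "prefix is chosen after concatenation" point concrete, here is a small sketch (C++17 or later assumed; `code_units` is a made-up helper). It only uses the well-formed case where a single component carries a prefix, and checks that the position of that prefix does not change the resulting type or the UTF-8 length.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical helper: number of code units in a literal, excluding the
// terminating null.
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// The u8 prefix may sit on either component; the concatenated literal has
// the same type as the single literal u8"caf\u00e9" in both cases.
static_assert(
    std::is_same_v<decltype(u8"caf" "\u00e9"), decltype(u8"caf\u00e9")>);
static_assert(
    std::is_same_v<decltype("caf" u8"\u00e9"), decltype(u8"caf\u00e9")>);

// U+00E9 takes two UTF-8 code units, so the whole literal has 3 + 2 = 5,
// including the part that was spelled without a prefix.
static_assert(code_units(u8"caf" "\u00e9") == 5);
static_assert(code_units("caf" u8"\u00e9") == 5);
```

The unprefixed component does not keep a separate narrow encoding; it is encoded as part of the UTF-8 literal, which is the behaviour the mixed-prefix cases lack a portable answer for.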
>>> (I am not overly concerned about specifying
>>> concatenation for
>>> narrow/wide string literals with unicode string
>>> literals, which
>>> can remain conditionally supported with