Subject: Re: [ WG14 ] Mixed Wide String Literals
From: Alisdair Meredith (alisdairm_at_[hidden])
Date: 2020-12-10 09:51:19
Comment in context WAY below...
> On Dec 9, 2020, at 16:03, Tom Honermann via SG16 <sg16_at_[hidden]> wrote:
> On 12/9/20 2:22 PM, Corentin Jabot via SG16 wrote:
>> On Wed, Dec 9, 2020, 05:52 Tom Honermann via SG16 <sg16_at_[hidden]> wrote:
>> On 12/8/20 11:21 AM, JeanHeyd Meneide via SG16 wrote:
>>> Dear Tom,
>>> On Sun, Dec 6, 2020 at 5:41 PM Tom Honermann <tom_at_[hidden]> wrote:
>>> On 12/4/20 10:48 AM, JeanHeyd Meneide via SG16 wrote:
>>>> WG14 also wants to look into -- after removing it -- spending
>>>> some time coming up with well-defined combination mechanism. We would
>>>> like to have ill-formedness (C++) / constraint violations (C) if the
>>>> conversion to the final chosen encoding does not work, to give people
>>>> good guarantees about e.g. synthesizing Unicode into a wide string
>>>> literal or synthesizing wide string data into Unicode literals.
>>> Was any particular use case discussed? Or just a general preference to, given something like this:
>>> #define NAME u8"foo"
>>> #define VERSION "5"
>>> void emit(const char16_t*);
>>> to be able to do something like this:
>>> emit(u"" NAME "-" VERSION);
>>> This, and the ability to take "implementation defined" text in either L"" or "" strings (e.g., from external headers you don't control or generated code) and force them to be a Unicode encoding by doing something similar to the above -- prefixing with a u"". I think there's some utility to be had here when it comes to interfacing with legacy code and wanting to make sure macro expansions and code generation can be done well. This probably applies a lot more for C than C++, where references and `constexpr` arrays/string_views are generally used over macros and direct string literals.
>>> My best thought so far is that "first prefix is the chosen encoding". This also means you can write Unicode string literals and attempt to merge them into the desired narrow character set or wide character set (at compile time). This has the advantage that if you know the data is going to go into a function that takes a `char*` or a `wchar_t*`, you can get a string literal into that encoding. Because we're also making the behavior well-defined, we can specify it as a constraint violation / ill-formed if the code points in any of the trailing string tokens are not representable in the first-specified encoding. (The usual "escaped from safety" rules for "\x" sequences would apply here too.)
>> Got it, thanks. I agree there are use cases for the behavior you describe of effectively ignoring any prefixes after the first one. One complication is that "" L"" is equivalent to L"" L"" (and L"" "") today; even after P2201 and N2594. That means prefixing an unadorned string literal won't have the effect of forcing an encoding. That would be inconsistent, but maybe that is ok.
>> How frequent is that pattern in actual production code?
> I have no idea, but I would hope it is exceedingly infrequent :)
My suspicion is that where such patterns do occur in practice, they are likely masked by macro use: a macro concatenates a user-supplied literal with a plain string literal.
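
To make that concrete, here is a minimal sketch of the kind of macro I have in mind. The names TAGGED, tagged_wide, all_wide, expansions_agree, and check are made up for illustration and are not taken from any real code base. It covers only the unprefixed-plus-wide combination, which is accepted today; a genuinely mixed case such as L"" next to u8"" is deliberately left out because that is the combination under discussion.

```cpp
#include <cassert>
#include <cwchar>

// Hypothetical macro: glues a plain (unprefixed) tag onto a caller-supplied literal.
#define TAGGED(msg) "[app] " msg

// The caller passes a wide literal, so the expansion is "[app] " L"started".
// The unprefixed piece adopts the L prefix and the whole literal is wide.
inline const wchar_t* tagged_wide() { return TAGGED(L"started"); }

// Writing the prefix on both pieces spells the same literal.
inline const wchar_t* all_wide() { return L"[app] " L"started"; }

// 13 characters plus the terminator, each the size of wchar_t.
static_assert(sizeof(TAGGED(L"started")) == 14 * sizeof(wchar_t),
              "plain + wide concatenation produces a wide literal");

// True when the macro expansion and the fully prefixed form hold the same text.
inline bool expansions_agree() {
    return std::wcscmp(tagged_wide(), all_wide()) == 0;
}

// Convenience check for callers that prefer an assertion.
inline void check() { assert(expansions_agree()); }
```

Nothing at the call site TAGGED(L"started") shows that a plain literal and a wide literal are being concatenated, which is why I would expect a search for adjacent literals with different prefixes to undercount the pattern.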