sg16: Re: [SG16] [ WG14 ] Mixed Wide String Literals

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 8 Dec 2020 23:52:16 -0500

On 12/8/20 11:21 AM, JeanHeyd Meneide via SG16 wrote:
> Dear Tom,
>
> On Sun, Dec 6, 2020 at 5:41 PM Tom Honermann <tom_at_[hidden]> wrote:
>
> On 12/4/20 10:48 AM, JeanHeyd Meneide via SG16 wrote:
>> BUT!
>>
>> WG14 also wants to look into -- after removing it -- spending
>> some time coming up with a well-defined combination mechanism. We would
>> like to have ill-formedness (C++) / constraint violations (C) if the
>> conversion to the final chosen encoding does not work, to give people
>> good guarantees about e.g. synthesizing Unicode into a wide string
>> literal or synthesizing wide string data into Unicode literals.
>
> Was any particular use case discussed? Or just a general
> preference to, given something like this:
>
> #define NAME u8"foo"
> #define VERSION "5"
>
> void emit(const char16_t*);
>
> to be able to do something like this:
>
> emit(u"" NAME "-" VERSION);
>
>
> This, and the ability to take "implementation defined" text in
> either L"" or "" strings (e.g., from external headers you don't
> control or generated code) and force them to be a Unicode encoding by
> doing something similar to the above -- prefixing with a u"". I think there's
> some utility to be had here when it comes to interfacing with legacy
> code and wanting to make sure macro expansions and code generation can
> be done well. This probably applies a lot more for C than C++, where
> references and `constexpr` arrays/string_views are generally used over
> macros and direct string literals.
>
> My best thought so far is that "first prefix is the chosen
> encoding". This also means you can write Unicode String literals and
> attempt to merge them into the desired narrow character set or wide
> character set (for compile-time). This has the advantage that if you
> know the data is going to go into a function taking a `char*` or a
> `wchar_t*`, you can get a string literal into that encoding. Because
> we're also making the behavior well-defined, we can specify it as a
> constraint violation / ill-formed if any code point in the trailing
> string tokens is not representable in the first-specified encoding.
> (The usual "escaped from safety" rules for "\x" sequences would apply
> here too.)

Got it, thanks. I agree there are use cases for the behavior you
describe of effectively ignoring any prefixes after the first one. One
complication is that "" L"" is equivalent to L"" L"" (and L"" "") today,
even after P2201 and N2594. That means prefixing an unadorned string
literal won't have the effect of forcing an encoding. That would be
inconsistent, but maybe that is ok.
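
To make the complication concrete, here is a minimal sketch of today's
behavior as I understand it. It only uses unprefixed and L-prefixed
tokens, so it does not depend on the mixed-prefix cases under
discussion. The helper names same_wide and check_at_runtime are
hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <type_traits>

// An unprefixed token adjacent to an L-prefixed one takes on the L prefix,
// no matter which token comes first. All three spellings below produce an
// array of three const wchar_t (two characters plus the terminator).
static_assert(std::is_same<decltype("a" L"b"), const wchar_t (&)[3]>::value,
              "unprefixed-then-wide concatenation is wide");
static_assert(std::is_same<decltype(L"a" "b"), const wchar_t (&)[3]>::value,
              "wide-then-unprefixed concatenation is wide");
static_assert(std::is_same<decltype(L"a" L"b"), const wchar_t (&)[3]>::value,
              "wide-then-wide concatenation is wide");

// Consequently, a leading "" does not force the ordinary (narrow) encoding:
// the result here is still a wide string, not const char[2].
static_assert(std::is_same<decltype("" L"x"), const wchar_t (&)[2]>::value,
              "a leading empty unprefixed literal does not force narrow");

// Hypothetical helper: compares two null-terminated wide strings.
inline bool same_wide(const wchar_t* a, const wchar_t* b) {
    while (*a != 0 && *a == *b) {
        ++a;
        ++b;
    }
    return *a == *b;
}

// Hypothetical runtime check that the concatenated contents match too.
inline void check_at_runtime() {
    assert(same_wide("a" L"b", L"ab"));
    assert(same_wide(L"a" "b", L"ab"));
}
```

Under a "first prefix is the chosen encoding" rule, one might expect
"" L"x" to come out narrow, which is the inconsistency I mean: the
unprefixed case would presumably keep deferring to the prefixed token.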

Tom.

Received on 2020-12-08 22:52:19