Re: [wg14/wg21 liaison] [SG16] [ WG14 ] Mixed Wide String Literals

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 9 Dec 2020 16:03:04 -0500
On 12/9/20 2:22 PM, Corentin Jabot via SG16 wrote:
> On Wed, Dec 9, 2020, 05:52 Tom Honermann via SG16
> <sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
> On 12/8/20 11:21 AM, JeanHeyd Meneide via SG16 wrote:
>> Dear Tom,
>> On Sun, Dec 6, 2020 at 5:41 PM Tom Honermann <tom_at_[hidden]
>> <mailto:tom_at_[hidden]>> wrote:
>> On 12/4/20 10:48 AM, JeanHeyd Meneide via SG16 wrote:
>>> BUT!
>>> WG14 also wants to look into -- after removing it -- spending
>>> some time coming up with a well-defined combination mechanism. We would
>>> like to have ill-formedness (C++) / constraint violations (C) if the
>>> conversion to the final chosen encoding does not work, to give people
>>> good guarantees about e.g. synthesizing Unicode into a wide string
>>> literal or synthesizing wide string data into Unicode literals.
>> Was any particular use case discussed? Or just a general
>> preference to, given something like this:
>> #define NAME u8"foo"
>> #define VERSION "5"
>> void emit(const char16_t*);
>> to be able to do something like this:
>> emit(u"" NAME "-" VERSION);
>> This, and the ability to take "implementation defined" text
>> in either L"" or "" strings (e.g., from external headers you
>> don't control or generated code) and force them to be a Unicode
>> encoding by doing something similar to the above -- prefixing with a u"". I
>> think there's some utility to be had here when it comes to
>> interfacing with legacy code and wanting to make sure macro
>> expansions and code generation can be done well. This probably
>> applies a lot more for C than C++, where references and
>> `constexpr` arrays/string_views are generally used over macros
>> and direct string literals.
>> My best thought so far is that "first prefix is the chosen
>> encoding". This also means you can write Unicode String literals
>> and attempt to merge them into the desired narrow character set
>> or wide character set (for compile-time). This has the
>> advantage that, if you know the data is going to go into
>> a function that takes a `char*` or a `wchar_t*`, you can get a string
>> literal into that encoding. Because we're also making the
>> behavior well-defined, we can specify it as a constraint
>> violation / ill-formed if any code point in the trailing
>> string tokens is not representable in the first-specified
>> encoding. (The usual "escaped from safety" rules for "\x"
>> sequences would apply here too.)
> Got it, thanks. I agree there are use cases for the behavior you
> describe of effectively ignoring any prefixes after the first
> one. One complication is that "" L"" is equivalent to L"" L""
> (and L"" "") today; even after P2201 and N2594. That means
> prefixing an unadorned string literal won't have the effect of
> forcing an encoding. That would be inconsistent, but maybe that is ok.
> How frequent is that pattern in actual production code?

I have no idea, but I would hope it is exceedingly infrequent :)
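
For anyone who wants to check the current behavior on their own compiler, here is a minimal sketch of the unadorned-plus-prefixed case described above. The macro names PRODUCT and VERSION and the helper same_text are made up for illustration, and the macros are deliberately unprefixed so the example stays within concatenations that are well-defined today (it does not mix two different prefixes such as u8"" and u"").

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical macros standing in for text from a header we do not control.
#define PRODUCT "foo"
#define VERSION "5"

// An unprefixed literal adjacent to a prefixed one takes on that prefix,
// regardless of which of the two comes first in the sequence.
using prefix_first = decltype(L"" PRODUCT "-" VERSION);
using prefix_last  = decltype(PRODUCT "-" VERSION L"");

static_assert(std::is_same<prefix_first, const wchar_t (&)[6]>::value,
              "leading L\"\" makes the whole literal wide");
static_assert(std::is_same<prefix_first, prefix_last>::value,
              "trailing L\"\" has the same effect as a leading one");

// Hypothetical helper: element-wise comparison of two equally sized arrays,
// written recursively so it is usable in constant expressions since C++11.
template <typename C, std::size_t N>
constexpr bool same_text(const C (&a)[N], const C (&b)[N], std::size_t i = 0) {
    return i == N ? true : (a[i] == b[i] && same_text(a, b, i + 1));
}

// Prefixing unadorned literals with u"" yields a char16_t literal.
static_assert(same_text(u"" PRODUCT "-" VERSION, u"foo-5"),
              "u\"\" prefix applies to the unprefixed pieces");
```

Because the position of the prefixed token does not matter for unadorned literals, a "first prefix is the chosen encoding" rule would have to treat "" L"" as a special case or change its meaning, which is the inconsistency noted above.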


> Tom.
> --
> SG16 mailing list
> SG16_at_[hidden] <mailto:SG16_at_[hidden]>
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2020-12-09 15:03:08