Subject: Re: [ WG14 ] Mixed Wide String Literals
From: Tom Honermann (tom_at_[hidden])
Date: 2020-12-08 22:39:01
On 12/8/20 11:21 AM, JeanHeyd Meneide via SG16 wrote:
> Dear Tom,
> On Sun, Dec 6, 2020 at 5:41 PM Tom Honermann <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
> On 12/4/20 10:48 AM, JeanHeyd Meneide via SG16 wrote:
>> WG14 also wants to look into -- after removing it -- spending
>> some time coming up with a well-defined combination mechanism. We would
>> like to have ill-formedness (C++) / constraint violations (C) if the
>> conversion to the final chosen encoding does not work, to give people
>> good guarantees about e.g. synthesizing Unicode into a wide string
>> literal or synthesizing wide string data into Unicode literals.
> Was any particular use case discussed? Or just a general
> preference to, given something like this:
> #define NAME u8"foo"
> #define VERSION "5"
> void emit(const char16_t*);
> to be able to do something like this:
> emit(u"" NAME "-" VERSION);
> This, and the ability to take "implementation defined" text in
> either L"" or "" strings (e.g., from external headers you don't
> control or generated code) and force them to be a Unicode encoding by
> doing something similar to the above -- prefixing with a u"". I think there's
> some utility to be had here when it comes to interfacing with legacy
> code and wanting to make sure macro expansions and code generation can
> be done well. This probably applies a lot more for C than C++, where
> references and `constexpr` arrays/string_views are generally used over
> macros and direct string literals.
> My best thought so far is that "first prefix is the chosen
> encoding". This also means you can write Unicode string literals and
> attempt to merge them into the desired narrow character set or wide
> character set (at compile time). This has the advantage that if you
> know the data is going to go into a function taking a
> `char*` or a `wchar_t*`, you can get a string literal into that
> encoding. Because we're also making the behavior well-defined, we can
> specify it as a constraint violation / ill-formed if the code points
> in any of the trailing string tokens are not representable in the
> first-specified encoding. (The usual "escaped from safety" rules for
> "\x" sequences would apply here too.)
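For reference, a minimal sketch of the one part of this that C11 already defines: an unprefixed literal placed next to a u"" literal takes on the u prefix. The VERSION macro and the names `tag` and `tag_length` are hypothetical, chosen only for illustration. Mixing differently-prefixed tokens (such as the u8"foo" NAME case above) is the part under discussion and is deliberately not shown here.

```c
#include <stddef.h>
#include <uchar.h>

/* Hypothetical macro standing in for text from a header you do not control. */
#define VERSION "5"

/* Only one token carries a prefix (u), so per C11 6.4.5 the whole
   concatenation is treated as a char16_t string literal. */
static const char16_t tag[] = u"" "v" VERSION;

/* Number of char16_t code units before the terminating null. */
static size_t tag_length(void) {
    return sizeof tag / sizeof tag[0] - 1;
}
```

With the "first prefix is the chosen encoding" idea, the leading u"" would play the same role even when the trailing tokens carry their own prefixes, but that behavior is a proposal, not something this sketch relies on.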
> That would be good to get on the agenda for the next WG14
> meeting. It would be great if we can start making progress on all
> of the following issues! My understanding is that any proposals
> targeting C2X must be proposed by August 27th, 2021.
> * WG14: Make char16_t/char32_t string literals be UTF-16/32
> * WG14 N2231: char8_t: A type for UTF-8 characters and strings
> * WG14: Improve support for Unicode characters in identifiers
> I can do the first bullet (the C version for C++'s p1041).
> I sort of already am helping with the second bullet. C approved
> me pursuing Non-UTF and UTF conversions
> In these function signatures, I used "unsigned char" for the c8
> functions; they did not approve any wording yet, but approved of the
> direction. I'm going to be attempting to get my C library, and
> implementations in musl-libc and glibc going in 2021. This means that
> when `char8_t` shows up as a typedef and the string literals are fixed
> up, everything can work out nicely and we can seamlessly change from
> unsigned char to char8_t in those functions!
Cool. I really do hope to get a revision of N2231
<http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2231.htm> wrapped up soon.
> The last bullet can totally be someone else's work!
Hmm, who might be a good candidate for that one? Steve? Steve. Steve!