Dear Tom,

On Sun, Dec 6, 2020 at 5:41 PM Tom Honermann <tom@honermann.net> wrote:

On 12/4/20 10:48 AM, JeanHeyd Meneide via SG16 wrote:
     BUT!

     WG14 also wants to look into -- after removing it -- spending
some time coming up with a well-defined combination mechanism. We would
like to have ill-formedness (C++) / constraint violations (C) if the
conversion to the final chosen encoding does not work, to give people
good guarantees about e.g. synthesizing Unicode into a wide string
literal or synthesizing wide string data into Unicode literals.
Was any particular use case discussed? Or just a general preference to, given something like this:

#define NAME u8"foo"
#define VERSION "5"
void emit(const char16_t*);

to be able to do something like this:

emit(u"" NAME "-" VERSION);

This, and the ability to take "implementation defined" text in either L"" or "" strings (e.g., from external headers you don't control or generated code) and force it into a Unicode encoding by doing something similar to the above -- prefixing with a u"". I think there's some utility to be had here when it comes to interfacing with legacy code and wanting to make sure macro expansions and code generation can be done well. This probably applies a lot more to C than to C++, where references and `constexpr` arrays/string_views are generally used over macros and direct string literals.
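For reference, here is a minimal C11 sketch of the part of this that already works today: a leading u"" in front of unprefixed literals. The macro values, NAME_VERSION_U16, and u16_length are made up for illustration. Note that NAME is deliberately unprefixed here; the mixed u"" + u8"" form from the example above is exactly the case that is not accepted today, as far as I know.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>  /* size_t */
#include <uchar.h>   /* char16_t */

/* Hypothetical macros standing in for text from a header you do not control. */
#define NAME "foo"
#define VERSION "5"

/* The leading u"" gives the whole concatenation the u prefix,
   so the result is a char16_t string literal. */
#define NAME_VERSION_U16 (u"" NAME "-" VERSION)

/* "foo-5" is five code units plus the terminator. */
static_assert(sizeof(NAME_VERSION_U16) == 6 * sizeof(char16_t),
              "unexpected size for the concatenated literal");

/* Hypothetical consumer: counts code units up to the terminator. */
static size_t u16_length(const char16_t *s) {
    size_t n = 0;
    while (s[n] != 0) {
        ++n;
    }
    return n;
}
```

Calling u16_length(NAME_VERSION_U16) walks the char16_t data produced by the concatenation. Whether those code units are actually UTF-16 is, in C11, only guaranteed when __STDC_UTF_16__ is defined.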

My best thought so far is that "first prefix is the chosen encoding". This also means you can write Unicode string literals and attempt to merge them into the desired narrow character set or wide character set (at compile time). One advantage is that if you know the data is going into a function that takes a `char*` or a `wchar_t*`, you can get a string literal in that encoding. Because we're also making the behavior well-defined, we can specify it as a constraint violation / ill-formed if any code point in the trailing string tokens is not representable in the first-specified encoding. (The usual "escaped from safety" rules for "\x" sequences would apply here too.)

That would be good to get on the agenda for the next WG14 meeting. It would be great if we can start making progress on all of the following issues! My understanding is that any proposals targeting C2X must be proposed by August 27th, 2021.

* WG14: Make char16_t/char32_t string literals be UTF-16/32
* WG14 N2231: char8_t: A type for UTF-8 characters and strings
* WG14: Improve support for Unicode characters in identifiers

I can do the first bullet (the C version for C++'s p1041).

I am sort of already helping with the second bullet. C approved me pursuing Non-UTF and UTF conversions (https://thephd.github.io/_vendor/future_cxx/papers/C%20-%20Efficient%20UTF%20Character%20Conversions.html and https://thephd.github.io/_vendor/future_cxx/papers/C%20-%20Efficient%20Character%20Conversions.html). In these function signatures, I used "unsigned char" for the c8 functions; they did not approve any wording yet, but approved of the direction. I'm going to attempt to get my C library, plus implementations in musl-libc and glibc, going in 2021. This means that when `char8_t` shows up as a typedef and the string literals are fixed up, everything can work out nicely and we can seamlessly change from unsigned char to char8_t in those functions!

The last bullet can totally be someone else's work! 😁

Sincerely,

JeanHeyd