sg16: Re: [SG16] [ WG14 ] Mixed Wide String Literals

From: JeanHeyd Meneide <phdofthehouse_at_[hidden]>
Date: Tue, 8 Dec 2020 11:21:27 -0500

Dear Tom,

On Sun, Dec 6, 2020 at 5:41 PM Tom Honermann <tom_at_[hidden]> wrote:

> On 12/4/20 10:48 AM, JeanHeyd Meneide via SG16 wrote:
>
> BUT!
>
> WG14 also wants to look into -- after removing it -- spending
> some time coming up with a well-defined combination mechanism. We would
> like to have ill-formedness (C++) / constraint violations (C) if the
> conversion to the final chosen encoding does not work, to give people
> good guarantees about e.g. synthesizing Unicode into a wide string
> literal or synthesizing wide string data into Unicode literals.
>
> Was any particular use case discussed? Or just a general preference to,
> given something like this:
>
> #define NAME u8"foo"
> #define VERSION "5"
>
> void emit(const char16_t*);
>
> to be able to do something like this:
>
> emit(u"" NAME "-" VERSION);
>
>
    This, and the ability to take "implementation defined" text in either
L"" or "" strings (e.g., from external headers you don't control or
generated code) and force it into a Unicode encoding by doing something
similar to the above: prefixing with a u"". I think there's some utility to
be had here when it comes to interfacing with legacy code and wanting to make
sure macro expansions and code generation can be done well. This probably
applies a lot more to C than C++, where references and `constexpr`
arrays/string_views are generally used over macros and direct string
literals.

     My best thought so far is that "first prefix is the chosen encoding".
This also means you can write Unicode string literals and attempt to merge
them into the desired narrow or wide character set at compile time. The
advantage is that if you know the data is going into a function that takes
a `char*` or a `wchar_t*`, you can get a string literal in that encoding.
Because we're also making the behavior well-defined, we can specify it as a
constraint violation / ill-formed if any code point in the trailing string
tokens is not representable in the first-specified encoding. (The usual
"escaped from safety" rules for "\x" sequences would apply here too.)

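As a rough illustration of the "first prefix is the chosen encoding" idea, the C sketch below stays inside the subset that C11 already defines: unprefixed tokens adjacent to a single prefixed token take on that prefix. The macros `NAME`, `VERSION` and `PRODUCT_ID` and the helper `length16` are hypothetical names used only for this example. `NAME` is deliberately left unprefixed, because mixing a u8"" token with a u"" token is, as far as I can tell, a constraint violation in C11, and that is exactly the case a new rule would have to define.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* size_t */
#include <uchar.h>    /* char16_t */

/* Hypothetical macros standing in for text from a header you don't control. */
#define NAME "foo"
#define VERSION "5"

/* The leading u"" gives the whole concatenation the u prefix, so the
   result is an array of char16_t rather than char. */
#define PRODUCT_ID u"" NAME "-" VERSION

/* "foo-5" is five characters plus the terminator, each one char16_t wide. */
static_assert(sizeof(PRODUCT_ID) == 6 * sizeof(char16_t),
              "concatenation should produce a char16_t array of 6 elements");

/* Hypothetical helper: counts char16_t code units before the terminator. */
size_t length16(const char16_t *s)
{
    size_t n = 0;
    while (s[n] != 0) {
        ++n;
    }
    return n;
}
```

With these definitions, `length16(PRODUCT_ID)` yields 5. Under the rule sketched above, the same spelling would keep working if `NAME` were a u8"" or L"" literal, and would be rejected at compile time if a code point could not be represented.
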
> That would be good to get on the agenda for the next WG14 meeting. It
> would be great if we can start making progress on all of the following
> issues! My understanding is that any proposals targeting C2X must be
> proposed by August 27th, 2021.
>
> - WG14: Make char16_t/char32_t string literals be UTF-16/32
> <https://github.com/sg16-unicode/sg16/issues/54>
> - WG14 N2231: char8_t: A type for UTF-8 characters and strings
> <https://github.com/sg16-unicode/sg16/issues/5>
> - WG14: Improve support for Unicode characters in identifiers
> <https://github.com/sg16-unicode/sg16/issues/56>
>
> I can do the first bullet (the C version for C++'s p1041).

     I sort of already am helping with the second bullet. C approved me
pursuing Non-UTF and UTF conversions (
https://thephd.github.io/_vendor/future_cxx/papers/C%20-%20Efficient%20UTF%20Character%20Conversions.html
and
https://thephd.github.io/_vendor/future_cxx/papers/C%20-%20Efficient%20Character%20Conversions.html).
In these function signatures, I used "unsigned char" for the c8 functions;
WG14 has not approved any wording yet, but approved of the direction. I'm
going to try to get my C library, plus implementations in musl-libc and
glibc, going in 2021. This means that when `char8_t` shows up as a typedef
and the string literals are fixed up, everything can work out nicely and we
can seamlessly change from unsigned char to char8_t in those functions!
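
To make the "seamless change" point concrete, here is a minimal C sketch, assuming `char8_t` ends up as a typedef of `unsigned char`. The names `sketch_char8_t`, `count_code_points` and `demo_count` are hypothetical and are not taken from the linked papers; the typedef uses a made-up name so it cannot clash with a real `<uchar.h>`. The function also assumes well-formed UTF-8 input.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* size_t */

/* Assumption: stands in for an eventual `typedef unsigned char char8_t;`. */
typedef unsigned char sketch_char8_t;

static_assert(sizeof(sketch_char8_t) == 1, "one byte per UTF-8 code unit");

/* Hypothetical function written today against unsigned char.
   Counts code points by skipping UTF-8 continuation bytes (10xxxxxx). */
size_t count_code_points(const unsigned char *s)
{
    size_t n = 0;
    for (; *s != 0; ++s) {
        if ((*s & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

/* The same function declared with the typedef. Because the typedef names
   the same type, this is a compatible redeclaration, not a new overload. */
size_t count_code_points(const sketch_char8_t *s);

/* Callers holding typedef'd data need no casts. */
size_t demo_count(void)
{
    /* 'f' followed by U+00E9 encoded as the two bytes C3 A9 */
    static const sketch_char8_t text[] = { 0x66, 0xC3, 0xA9, 0x00 };
    return count_code_points(text);
}
```

Here `demo_count()` returns 2: three bytes of input, two code points. Since a typedef introduces no new type, switching the parameter spelling later would not change the function's type or break existing callers.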

     The last bullet can totally be someone else's work! 😁

Sincerely,
JeanHeyd

Received on 2020-12-08 10:21:42