Re: [wg14/wg21 liaison] [SG16] [ WG14 ] Mixed Wide String Literals

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 8 Dec 2020 23:39:01 -0500
On 12/8/20 11:21 AM, JeanHeyd Meneide via SG16 wrote:
> Dear Tom,
>
> On Sun, Dec 6, 2020 at 5:41 PM Tom Honermann <tom_at_[hidden]> wrote:
>
> On 12/4/20 10:48 AM, JeanHeyd Meneide via SG16 wrote:
>> BUT!
>>
>> WG14 also wants to look into -- after removing it -- spending
>> some time coming up with a well-defined combination mechanism. We would
>> like to have ill-formedness (C++) / constraint violations (C) if the
>> conversion to the final chosen encoding does not work, to give people
>> good guarantees about e.g. synthesizing Unicode into a wide string
>> literal or synthesizing wide string data into Unicode literals.
>
> Was any particular use case discussed? Or just a general
> preference to, given something like this:
>
> #define NAME u8"foo"
> #define VERSION "5"
>
> void emit(const char16_t*);
>
> to be able to do something like this:
>
> emit(u"" NAME "-" VERSION);
>
>
> This, and the ability to take "implementation defined" text in
> either L"" or "" strings (e.g., from external headers you don't
> control or generated code) and force them to be a Unicode encoding by
> doing something similar to the above -- prefixing with a u"". I think there's
> some utility to be had here when it comes to interfacing with legacy
> code and wanting to make sure macro expansions and code generation can
> be done well. This probably applies a lot more for C than C++, where
> references and `constexpr` arrays/string_views are generally used over
> macros and direct string literals.
>
> My best thought so far is that "first prefix is the chosen
> encoding". This also means you can write Unicode String literals and
> attempt to merge them into the desired narrow character set or wide
> character set (for compile-time). This has the advantage that, if
> you know the data is going to go into a function that takes a
> `char*` or a `wchar_t*`, you can get a string literal into that
> encoding. Because we're also making the behavior well-defined, we can
> specify it as a constraint violation / ill-formed if the code points
> in any of the trailing string tokens are not representable in the
> first-specified encoding. (The usual "escaped from safety" rules for
> "\x" sequences would apply here too.)
>
> That would be good to get on the agenda for the next WG14
> meeting. It would be great if we can start making progress on all
> of the following issues! My understanding is that any proposals
> targeting C2X must be proposed by August 27th, 2021.
>
> * WG14: Make char16_t/char32_t string literals be UTF-16/32
> <https://github.com/sg16-unicode/sg16/issues/54>
> * WG14 N2231: char8_t: A type for UTF-8 characters and strings
> <https://github.com/sg16-unicode/sg16/issues/5>
> * WG14: Improve support for Unicode characters in identifiers
> <https://github.com/sg16-unicode/sg16/issues/56>
>
> I can do the first bullet (the C version for C++'s p1041).
Excellent!
>
> I sort of already am helping with the second bullet. C approved
> me pursuing Non-UTF and UTF conversions
> (https://thephd.github.io/_vendor/future_cxx/papers/C%20-%20Efficient%20UTF%20Character%20Conversions.html
> and
> https://thephd.github.io/_vendor/future_cxx/papers/C%20-%20Efficient%20Character%20Conversions.html).
> In these function signatures, I used "unsigned char" for the c8
> functions; they did not approve any wording yet, but approved of the
> direction. I'm going to be attempting to get my C library, and
> implementations in musl-libc and glibc going in 2021. This means that
> when `char8_t` shows up as a typedef and the string literals are fixed
> up, everything can work out nicely and we can seamlessly change from
> unsigned char to char8_t in those functions!
Cool. I really do hope to get a revision of N2231
<http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2231.htm> wrapped up soon.
>
> The last bullet can totally be someone else's work! 😁

Hmm, who might be a good candidate for that one? Steve? Steve. Steve! 😈

Tom.

>
> Sincerely,
> JeanHeyd
>


Received on 2020-12-08 22:39:05
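
For reference, a minimal sketch of the part of the "first prefix is the chosen encoding" idea that is already portable: when only one of the adjacent literals carries an encoding prefix, the concatenated result takes that prefix. It reuses the NAME and VERSION macros from the quoted example, but assumes NAME is unprefixed here (the original has u8"foo"); the variable name `label` is hypothetical. The mixed u"" + u8"" form is shown only as a comment, since its handling is exactly what the thread is debating and it is not portable.

```cpp
#include <string_view>

// Assumption for this sketch: both macros expand to unprefixed literals.
#define NAME "foo"
#define VERSION "5"

// Adjacent literals are concatenated in translation phase 6. Only the
// first one has a prefix (u), so the whole literal is a char16_t string.
constexpr std::u16string_view label = u"" NAME "-" VERSION;

// Five characters plus the terminating null, each a char16_t.
static_assert(sizeof(u"" NAME "-" VERSION) == 6 * sizeof(char16_t));

// Not portable (the case discussed in the thread), so left commented out:
//   #define NAME8 u8"foo"
//   constexpr std::u16string_view mixed = u"" NAME8 "-" VERSION;
```

Under the rule proposed in the thread, the commented-out line would either produce a char16_t string or be rejected if some code point were not representable; that outcome is the proposal, not current behavior.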