sg16: Re: [SG16] Agenda for the 2021-12-01 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Sat, 4 Dec 2021 19:04:48 -0500

On 12/4/21 6:05 AM, Jens Maurer wrote:
> On 04/12/2021 09.26, Corentin Jabot wrote:
>>
>> On Sat, Dec 4, 2021, 01:04 Tom Honermann <tom_at_[hidden]> wrote:
>> I believe this requirement is already the status quo. Let me provide a better example than I did previously.
>>
>> std::format("<text>");
>>
>> If the literal encoding is not self-synchronizing, then <text> may contain code units that match the (single) code unit for '{' but that do not encode the '{' character. This can happen with DBCS or shift-state encodings. An implementation needs to be able to recognize this case (for affected encodings) in order to avoid incorrectly interpreting the text as containing an introducer for a replacement field.
>>
>>
>> I am well aware.
>> I wonder if we understood that fully (compile-time support and code point semantics were decisions taken at about the same time, independently of one another). I do not recall realizing that we were asking for full-blown constexpr code point decoding.
> If we impose a requirement for a code unit -> code point decoder for the
> literal encoding at compile-time, we should make such a facility generally
> available instead of hiding it in the guts of the std::format parser.
I think JeanHeyd's work on P1629 <https://wg21.link/p1629> will fill
this niche. It would be nice if the features he proposes in N2730
<http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2730.htm> were usable
at compile-time as well, but that will likely have to await some kind of
constexpr support in C.
>
> That probably means at least a constexpr mbrlen and/or mbrtowc
> with an assumed locale fitting the literal encoding.
> Hm... It seems mbrlen returning 1 might not be as helpful as it
> seems, because we could be in a foreign shift state where
> code unit value == '}' does not actually mean a '}' is encoded.
A constexpr mbrlen() would at least prevent matching a trailing code
unit, but yes, the mbstate_t object would also have to be consulted to
determine which character is actually encoded.
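
As a rough illustration of the trailing code unit problem only (not the shift-state one), here is a minimal constexpr sketch. It assumes a Shift-JIS-style DBCS in which lead bytes fall in 0x81-0x9F or 0xE0-0xFC and '{' is the single code unit 0x7B. The names is_lead_byte, find_open_brace, find_open_brace_naive, and sample are hypothetical and are not taken from any implementation or proposal.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Assumption: Shift-JIS-style lead byte ranges. Other DBCS encodings differ.
constexpr bool is_lead_byte(unsigned char c) {
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

// Naive scan: matches any code unit equal to 0x7B, including trail bytes.
constexpr std::size_t find_open_brace_naive(std::string_view s) {
    return s.find('\x7B');
}

// Encoding-aware scan: skips the trail byte that follows each lead byte, so
// only a 0x7B that stands alone as a single-byte character is reported.
constexpr std::size_t find_open_brace(std::string_view s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const auto c = static_cast<unsigned char>(s[i]);
        if (is_lead_byte(c)) {
            ++i;  // step over the trail byte; the loop increment moves past it
            continue;
        }
        if (c == 0x7B) {
            return i;
        }
    }
    return std::string_view::npos;
}

// One double-byte character whose trail byte is 0x7B, then a real '{'.
inline constexpr std::string_view sample("\x83\x7B\x7B", 3);

static_assert(find_open_brace_naive(sample) == 1);  // false match on the trail byte
static_assert(find_open_brace(sample) == 2);        // the actual '{'
```

A stateful encoding would additionally need the current shift state tracked during the scan, which this sketch does not attempt.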
>
>> I think I'd like to get input from implementers.
>> If I understand this MSVC PR correctly, compile-time support for non-UTF-8 multibyte encodings is limited:
>> https://github.com/microsoft/STL/pull/2221
>>
>> To be clear, I am not opposed to the direction, but I am reluctant to go further down this road without implementer support. We are asking a lot.
> Yes.
>
>> For reasons, the work to add EBCDIC to clang has a home-grown encoder, for example, as clang cares about environments where iconv is not present.
>> In addition to adding constexpr builtins, this direction would likely mandate that someone write an EBCDIC -> UTF decoder in clang or libc++.
>>
>> It makes me wonder if some of these features should be restricted to u8 formatting strings 😅
> We even have a "u8" string literal prefix to indicate UTF-8 string literals.
> What foresight.

Perhaps some day we'll even be able to pass such strings to std::format()!

Tom.

>
> Jens

Received on 2021-12-04 18:04:51