sg16: Re: [SG16] Agenda for the 2021-12-01 SG16 telecon

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Sat, 4 Dec 2021 12:05:45 +0100

On 04/12/2021 09.26, Corentin Jabot wrote:
>
>
> On Sat, Dec 4, 2021, 01:04 Tom Honermann <tom_at_[hidden]> wrote:

> I believe this requirement is already the status quo. Let me provide a better example than I did previously.
>
> std::format("<text>");
>
> If the literal encoding is not self-synchronizing then <text> may contain code units that correspond to the (single) code unit for '{' but that do not encode the '{' character. This can happen due to DBCS or shift-state encoding. An implementation needs to be able to recognize this case (for affected encodings) in order to avoid incorrectly interpreting the text as containing an introducer for a replacement field.
>
>
> I am well aware.
> I wonder if we understood that fully (compile-time support and codepoint semantics were decisions taken at about the same time, independently of one another). I do not recall realizing that we were asking for a full-blown constexpr codepoint decoder.

If we impose a requirement for a code unit -> code point decoder for the
literal encoding at compile-time, we should make such a facility generally
available instead of hiding it in the guts of the std::format parser.

That probably means at least a constexpr mbrlen and/or mbrtowc
with an assumed locale fitting the literal encoding.
Hm... mbrlen returning 1 might not be as helpful as it
seems, because we could be in a foreign shift state where
code unit value == '}' does not actually mean a '}' is encoded.
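
To illustrate the problem, here is a minimal sketch that assumes Shift-JIS
as the literal encoding. The helper names (is_sjis_lead, count_brace_units,
count_brace_chars) are hypothetical and the lead-byte ranges are simplified;
this is not how any standard library implements it, and a stateful
(shift-state) encoding would additionally need to track the current state.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: simplified Shift-JIS lead-byte test (assumption:
// lead bytes are 0x81-0x9F and 0xE0-0xFC; real tables are more detailed).
constexpr bool is_sjis_lead(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Naive scan: counts every code unit whose value equals '{'.
constexpr std::size_t count_brace_units(std::string_view s) {
    std::size_t n = 0;
    for (char c : s)
        if (c == '{') ++n;
    return n;
}

// Decoding scan: skips the trail byte of each double-byte character,
// so only code units that really encode '{' are counted.
constexpr std::size_t count_brace_chars(std::string_view s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const auto b = static_cast<unsigned char>(s[i]);
        if (is_sjis_lead(b)) { ++i; continue; }  // consume the trail byte too
        if (s[i] == '{') ++n;
    }
    return n;
}

// 0x83 0x7B is one double-byte character whose trail byte has the same
// value as '{' (0x7B); the following "{}" is a real replacement field.
inline constexpr std::string_view sample{"\x83\x7B{}", 4};

static_assert(count_brace_units(sample) == 2);  // naive scan over-counts
static_assert(count_brace_chars(sample) == 1);  // decoding scan is right
```

The second function is the kind of decoding a std::format parser would
have to do at compile time for such an encoding.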

> I think I'd like to get input from implementers.
> If I understand this MSVC PR correctly, compile-time support for non-UTF-8 multibyte encodings is limited:
> https://github.com/microsoft/STL/pull/2221
>
> To be clear, I am not opposed to the direction, but I am reluctant to go further down this road without implementers' support. We are asking a lot.

Yes.

> For reasons, the work to add EBCDIC to clang has a home-grown encoder, for example, as clang cares about environments where iconv is not present.
> This direction would likely, in addition to adding constexpr builtins, mandate that someone write an EBCDIC -> UTF decoder in clang or libc++.
>
> It makes me wonder if some of these features should be restricted to u8 formatting strings 😅

We even have a "u8" string literal prefix to indicate UTF-8 string literals.
What foresight.
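
For contrast, a minimal sketch of why u8 literals avoid the issue: in
UTF-8, every code unit of a multi-byte sequence is >= 0x80, so a code
unit equal to 0x7B always encodes '{' and a plain code-unit scan is
enough. The name first_brace is hypothetical.

```cpp
#include <cstddef>

// Plain code-unit scan over a string literal; returns the index of the
// first '{' or the literal's length if there is none. No decoding is
// needed because UTF-8 is self-synchronizing.
template <class CharT, std::size_t N>
constexpr std::size_t first_brace(const CharT (&s)[N]) {
    for (std::size_t i = 0; i + 1 < N; ++i)  // N counts the terminating NUL
        if (s[i] == CharT(0x7B)) return i;
    return N - 1;
}

// U+00E9 is encoded as the two code units 0xC3 0xA9, so '{' is at index 2.
static_assert(first_brace(u8"\u00E9{}") == 2);
```

The template accepts both char (u8 literals before C++20) and char8_t
(C++20 and later).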

Jens

Received on 2021-12-04 05:05:54