On 12/4/21 6:05 AM, Jens Maurer wrote:

On 04/12/2021 09.26, Corentin Jabot wrote:


On Sat, Dec 4, 2021, 01:04 Tom Honermann <tom@honermann.net> wrote:

    I believe this requirement is already the status quo. Let me provide a better example than I did previously.

    std::format("<text>");

    If the literal encoding is not self-synchronizing, then <text> may contain code units that have the same value as the (single) code unit for '{' but that do not encode the '{' character. This can happen with DBCS or shift-state encodings. An implementation needs to be able to recognize this case (for affected encodings) in order to avoid incorrectly interpreting the text as containing an introducer for a replacement field.
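A minimal sketch of the DBCS case described above, assuming Shift-JIS as the literal encoding and an ASCII-compatible value for '{' (0x7B). The function names and the sample string are hypothetical and are not taken from any std::format implementation. Shift-JIS lead bytes occupy 0x81-0x9F and 0xE0-0xFC, and a trail byte may have the value 0x7B, so a scan that only compares code units sees a '{' that is not there.

```cpp
#include <cstddef>
#include <string_view>

// Shift-JIS lead bytes occupy 0x81-0x9F and 0xE0-0xFC.
constexpr bool is_sjis_lead(unsigned char c) {
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

// Naive scan: counts every code unit whose value equals '{'.
constexpr std::size_t count_brace_units(std::string_view s) {
    std::size_t n = 0;
    for (char c : s) {
        if (c == '{') ++n;
    }
    return n;
}

// Encoding-aware scan: skips the trail byte of each double-byte
// character, so only code units that really encode '{' are counted.
constexpr std::size_t count_brace_chars(std::string_view s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_sjis_lead(static_cast<unsigned char>(s[i]))) {
            ++i;  // step over the trail byte as well
            continue;
        }
        if (s[i] == '{') ++n;
    }
    return n;
}

// Hypothetical sample: one double-byte character (0x83 0x7B) whose
// trail byte has the same value as '{', followed by a real "{}".
inline constexpr std::string_view sample{"\x83\x7B{}", 4};

static_assert(count_brace_units(sample) == 2);  // false positive on the trail byte
static_assert(count_brace_chars(sample) == 1);  // only the real '{'
```

Both scans are usable in constant expressions, which is the property a compile-time format string check would need for this encoding.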


I am well aware.
I wonder if we understood that fully (compile-time support and code point semantics were decisions made at about the same time, independently of one another). I do not recall realizing that we were asking for full-blown constexpr code point decoding.

If we impose a requirement for a code unit -> code point decoder for the
literal encoding at compile-time, we should make such a facility generally
available instead of hiding it in the guts of the std::format parser.

I think JeanHeyd's work on P1629 will fill this niche. It would be nice if the features he proposes in N2730 were usable at compile-time as well, but that will likely have to await some kind of constexpr support in C.


That probably means at least a constexpr mbrlen and/or mbrtowc
with an assumed locale fitting the literal encoding.
Hm... It seems mbrlen returning 1 might not be as helpful as it
seems, because we could be in a foreign shift state where
code unit value == '}' does not actually mean a '}' is encoded.

A constexpr mbrlen() would at least prevent matching a trailing code unit, but yes, the mbstate_t object would also have to be consulted to determine which character is actually encoded.
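A rough sketch of why the shift state matters, assuming an ISO-2022-JP style encoding in which ESC $ B selects a double-byte set and ESC ( B returns to ASCII. The hand-written state enum stands in for what an mbstate_t would track; all names here are hypothetical, and this is not how mbrlen or mbrtowc are implemented.

```cpp
#include <cstddef>
#include <string_view>

// Stand-in for the conversion state an mbstate_t would carry.
enum class shift_state { ascii, double_byte };

// Naive scan: counts every code unit whose value equals '}'.
constexpr std::size_t count_close_brace_units(std::string_view s) {
    std::size_t n = 0;
    for (char c : s) {
        if (c == '}') ++n;
    }
    return n;
}

// State-aware scan: a code unit equal to '}' only counts while the
// decoder is in the ASCII shift state.
constexpr std::size_t count_close_braces(std::string_view s) {
    shift_state st = shift_state::ascii;
    std::size_t n = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        if (s[i] == '\x1B' && i + 2 < s.size()) {
            if (s[i + 1] == '$' && s[i + 2] == 'B') {  // ESC $ B: double-byte set
                st = shift_state::double_byte;
                i += 3;
                continue;
            }
            if (s[i + 1] == '(' && s[i + 2] == 'B') {  // ESC ( B: back to ASCII
                st = shift_state::ascii;
                i += 3;
                continue;
            }
        }
        if (st == shift_state::double_byte) {
            i += 2;  // two code units form one character; neither is a '}'
            continue;
        }
        if (s[i] == '}') ++n;
        ++i;
    }
    return n;
}

// Hypothetical sample: the pair '0' '}' is a single double-byte
// character while shifted; only the final '}' is a real brace.
inline constexpr std::string_view shifted = "\x1B$B0}\x1B(B}";

static_assert(count_close_brace_units(shifted) == 2);
static_assert(count_close_braces(shifted) == 1);
```

Note that every code unit in the shifted region lies in the ASCII range, so a length query alone cannot distinguish the two '}' values; the state has to be consulted.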

I think I'd like to get input from implementers.
If I understand this MSVC PR correctly, compile-time support for non-UTF-8 multibyte encodings is limited:
https://github.com/microsoft/STL/pull/2221

To be clear, I am not opposed to the direction, but I am reluctant to go further down this road without implementer support. We are asking a lot.

Yes.

For example, the work to add EBCDIC support to clang uses a home-grown encoder, because clang cares about environments where iconv is not present.
In addition to requiring new constexpr builtins, this direction would likely mandate that someone write an EBCDIC -> UTF decoder in clang or libc++.

It makes me wonder if some of these features should be restricted to u8 formatting strings 😅

We even have a "u8" string literal prefix to indicate UTF-8 string literals.
What foresight.

Perhaps some day we'll even be able to pass such strings to std::format()!

Tom.


Jens