ISOCPP sg16 List: Re: [isocpp-sg16] std::format and charN

From: Tiago Freire <tmiguelf_at_[hidden]>
Date: Fri, 28 Jun 2024 14:29:48 +0000

Hi Ivan,
I'm not entirely sure about "plans", but I believe there's a willingness to provide formatting using different encodings, which would reduce the need for buffer transfers.
You can technically write a custom "output" that would allow you to write directly to a contiguous buffer that you provide, but knowing the size of the buffer in advance is a challenge.

But I understand your point of view: the design of std::formatter is suboptimal. For me it was obsolete the day it came out because of this pain point.
Don't get me wrong, it is still better than what came before, but I have something better:
https://github.com/tmiguelf/utilities/blob/master/CoreLib/include/CoreLib/toPrint/toPrint.hpp

I have been working on an alternative way of formatting, by now quite feature complete, that is capable of doing the whole thing on the stack and formatting into any encoding.
The design is also easy to extend, so you can supply a custom buffer that you allocate once with exactly the size you need.

I would like that type of design to eventually make its way into the standard, but that is going to take some convincing, which isn't easy, and I currently have too much on my plate to start on that journey.
But have a look at it; if something catches your interest, just let me know and I will try to help.

Br,
Tiago

-----Original Message-----
From: SG16 <sg16-bounces_at_lists.isocpp.org> On Behalf Of Ivan Solovev via SG16
Sent: Friday, June 28, 2024 11:04
To: sg16_at_[hidden]
Cc: Ivan Solovev <ivan.solovev_at_[hidden]>
Subject: Re: [isocpp-sg16] std::format and charN_t

Hi,

let me follow up on my original questions.
After reading P3258, we moved forward with providing std::format support for Qt string types. However, we figured out that the implementation is suboptimal, and so we have some questions.

Our implementation is now based on std::formatter<std::string_view, char>, which means that we take a UTF-16 string (which QString is), convert it to UTF-8 using some of our internal methods, and then feed this UTF-8 string into std::formatter<std::string_view, char>::format(), which is then responsible for formatting it (which implies one more copy of the data).

As you can see, there is one extra copy here, which we would like to avoid.
We could do the UTF-16 to UTF-8 conversion code point by code point and copy each converted code point to the output, thus saving the allocation of a UTF-8 string, but IIUC that would require us to completely reimplement the whole parse() and format() logic of the standard formatters.
That's something we obviously want to avoid.

Based on the above, we have the following questions:
* are there any plans to provide the possibility to reserve space in the output?
* are there any plans to provide the possibility to write directly to the output, assuming it's a contiguous buffer?

Another thing which could be really beneficial for us is the possibility to format into a QString (which is a char16_t container).
However, currently the standard only supports formatting into char and wchar_t. Is it going to change? Are there any plans to provide support for formatting into charN_t?

Best regards,
Ivan

------------------------------

Ivan Solovev
Senior Software Engineer

The Qt Company GmbH
Erich-Thilo-Str. 10
12489 Berlin, Germany
ivan.solovev_at_[hidden]
www.qt.io

Geschäftsführer: Mika Pälsi,
Juha Varelius, Jouni Lintunen
Sitz der Gesellschaft: Berlin,
Registergericht: Amtsgericht
Charlottenburg, HRB 144331 B

________________________________________
From: SG16 <sg16-bounces_at_lists.isocpp.org> on behalf of Thiago Macieira via SG16 <sg16_at_lists.isocpp.org>
Sent: Tuesday, June 11, 2024 4:44 PM
To: sg16_at_[hidden]
Cc: Thiago Macieira
Subject: Re: [isocpp-sg16] std::format and charN_t

On Tuesday 11 June 2024 04:00:14 GMT-7 Ivan Solovev via SG16 wrote:
> Hi Corentin,
>
>
> > https://wg21.link/P3258
> > I hope that answers some of your question
>
>
> Thanks for the link. That definitely clarifies things.
>
> As I read it, the paper suggests using std::text_encoding::literal()
> to determine the literal encoding and falling back to the execution
> encoding if we could not get the literal encoding, or if there is no
> converter to the literal encoding. And then transcode all the string
> arguments into this encoding. Is that correct?

Qt requires that the encoding on Unix systems be UTF-8, so we don't need to query that. On Windows, we already know what the local 8-bit encoding is, but as we discussed in our mailing list, nothing but UTF-8 makes sense anyway, so we can also assume that there. Qt requires and assumes that narrow string literals passed anywhere to its functions are UTF-8 and the legacy Windows 8-bit encoding ("ANSI") is only used to interact with a handful of likewise legacy functions. And explicitly not with the terminal, because that's yet another 8-bit encoding.

So the paper above doesn't add new information, but does explain that the direction std::format is going is compatible with our thinking.

I do have one question and maybe some answers. First, the paper says

"for each code unit sequence X [that] is a sequence of ill-formed code units, processing is in order as follows:
[...]
- Otherwise ReplacementCharacter is appended to E."

Is this intended to mandate that the full sequence of ill-formed code units be replaced by a single replacement character? Or could the implementation insert more than one? I'm personally of the opinion that this is GIGO, and therefore we should not mandate precisely how to deal with ill-formed sequences, other than that the output should contain replacement character(s).
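
The difference between the two readings can be illustrated with a small UTF-8 decoder. This is only a sketch: the policy names PerCodeUnit and PerRun are invented here, and neither is claimed to be what the paper specifies. A byte counts as ill-formed when no well-formed sequence starts at it; PerCodeUnit emits one U+FFFD per such byte, while PerRun collapses a run of consecutive ill-formed bytes into a single U+FFFD.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

enum class Replace { PerCodeUnit, PerRun };

// Returns the length of the well-formed UTF-8 sequence starting at s[i]
// and stores its code point in cp, or returns 0 if none starts there.
std::size_t decode_one(std::string_view s, std::size_t i, char32_t& cp)
{
    auto byte = [&s](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    const unsigned char b0 = byte(i);
    std::size_t len = 0;
    char32_t min = 0;
    if (b0 < 0x80) { cp = b0; return 1; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
    else return 0;  // stray continuation byte or invalid lead byte
    if (i + len > s.size()) return 0;  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
        if ((byte(i + k) & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (byte(i + k) & 0x3F);
    }
    // Reject overlong forms, surrogates and values above U+10FFFF.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return len;
}

std::u32string decode_lossy(std::string_view s, Replace policy)
{
    std::u32string out;
    bool in_bad_run = false;
    for (std::size_t i = 0; i < s.size();) {
        char32_t cp = 0;
        if (const std::size_t len = decode_one(s, i, cp)) {
            out += cp;
            i += len;
            in_bad_run = false;
        } else {
            if (policy == Replace::PerCodeUnit || !in_bad_run)
                out += U'\uFFFD';
            in_bad_run = true;
            ++i;  // skip exactly one ill-formed byte and retry
        }
    }
    return out;
}
```

For an input such as 'a', three stray 0x80 bytes, 'b', the first policy yields five code points and the second yields three, so the choice is observable to callers.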

In any case, is the paper trying to explain how to do transcoding?

As for answers, the paper asks in the charN_t formatting section:

> How does to_chars and char8_t interact? Unicode has a large set of numbers.

I don't think any differently than what's already there. It's only a matter of encoding. std::to_chars operates only on the C locale, so any other-locale encoding must by necessity use something else, regardless of the character type in the output.
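
As a small illustration of "only a matter of encoding": std::to_chars produces ASCII digits (and a minus sign) independently of the locale, so getting them into a char16_t string is a plain widening copy. The helper name to_u16_digits is made up for this sketch.

```cpp
#include <charconv>
#include <string>
#include <system_error>

// Converts an int to decimal text in a char16_t string.
// std::to_chars emits only ASCII characters here, and every ASCII
// character has the same numeric value as a UTF-16 code unit.
std::u16string to_u16_digits(int value)
{
    char buf[24];  // comfortably large enough for a 64-bit value with sign
    const auto [end, ec] = std::to_chars(buf, buf + sizeof buf, value);
    if (ec != std::errc{})
        return {};  // not expected with a buffer of this size
    return std::u16string(buf, end);  // element-wise char to char16_t widening
}
```

Locale-aware digits or separators would need a different source of data, which is the orthogonal concern mentioned above.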

So the question raises a valid point, but orthogonal to the encoding output.

> Are existing locale facilities sufficient to support the needs of Unicode?

Probably not, but that's speculation. I don't see anything that we'd need from them in order to have charN_t formatters. The difficult part may be to get the number and money punctuation formatters in the right encoding, but if that's just encoding, we can transcode. There may be some compromises in the locale database itself that were made to fit the std::locale API available at the time: for example, we've had an issue in QLocale where some locales had a multi-character delimiter but our API and implementation only supported a single UTF-32 character, so we dropped the extra information.

> What do we assume the encoding for char and wchar_t to be?

I'd like to answer, "the same as the charN_t of the same size", but for the standard that is not acceptable. We may be able to get away with that for wchar_t: are there any implementations that don't use either UTF-16 or UTF-32 for wchar_t? We may as well bless that it is indeed a size-varying type that is equivalent to the charN_t of equivalent size.

For char, I wish you luck. You probably can't assume anything so there must be encoding-conversion functions and they must not be inline. Therefore, formatting to/from charN_t and char must go through out-of-line transcoding calls.

> What is the implementation burden?

Question is vague. Do you mean for the Standard Library implementers or for library authors providing formatters?

Assuming it's the former (because of the next question), I would say it is a non-negligible effort to get right. For one thing, all of the transcoding must be done out-of-line. For us in Qt, this is actually performance-relevant (not critical, but it does show up in benchmarks), so we need it to be fast and not bloat code by getting inlined everywhere. Implementations will therefore need to feed back into the API, so I see no way of knowing what the burden or the API will be until we try both at the same time.

For example, I have looked into the formatter API but failed to find a way to request from it a buffer of an estimated (upper-bound) size into which to write my output, and only at the end let it know how much I actually wrote. I don't know if I missed something.
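
The kind of interface being asked for could look roughly like the sketch below. Everything here is hypothetical: ReserveCommitSink, reserve(), commit(), put_utf8 and append_utf16 are invented for illustration and are not part of std::format or of any proposal. The writer reserves an upper bound (3 bytes per UTF-16 code unit), transcodes directly into the reserved space, and then reports how much it used.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical sink, NOT a std::format interface.
class ReserveCommitSink {
public:
    // Makes room for up to upper_bound more chars and returns where to write.
    char* reserve(std::size_t upper_bound)
    {
        committed_ = buf_.size();
        buf_.resize(committed_ + upper_bound);
        return buf_.data() + committed_;
    }
    // Keeps only the first `written` chars of the latest reservation.
    void commit(std::size_t written) { buf_.resize(committed_ + written); }
    const std::string& str() const { return buf_; }

private:
    std::string buf_;
    std::size_t committed_ = 0;
};

// Encodes one code point as UTF-8 at p and returns the advanced pointer.
char* put_utf8(char* p, char32_t cp)
{
    if (cp < 0x80) {
        *p++ = static_cast<char>(cp);
    } else if (cp < 0x800) {
        *p++ = static_cast<char>(0xC0 | (cp >> 6));
        *p++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        *p++ = static_cast<char>(0xE0 | (cp >> 12));
        *p++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *p++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        *p++ = static_cast<char>(0xF0 | (cp >> 18));
        *p++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *p++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *p++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return p;
}

// Transcodes UTF-16 into the sink; unpaired surrogates become U+FFFD.
void append_utf16(ReserveCommitSink& sink, std::u16string_view in)
{
    char* const begin = sink.reserve(in.size() * 3);  // estimated upper size
    char* p = begin;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        const bool low = cp >= 0xDC00 && cp <= 0xDFFF;
        if (high && i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (high || low) {
            cp = 0xFFFD;
        }
        p = put_utf8(p, cp);
    }
    sink.commit(static_cast<std::size_t>(p - begin));  // report the actual size
}
```

Backing the sink with std::string is just the simplest choice for a self-contained example; the interesting part is the reserve/commit pair, which lets the writer avoid both a temporary string and a measuring pass.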

> What is the interaction with user-defined formatters?

I see three possibilities:
a) user provided a generic-encoding formatter or explicitly for the charN_t in question, so no further work is needed

b) user provided a wchar_t formatter (and maybe a char one), which may be preferable because we know this to be Unicode

c) user only provided a char formatter

Software written in the 2020s is UTF-8-aware so even in case (c) we ought to be able to use the char formatter and transcode to the required UTF-N format.
The transformation between char and char8_t may also be the identity by vendor's choice, reducing the implementation burden.

I would prefer if the standard left unsaid which of the two character types is used for the formatter before transcoding, but I don't know if that's acceptable.

--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Principal Engineer - Intel DCAI Fleet Systems Engineering

--
SG16 mailing list
SG16_at_[hidden]
https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2024-06-28 14:29:52