ISOCPP sg16 List: Re: [isocpp-sg16] std::format and charN

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Fri, 28 Jun 2024 09:25:45 -0500

On Fri, Jun 28, 2024 at 4:04 AM Ivan Solovev via SG16 <sg16_at_[hidden]>
wrote:

> Hi,
>
> let me follow up on my original questions.
> After reading P3258, we moved forward with providing std::format support
> for Qt string types. However, we figured out that the implementation is
> suboptimal, and so we have some questions.
>
> Our implementation is now based on std::formatter<std::string_view, char>,
> which means that we take a UTF-16 string (which QString is), convert it to
> UTF-8 using some of our internal methods, and then feed this UTF-8 string
> into std::formatter<std::string_view, char>::format(), which is the then
> responsible for formatting it (which implies one more copy of the data).
>
> As you can see, there is one extra copy here, which we would like to avoid.
> We could do the UTF-16 to UTF-8 conversion codepoint-by-codepoint, and then
> copy individual codepoints to the output, thus saving an allocation of
> a UTf-8 string, but IIUC that would require us to completely reimplement
> the whole parse() and format() logic of the standard formatters.
> That's something we obviously want to avoid.
>
> Based on the above, we have the following questions:
> * are there any plans to provide the possibility to reserve space in the
> output?
> * are there any plans to provide the possibility to write directly to the
> output, assuming it's a contiguous buffer?
>
> Another thing which could be really beneficial for us is the possibility
> to format into a QString (which is a char16_t container).
> However, currently the standard only supports formatting into char and
> wchar_t. Is it going to change? Are there any plans to provide support
> for formatting into charN_t?
>

This is an interesting question. You are right this is a 3 pronged problem.
We do make an assumption that the if the literal encoding is utf-8,
arguments are utf-8 encoded for the purpose of
formatting. In that light we could use the same assumption to convert from
UTF-8 to UTF-16.

I think getting consensus on what the the source encoding of that
conversion to UTF-16 (or worse, the wide (execution?) encoding)
is is going to be a bit more challenging. format has that dual nature of
being text but also letting you
inject arbitrary bytes in the output (which of course would be mojibake
when converting, and that's fine but we need to find the design that
minimizes mojibake).

>
> Best regards,
> Ivan
>
> ------------------------------
>
> Ivan Solovev
> Senior Software Engineer
>
> The Qt Company GmbH
> Erich-Thilo-Str. 10
> 12489 Berlin, Germany
> ivan.solovev_at_[hidden]
> www.qt.io
>
> Geschäftsführer: Mika Pälsi,
> Juha Varelius, Jouni Lintunen
> Sitz der Gesellschaft: Berlin,
> Registergericht: Amtsgericht
> Charlottenburg, HRB 144331 B
>
> ________________________________________
> From: SG16 <sg16-bounces_at_[hidden]> on behalf of Thiago Macieira
> via SG16 <sg16_at_[hidden]>
> Sent: Tuesday, June 11, 2024 4:44 PM
> To: sg16_at_[hidden]
> Cc: Thiago Macieira
> Subject: Re: [isocpp-sg16] std::format and charN_t
>
> On Tuesday 11 June 2024 04:00:14 GMT-7 Ivan Solovev via SG16 wrote:
> > Hi Corentin,
> >
> >
> > > https://wg21.link/P3258
> > > I hope that answers some of your question
> >
> >
> > Thanks for the link. That definitely clarifies the things.
> >
> > As I read it, the paper suggests to use std::text_encoding::literal()
> > to determine the literal encoding and fall back to the execution
> encoding,
> > if we could not get the literal encoding, of if there is no converter to
> > literal encoding. And then transcode all the string arguments into this
> > encoding. Is that correct?
>
> Qt requires that the encoding on Unix systems be UTF-8, so we don't need to
> query that. On Windows, we already know what the local 8-bit encoding is,
> but
> as we discussed in our mailing list, nothing but UTF-8 makes sense anyway,
> so
> we can also assume that there. Qt requires and assumes that narrow string
> literals passed anywhere to its functions are UTF-8 and the legacy Windows
> 8-
> bit encoding ("ANSI") is only used to interact with a handful of likewise
> legacy functions. And explicitly not with the terminal, because that's yet
> another 8-bit encoding.
>
> So the paper above doesn't add new information, but does explain that the
> direction std::format is going is compatible with our thinking.
>
> I do have one question and maybe some answers. First, the paper says
>
> "for each code unit sequence X [that] is a sequence of ill-formed code
> units,
> processing is in order as follows:
> [...]
> - Otherwise ReplacementCharacter is appended to E."
>
> Is this intended to mandate that the full sequence of ill-formed code
> units be
> replaced by a single replacement character? Or could the implementation
> insert
> more than one? I'm personally of the opinion that GIGO, therefore we should
> not mandate in precision how to deal with ill-formed sequences, other than
> that it should output replacement character(s).
>
> In any case, is the paper trying to explain how to do transcoding?
>
> As for answers, the paper asks in the charN_t formatting section:
>
> > How does to_chars and char8_t interract? Unicode has a large set of
> numbers.
>
> I don't think any differently than what's already there. It's only a
> matter of
> encoding. std::to_chars operates only on the C locale, so any other-locale
> encoding must by necessity use something else, regardless of the character
> type in the output.
>
> So the question raises a valid point, but orthogonal to the encoding
> output.
>
> > Are existing locale facilities sufficient to support the needs of
> Unicode?
>
> Probably not, but that's speculation. I don't see anything that we'd need
> of
> them in order to have charN_t formatters. The difficult part may be to get
> the
> number and money punctuation formatters in the right encoding, but if
> that's
> just encoding, we can transcode. There may be some compromises in the
> locale
> database itself that were made to fit the std::locale API available at the
> time: for example, we've had an issue in QLocale where some locales had a
> multi-character delimiter but our API and implementation only supported a
> single UTF-32 character, so we dropped the extra information.
>
> > What do we assume the encoding for char and wchar_t to be?
>
> I'd like to answer, "the same as the charN_t of the same size", but for the
> standard that is not acceptable. We may be able to get away with that for
> wchar_t: are there any implementations that don't use either UTF-16 or
> UTF-32
> for wchar_t? We may as well bless that it is indeed a size-varying type
> that
> is equivalent to the charN_t of equivalent size.
>
> For char, I wish you luck. You probably can't assume anything so there
> must be
> encoding-conversion functions and they must not be inline. Therefore,
> formatting to/from charN_t and char must go through out-of-line transcoding
> calls.
>
> > What is the implementation burden?
>
> Question is vague. Do you mean for the Standard Library implementers or for
> library authors providing formatters?
>
> Assuming it's the former (because of the next question), I would say it is
> a
> non-negligible effort to get right. For one thing, all of the transcoding
> must
> be done out-of-line. For us in Qt, this is actually performance-relevant
> (not
> critical, but it does show up in benchmarks), so we need it to be fast and
> not
> bloat code by getting inlined everywhere. Implementations will need
> therefore
> to feed back into API, so I see no way of knowing what the burden or API
> will
> be until we try both at the same time.
>
> For example, I have looked into the formatter API but failed to find a way
> to
> request of it a buffer of an estimated (upper) size into which to write my
> output and only at the end let it know how much I wrote. I don't know if I
> missed something.
>
> > What is the interaction with user-defined formatters?
>
> I see three possibilities:
> a) user provided a generic-encoding formatter or explicitly for the
> charN_t in
> question, so no further work is needed
>
> b) user provided a wchar_t formatter (and maybe a char one), which may be
> preferable because we know this to be Unicode
>
> c) user only provided a char formatter
>
> Software written in the 2020s is UTF-8-aware so even in case (c) we ought
> to
> be able to use the char formatter and transcode to the required UTF-N
> format.
> The transformation between char and char8_t may also be the identity by
> vendor's choice, reducing the implementation burden.
>
> I would prefer if the standard left unsaid which of the two characters is
> used
> as the formatter before transcoding, but I don't know if that's acceptable.
>
> --
> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> Principal Engineer - Intel DCAI Fleet Systems Engineering
>
>
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2024-06-28 14:26:06