Re: [isocpp-sg16] std::format and charN_t

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Fri, 28 Jun 2024 07:09:13 -0700
> are there any plans to provide the possibility to reserve space in the
> output?
> are there any plans to provide the possibility to write directly to the
> output, assuming it's a contiguous buffer?

Not yet. {fmt} has internal mechanisms for doing this but they need some
work to make them ready for general use.

- Victor

On Fri, Jun 28, 2024 at 2:04 AM Ivan Solovev via SG16 <sg16_at_[hidden]>
wrote:

> Hi,
>
> let me follow up on my original questions.
> After reading P3258, we moved forward with providing std::format support
> for Qt string types. However, we figured out that the implementation is
> suboptimal, and so we have some questions.
>
> Our implementation is now based on std::formatter<std::string_view, char>,
> which means that we take a UTF-16 string (which QString is), convert it to
> UTF-8 using some of our internal methods, and then feed this UTF-8 string
> into std::formatter<std::string_view, char>::format(), which is then
> responsible for formatting it (which implies one more copy of the data).
>
> As you can see, there is one extra copy here, which we would like to avoid.
> We could do the UTF-16 to UTF-8 conversion codepoint-by-codepoint, and then
> copy individual codepoints to the output, thus saving an allocation of
> a UTF-8 string, but IIUC that would require us to completely reimplement
> the whole parse() and format() logic of the standard formatters.
> That's something we obviously want to avoid.
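
A minimal sketch of the codepoint-by-codepoint conversion described above, writing UTF-8 bytes straight to an output iterator so that no intermediate UTF-8 string is allocated. The function name `utf16_to_utf8` is hypothetical and is not Qt's internal converter. Emitting one U+FFFD per unpaired surrogate is just one possible policy for ill-formed input.

```cpp
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical helper (not part of Qt or the standard library): transcodes
// UTF-16 code units to UTF-8 and writes the bytes directly to `out`.
template <class OutputIt>
OutputIt utf16_to_utf8(const std::u16string& in, OutputIt out) {
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {            // high surrogate
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                // Combine the surrogate pair into one code point.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;                           // unpaired high surrogate
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;                               // unpaired low surrogate
        }
        if (cp < 0x80) {                               // 1-byte sequence
            *out++ = static_cast<char>(cp);
        } else if (cp < 0x800) {                       // 2-byte sequence
            *out++ = static_cast<char>(0xC0 | (cp >> 6));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {                     // 3-byte sequence
            *out++ = static_cast<char>(0xE0 | (cp >> 12));
            *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        } else {                                       // 4-byte sequence
            *out++ = static_cast<char>(0xF0 | (cp >> 18));
            *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *out++ = static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Inside a formatter, `out` would presumably be the format context's output iterator. The sketch does not handle width, fill or precision, which is exactly the part of parse() and format() that would have to be reimplemented.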
>
> Based on the above, we have the following questions:
> * are there any plans to provide the possibility to reserve space in the
> output?
> * are there any plans to provide the possibility to write directly to the
> output, assuming it's a contiguous buffer?
>
> Another thing which could be really beneficial for us is the possibility
> to format into a QString (which is a char16_t container).
> However, currently the standard only supports formatting into char and
> wchar_t. Is it going to change? Are there any plans to provide support
> for formatting into charN_t?
>
> Best regards,
> Ivan
>
> ------------------------------
>
> Ivan Solovev
> Senior Software Engineer
>
> The Qt Company GmbH
> Erich-Thilo-Str. 10
> 12489 Berlin, Germany
> ivan.solovev_at_[hidden]
> www.qt.io
>
> Geschäftsführer: Mika Pälsi,
> Juha Varelius, Jouni Lintunen
> Sitz der Gesellschaft: Berlin,
> Registergericht: Amtsgericht
> Charlottenburg, HRB 144331 B
>
> ________________________________________
> From: SG16 <sg16-bounces_at_[hidden]> on behalf of Thiago Macieira
> via SG16 <sg16_at_[hidden]>
> Sent: Tuesday, June 11, 2024 4:44 PM
> To: sg16_at_[hidden]
> Cc: Thiago Macieira
> Subject: Re: [isocpp-sg16] std::format and charN_t
>
> On Tuesday 11 June 2024 04:00:14 GMT-7 Ivan Solovev via SG16 wrote:
> > Hi Corentin,
> >
> >
> > > https://wg21.link/P3258
> > > I hope that answers some of your questions
> >
> >
> > Thanks for the link. That definitely clarifies things.
> >
> > As I read it, the paper suggests using std::text_encoding::literal()
> > to determine the literal encoding and falling back to the execution
> > encoding if we could not get the literal encoding, or if there is no
> > converter to the literal encoding. And then transcoding all the string
> > arguments into this encoding. Is that correct?
>
> Qt requires that the encoding on Unix systems be UTF-8, so we don't need to
> query that. On Windows, we already know what the local 8-bit encoding is,
> but as we discussed in our mailing list, nothing but UTF-8 makes sense
> anyway, so we can also assume that there. Qt requires and assumes that
> narrow string literals passed anywhere to its functions are UTF-8 and the
> legacy Windows 8-bit encoding ("ANSI") is only used to interact with a
> handful of likewise legacy functions. And explicitly not with the
> terminal, because that's yet another 8-bit encoding.
>
> So the paper above doesn't add new information, but does explain that the
> direction std::format is going is compatible with our thinking.
>
> I do have one question and maybe some answers. First, the paper says
>
> "for each code unit sequence X [that] is a sequence of ill-formed code
> units, processing is in order as follows:
> [...]
> - Otherwise ReplacementCharacter is appended to E."
>
> Is this intended to mandate that the full sequence of ill-formed code
> units be replaced by a single replacement character? Or could the
> implementation insert more than one? I'm personally of the opinion that
> GIGO applies, so we should not mandate precisely how to deal with
> ill-formed sequences, other than that it should output replacement
> character(s).
>
> In any case, is the paper trying to explain how to do transcoding?
>
> As for answers, the paper asks in the charN_t formatting section:
>
> > How does to_chars and char8_t interact? Unicode has a large set of
> > numbers.
>
> I don't think any differently than what's already there. It's only a
> matter of encoding. std::to_chars operates only on the C locale, so any
> other-locale encoding must by necessity use something else, regardless of
> the character type in the output.
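
To illustrate the "only a matter of encoding" point: std::to_chars emits only ASCII digits and a minus sign for integers, so widening its output to char16_t is a plain per-character copy. The helper name `int_to_utf16` below is hypothetical, and the buffer size assumes `long long` is 64 bits wide.

```cpp
#include <charconv>
#include <string>
#include <system_error>

// Hypothetical helper: formats an integer as UTF-16 text. std::to_chars
// produces only ASCII digits and '-', so each char maps to one code unit.
inline std::u16string int_to_utf16(long long value) {
    char buf[24];  // enough for a 64-bit value in base 10, including the sign
    const std::to_chars_result res = std::to_chars(buf, buf + sizeof buf, value);
    if (res.ec != std::errc{}) {
        return {};  // not expected with this buffer size
    }
    // The iterator-range constructor converts each char to char16_t.
    return std::u16string(buf, res.ptr);
}
```

Locale-specific digits or grouping separators are outside what std::to_chars does, whatever the output character type is.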
>
> So the question raises a valid point, but one that is orthogonal to the
> output encoding.
>
> > Are existing locale facilities sufficient to support the needs of
> > Unicode?
>
> Probably not, but that's speculation. I don't see anything that we'd need
> of them in order to have charN_t formatters. The difficult part may be to
> get the number and money punctuation formatters in the right encoding, but
> if that's just encoding, we can transcode. There may be some compromises
> in the locale database itself that were made to fit the std::locale API
> available at the time: for example, we've had an issue in QLocale where
> some locales had a multi-character delimiter but our API and
> implementation only supported a single UTF-32 character, so we dropped the
> extra information.
>
> > What do we assume the encoding for char and wchar_t to be?
>
> I'd like to answer, "the same as the charN_t of the same size", but for the
> standard that is not acceptable. We may be able to get away with that for
> wchar_t: are there any implementations that don't use either UTF-16 or
> UTF-32 for wchar_t? We may as well bless that it is indeed a size-varying
> type that is equivalent to the charN_t of equivalent size.
>
> For char, I wish you luck. You probably can't assume anything so there
> must be encoding-conversion functions and they must not be inline.
> Therefore, formatting to/from charN_t and char must go through out-of-line
> transcoding calls.
>
> > What is the implementation burden?
>
> Question is vague. Do you mean for the Standard Library implementers or for
> library authors providing formatters?
>
> Assuming it's the former (because of the next question), I would say it is
> a non-negligible effort to get right. For one thing, all of the
> transcoding must be done out-of-line. For us in Qt, this is actually
> performance-relevant (not critical, but it does show up in benchmarks), so
> we need it to be fast and not bloat code by getting inlined everywhere.
> Implementations will therefore need to feed back into API, so I see no way
> of knowing what the burden or API will be until we try both at the same
> time.
>
> For example, I have looked into the formatter API but failed to find a way
> to request of it a buffer of an estimated (upper) size into which to write
> my output and only at the end let it know how much I wrote. I don't know
> if I missed something.
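
The pattern being asked for (reserve an upper bound, write in place, commit the real length at the end) can be sketched outside the formatter API with a plain std::string. The name `append_with_upper_bound` and the writer signature are assumptions made for illustration. Nothing like this exists in std::format today.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: grows `out` by `upper_bound` bytes, lets `writer`
// fill the new space directly, then keeps only the bytes actually written.
// `writer` takes (char* dest, std::size_t capacity) and returns the count used.
template <class Writer>
void append_with_upper_bound(std::string& out, std::size_t upper_bound, Writer writer) {
    const std::size_t old_size = out.size();
    out.resize(old_size + upper_bound);  // value-initializes the new bytes
    const std::size_t written = writer(&out[old_size], upper_bound);
    // Commit: shrink to what was really produced (clamped to the bound).
    out.resize(old_size + (written < upper_bound ? written : upper_bound));
}
```

For a UTF-16 to UTF-8 conversion, three bytes per UTF-16 code unit is a safe upper bound. C++23's std::basic_string::resize_and_overwrite offers a similar shape without the up-front zero fill, but it applies to strings, not to a formatter's output iterator.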
>
> > What is the interaction with user-defined formatters?
>
> I see three possibilities:
> a) user provided a generic-encoding formatter or one explicitly for the
> charN_t in question, so no further work is needed
>
> b) user provided a wchar_t formatter (and maybe a char one), which may be
> preferable because we know this to be Unicode
>
> c) user only provided a char formatter
>
> Software written in the 2020s is UTF-8-aware so even in case (c) we ought
> to be able to use the char formatter and transcode to the required UTF-N
> format. The transformation between char and char8_t may also be the
> identity by vendor's choice, reducing the implementation burden.
>
> I would prefer if the standard left unsaid which of the two character
> types is used as the formatter before transcoding, but I don't know if
> that's acceptable.
>
> --
> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> Principal Engineer - Intel DCAI Fleet Systems Engineering
>
>
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2024-06-28 14:09:28