Re: [isocpp-sg16] std::format and charN_t

From: Thiago Macieira <thiago_at_[hidden]>
Date: Tue, 11 Jun 2024 07:44:20 -0700
On Tuesday 11 June 2024 04:00:14 GMT-7 Ivan Solovev via SG16 wrote:
> Hi Corentin,
>
>
> > https://wg21.link/P3258
> > I hope that answers some of your question
>
>
> Thanks for the link. That definitely clarifies things.
>
> As I read it, the paper suggests using std::text_encoding::literal()
> to determine the literal encoding, and falling back to the execution
> encoding if we could not get the literal encoding or if there is no
> converter to the literal encoding. And then transcode all the string
> arguments into this encoding. Is that correct?

Qt requires that the encoding on Unix systems be UTF-8, so we don't need to
query that. On Windows, we already know what the local 8-bit encoding is, but
as we discussed in our mailing list, nothing but UTF-8 makes sense anyway, so
we can also assume that there. Qt requires and assumes that narrow string
literals passed anywhere to its functions are UTF-8; the legacy Windows
8-bit encoding ("ANSI") is only used to interact with a handful of likewise
legacy functions, and explicitly not with the terminal, because that's yet
another 8-bit encoding.

So the paper above doesn't add new information, but does explain that the
direction std::format is going is compatible with our thinking.

I do have one question and maybe some answers. First, the paper says

"for each code unit sequence X [that] is a sequence of ill-formed code units,
processing is in order as follows:
[...]
- Otherwise ReplacementCharacter is appended to E."

Is this intended to mandate that the full sequence of ill-formed code units be
replaced by a single replacement character? Or could the implementation insert
more than one? I'm personally of the opinion that this is GIGO territory, so
we should not mandate precisely how to deal with ill-formed sequences, other
than that the output should contain replacement character(s).
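To make the two policies concrete, here is a minimal sketch (not from the paper; the function name and the "one U+FFFD per maximal ill-formed run" policy are my own illustration) of a UTF-8 cleaner. A stricter implementation could instead emit one replacement character per ill-formed code unit, which is exactly the latitude in question:

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Illustrative sketch: replace each maximal run of ill-formed UTF-8 code
// units with a single U+FFFD (encoded as "\xEF\xBF\xBD"). A full validator
// would also reject overlong forms and surrogate code points; this only
// checks lead/continuation structure and truncation.
std::string replace_ill_formed(std::string_view in) {
    std::string out;
    auto is_cont = [](unsigned char c) { return (c & 0xC0) == 0x80; };
    for (size_t i = 0; i < in.size();) {
        unsigned char c = in[i];
        size_t len = c < 0x80               ? 1
                   : (c & 0xE0) == 0xC0     ? 2
                   : (c & 0xF0) == 0xE0     ? 3
                   : (c & 0xF8) == 0xF0     ? 4 : 0;
        bool ok = len != 0 && i + len <= in.size();
        for (size_t j = 1; ok && j < len; ++j)
            ok = is_cont(in[i + j]);
        if (ok) {
            out.append(in.substr(i, len));   // copy the well-formed sequence
            i += len;
        } else {
            out += "\xEF\xBF\xBD";           // one replacement for the run
            ++i;                             // skip the offending byte
            while (i < in.size() && is_cont(in[i]))
                ++i;                         // and any stray continuations
        }
    }
    return out;
}
```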

In any case, is the paper trying to explain how to do transcoding?

As for answers, the paper asks in the charN_t formatting section:

> How do to_chars and char8_t interact? Unicode has a large set of numbers.

I don't think they interact any differently than what's already there. It's
only a matter of encoding. std::to_chars operates only in the C locale, so any
other-locale formatting must by necessity use something else, regardless of
the character type in the output.

So the question raises a valid point, but one orthogonal to the output
encoding.
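As a sketch of why the charN_t side is trivial here: std::to_chars always emits the C-locale (ASCII) digits regardless of the global locale, so producing char16_t output from it is a plain 1:1 widening copy, not real transcoding. The helper name below is made up for illustration:

```cpp
#include <cassert>
#include <charconv>
#include <string>

// to_chars writes ASCII digits into a char buffer; since every ASCII code
// unit maps to the identical UTF-16 code unit, widening is a per-unit copy.
std::u16string to_u16chars(int v) {
    char buf[16];
    auto [end, ec] = std::to_chars(buf, buf + sizeof buf, v);
    assert(ec == std::errc{});
    return std::u16string(buf, end);  // each ASCII code unit widens 1:1
}
```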

> Are existing locale facilities sufficient to support the needs of Unicode?

Probably not, but that's speculation. I don't see anything that we'd need of
them in order to have charN_t formatters. The difficult part may be to get the
number and money punctuation formatters in the right encoding, but if that's
just encoding, we can transcode. There may be some compromises in the locale
database itself that were made to fit the std::locale API available at the
time: for example, we've had an issue in QLocale where some locales had a
multi-character delimiter but our API and implementation only supported a
single UTF-32 character, so we dropped the extra information.
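The single-character limitation described above is visible in the standard facet API itself: std::numpunct hands back separators as one char_type apiece, so a locale whose real group separator is a multi-character (or multi-code-unit) string cannot pass through it faithfully. A small sketch, using the classic "C" locale:

```cpp
#include <locale>

// thousands_sep() returns exactly one char_type; there is no overload that
// can return a multi-character separator string through this facet.
char classic_thousands_sep() {
    return std::use_facet<std::numpunct<char>>(std::locale::classic())
        .thousands_sep();
}
```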

> What do we assume the encoding for char and wchar_t to be?

I'd like to answer, "the same as the charN_t of the same size", but for the
standard that is not acceptable. We may be able to get away with that for
wchar_t: are there any implementations that don't use either UTF-16 or UTF-32
for wchar_t? We may as well bless it as a size-varying type that is
equivalent to the charN_t of the same size.

For char, I wish you luck. You probably can't assume anything so there must be
encoding-conversion functions and they must not be inline. Therefore,
formatting to/from charN_t and char must go through out-of-line transcoding
calls.

> What is the implementation burden?

Question is vague. Do you mean for the Standard Library implementers or for
library authors providing formatters?

Assuming it's the former (because of the next question), I would say it is a
non-negligible effort to get right. For one thing, all of the transcoding must
be done out-of-line. For us in Qt, this is actually performance-relevant (not
critical, but it does show up in benchmarks), so we need it to be fast and not
bloat code by getting inlined everywhere. Implementation experience will
therefore need to feed back into the API design, so I see no way of knowing
what the burden or the API will be until we try both at the same time.

For example, I have looked into the formatter API but failed to find a way to
request from it a buffer of an estimated (upper-bound) size into which to
write my output, reporting only at the end how much I actually wrote. I don't
know if I missed something.

> What is the interaction with user-defined formatters?

I see three possibilities:
a) user provided a generic-encoding formatter or explicitly for the charN_t in
question, so no further work is needed

b) user provided a wchar_t formatter (and maybe a char one), which may be
preferable because we know this to be Unicode

c) user only provided a char formatter

Software written in the 2020s is UTF-8-aware, so even in case (c) we ought to
be able to use the char formatter and transcode to the required UTF-N format.
The transformation between char and char8_t may also be the identity, at the
vendor's choice, reducing the implementation burden.

I would prefer the standard to leave unsaid which of the two character types
is used as the formatter before transcoding, but I don't know if that's
acceptable.

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
  Principal Engineer - Intel DCAI Fleet Systems Engineering

Received on 2024-06-11 14:44:26