[isocpp-sg16] Runtime behaviors should not require knowledge of the literal encoding

From: Corentin <corentin.jabot_at_[hidden]>
Date: Fri, 13 Feb 2026 19:19:02 +0100
Hey folks,

This is a follow up to the discussion we had this week in the context
of P3876R0 (Extending <charconv> support to more character types).

I objected (and still do) to some wording, namely:
> The output code points are inserted into the range [first, last) by
> encoding them in the respective literal encoding for character literals of
> the type of *first.
(I think the paper has two instances of that wording.)

The status quo is that the compiler encodes strings in an encoding
described in the core wording (the literal encoding).
During execution, the library assumes another encoding, the execution
encoding.

By the letter of the law, these things are completely unrelated today.
Of course this is wrong, and we should fix it:
https://isocpp.org/files/papers/P3671R0.pdf

But even if we admit a relation, that relation is not a relation of
equivalence.
It is still common for the two encodings to differ in how they represent
characters outside the basic literal character set.
This is the case, for example, for ISO 8859 and the various EBCDIC code
pages.

So if we admit that the execution encoding need not be exactly the literal
encoding,
it is strange to talk about the literal encoding in the library at all.

So, we should talk about something else.
And because the execution encoding is locale-dependent, and because we do
not want from_chars and to_chars to be locale-dependent, we should talk
about the execution encoding of the "C" locale (i.e. the encoding known by
the library to be associated with the non-locale locale). There is
precedent for this in both C and C++.

Because P1880 went nowhere, we should also specify that the encoding
associated with char8_t is UTF-8. Ideally we would put all of that wording
in [library.general] by introducing a term of art.

For example:

The locale-independent text encoding associated with a type T is:
  - the narrow execution encoding associated with the "C" locale if T is
    cv char*, string_view, or string;
  - the wide execution encoding associated with the "C" locale if T is
    cv wchar_t*, wstring_view, or wstring;
  - UTF-8 if T is cv char8_t*, u8string_view, or u8string;
  - ...

Then we can use that definition in the wording of to_chars, from_chars.
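
To make the shape of such a term of art concrete, here is a minimal sketch
of how it could be modelled as a trait. All names (text_encoding_kind,
locale_independent_encoding, encoding_of) are hypothetical and exist only
for illustration; this is not proposed wording, pointer types are not
handled, and the entries elided by "..." are deliberately left out.

```cpp
#include <string>
#include <string_view>
#include <type_traits>

// Hypothetical illustration; these names are not proposed wording.
enum class text_encoding_kind {
    narrow_c_locale,  // narrow execution encoding of the "C" locale
    wide_c_locale,    // wide execution encoding of the "C" locale
    utf8
};

// Primary template left undefined: the remaining entries of the list are
// elided in the message above and are not guessed here.
template <class CharT>
struct locale_independent_encoding;

template <>
struct locale_independent_encoding<char> {
    static constexpr text_encoding_kind value =
        text_encoding_kind::narrow_c_locale;
};

template <>
struct locale_independent_encoding<wchar_t> {
    static constexpr text_encoding_kind value =
        text_encoding_kind::wide_c_locale;
};

#if defined(__cpp_char8_t)
// char8_t only exists as a distinct type from C++20 onwards.
template <>
struct locale_independent_encoding<char8_t> {
    static constexpr text_encoding_kind value = text_encoding_kind::utf8;
};
#endif

// Lookup for string and string_view specializations via their value_type.
template <class StringLike>
constexpr text_encoding_kind encoding_of =
    locale_independent_encoding<typename StringLike::value_type>::value;
```

With this sketch, encoding_of<std::string> names the narrow "C" locale
encoding and encoding_of<std::wstring_view> the wide one, without any
reference to the literal encoding.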


One could argue that it does not matter for these two functions.
Indeed, these families of functions consume and produce characters that are
in the basic character set so if you assume a world where P3671 is adopted,
their representation will always be the same in the literal and execution
encodings.
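
As a small illustration of that point, the sketch below formats an integer
with std::to_chars and checks that every character produced is one of '-',
a digit, or a lowercase letter, all of which are in the basic character
set. The helper names format_int and only_basic_characters are made up for
this example. Note that comparing the output against a string literal is
only meaningful because the literal and execution encodings agree on these
characters.

```cpp
#include <charconv>
#include <limits>
#include <string>
#include <string_view>
#include <system_error>

// Formats an int with std::to_chars and returns the characters produced.
// Returns an empty string if the conversion fails.
std::string format_int(int value, int base = 10) {
    // Worst case is base 2: one digit per value bit, plus one more digit
    // for the minimum value, plus the sign.
    char buffer[std::numeric_limits<int>::digits + 2];
    auto result = std::to_chars(buffer, buffer + sizeof buffer, value, base);
    if (result.ec != std::errc{}) {
        return {};
    }
    return std::string(buffer, result.ptr);
}

// True if every character is one that integer to_chars can emit:
// a minus sign, a digit, or a lowercase letter.
bool only_basic_characters(std::string_view text) {
    constexpr std::string_view allowed =
        "-0123456789abcdefghijklmnopqrstuvwxyz";
    for (char c : text) {
        if (allowed.find(c) == std::string_view::npos) {
            return false;
        }
    }
    return true;
}
```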

However, this is only true of these functions. The paper's proposed
wording relies too much on happenstance and cannot be generalized to other
functions, such as the C character classification functions.

In a world where we do not admit P3671, it would make from_chars/to_chars
inconsistent with strto_/ato_, which seems undesirable.
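
For reference, this is the kind of consistency users can rely on today: a
rough sketch, with made-up helper names, that parses the same bytes with
std::from_chars and with std::strtol and expects the same value. It only
covers plain base-10 integers with no leading whitespace or '+' sign,
since the two functions already differ on those inputs.

```cpp
#include <cerrno>
#include <charconv>
#include <cstdlib>
#include <optional>
#include <string>
#include <system_error>

// Parses the whole string as a base-10 long with std::from_chars.
std::optional<long> parse_from_chars(const std::string& text) {
    long value = 0;
    const char* first = text.data();
    const char* last = first + text.size();
    auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{} || ptr != last) {
        return std::nullopt;  // not a number, out of range, or trailing junk
    }
    return value;
}

// Parses the whole string as a base-10 long with std::strtol.
std::optional<long> parse_strtol(const std::string& text) {
    errno = 0;
    char* end = nullptr;
    long value = std::strtol(text.c_str(), &end, 10);
    if (errno != 0 || end == text.c_str() || *end != '\0') {
        return std::nullopt;  // out of range, nothing parsed, or trailing junk
    }
    return value;
}
```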

It might seem a bit academic, but these are real implementation concerns.
As proposed by the paper, an implementation would need to have
encoding/decoding tables for whatever the literal encoding is, which is not
the case today.
And I'd rather have wording that can be reused and exhibits a consistent
model than "let's use the literal encoding for to_chars because 0 is in
the basic character set and infinity is spelled INF rather than ∞ by printf
so we are fine".

I want to reiterate that the wording in P3876R0 is very novel indeed.

Victor is correct that std::format does something similar to P3876R0 in a
couple of places, i.e. the escaping of strings and in [time.format].
We should tweak these.

However, neither width estimation (which just asks the implementation to
assume that the string is in some encoding that the implementation has to
pick) nor printing does so. Printing just says "if the literal encoding is
UTF-8, then assume the string is UTF-8 and use vprint_unicode", which is
consistent with P3671 and perfectly fine: this is a behavior decided at
compile time, not a runtime behavior.
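
To illustrate what "decided at compile time" means here, this is a minimal
sketch of one common way to detect a UTF-8 ordinary literal encoding and
branch on it with if constexpr. The names literal_encoding_is_utf8,
print_path, and select_print_path are hypothetical; a real implementation
may detect this differently, and the sketch assumes the literal encoding
can represent U+00E9 at all.

```cpp
// True when the ordinary literal encoding encodes U+00E9 as the two UTF-8
// code units C3 A9 (plus the terminating null).
constexpr bool literal_encoding_is_utf8() {
    constexpr const char sample[] = "\u00e9";
    return sizeof(sample) == 3 &&
           static_cast<unsigned char>(sample[0]) == 0xC3 &&
           static_cast<unsigned char>(sample[1]) == 0xA9;
}

// Hypothetical stand-ins for the vprint_unicode / vprint_nonunicode split.
enum class print_path { unicode, nonunicode };

// The choice is made by the compiler; no encoding table is consulted at
// run time.
constexpr print_path select_print_path() {
    if constexpr (literal_encoding_is_utf8()) {
        return print_path::unicode;
    } else {
        return print_path::nonunicode;
    }
}
```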

Other parts of the standard correctly refer to the execution encoding of
the "C" locale, or refer to strings produced at compile time (reflection,
contracts).

(Regardless of what we do, from_chars will remain locale-independent, and,
iff we adopt P3671, there will be no observable behavior difference.)

Cheers.

Received on 2026-02-13 18:19:24