ISOCPP sg16 List: Re: Follow up on SG16 review of P2996R2 (Reflection for C++26)

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 2 May 2024 17:25:50 -0400

On 4/30/24 2:31 AM, Corentin Jabot via SG16 wrote:
>
>
> On Tue, Apr 30, 2024 at 12:45 AM Tom Honermann <tom_at_[hidden]> wrote:
>
> On 4/29/24 4:11 PM, Peter Dimov via SG16 wrote:
> > Tom Honermann wrote:
> >> I'm not entirely sure that cout << std::format("{}", u8"...") is
> >> that much easier to specify and support.
> >>
> >> But I'll be glad to be proven wrong, of course. :-)
> >>
> >> There is a relevant SO comment
> >> <https://stackoverflow.com/questions/58878651/what-is-the-printf-formatting-character-for-char8-t/58895428#58895428>.
> >>
> >> std::format() and std::print(), to some extent, improve the
> >> likelihood that an implementation selected encoding will be a good
> >> match for the programmer's intent. This is because:
> >>
> >> 1. std::format() and std::print() are not implicitly locale
> >> dependent; that rules out selection of a locale dependent
> >> execution encoding.
> >> 2. std::format() returns a std::string; that rules out selection
> >> of an I/O dependent encoding.
> >> 3. std::print() writes to an I/O stream, but has special behavior
> >> for writes to a terminal; that rules out selection of a terminal
> >> encoding (as unnecessary, at least in important cases).
> >> 4. std::format() and std::print() are both strongly associated
> >> with the ordinary/wide literal encoding.
> >> 5. std::format() and std::print() should have the same behavior
> >> (other than that std::print(...) may produce a better result than
> >> std::cout << std::format(...) when the output is directed to a
> >> terminal).
> >> 6. std::format() and std::print() have additional guarantees when
> >> the ordinary/wide literal encoding is a UTF encoding.
> >>
> >>
> >> Due to those characteristics, we have good motivation for implicit
> >> use of the ordinary/wide literal encoding as the target for
> >> transcoding for std::format() and std::print().
> > I'm afraid that I don't quite understand.
> >
> > What does std::format( "{}", u8"..." ) actually do? I suppose it
> > transcodes the UTF-8 input into the narrow literal encoding
> > (replacing irrepresentable characters with '?' instead of throwing,
> > I presume, or it would be not very usable)?
>
> We'll have to see what Corentin proposes :)
>
> But yes, something very much like that.
>
> Note that we could also support std::format("{:L}", u8"...") to
> enable a programmer to explicitly request transcoding to a locale
> dependent encoding (either now or at some future point).
>
> (Corentin, at a minimum, we should reserve the L option in your
> paper).
>
>
> We have an opportunity to not conflate locale and encodings here.
As much as I would like that to be the case, I don't think it is.
> u8"" is a known quantity here, it's utf-8.
> But the target is also a known quantity, we very clearly decided it to
> be the literal encoding, because we need to parse it, and
> we wisely decided to assume a literal encoding. So the target encoding
> is also a known quantity

Unfortunately, that isn't the case when a programmer opts in to use of a
locale. Consider the following when the literal encoding is any ASCII
derived encoding and the global locale encoding is EUC-JP (ujis).

#include <chrono>
#include <format>
#include <iostream>
#include <locale>
int main() {
   std::locale::global(std::locale(""));
   std::cout << std::format("{:L}\n", std::chrono::August);
}

The resulting string will be formed from the literal encoding (for the
'\n' character) and the name of the month provided by the /formatting
locale <http://eel.is/c++draft/time.format#2>/. Nothing ensures that the
latter is converted to the literal encoding. Further, a validly encoded
string is produced so long as the characters used in the format string
are from the basic literal character set.

In my environment (Linux, using a pre-release build of Clang 19 and
libc++), compiling the above with the default literal encoding (UTF-8)
and running it with LANG=ja_JP.ujis produces output in EUC-JP as
expected; note the iconv invocation.

$ clang++ -std=c++23 -stdlib=libc++ t.cpp -o t
$ LANG=ja_JP.ujis ./t | iconv -f ujis -t utf-8
  8月

(yes, that is the right output; it is conventional for some translations
of month names to include the month number before the localized name).

Long time SG16 participants will recall P2372R3 (Fixing locale handling
in chrono formatters) <https://wg21.link/p2372r3> and LWG 3547
<https://wg21.link/lwg3547>. There was relevant discussion during the
2021-04-28 SG16 meeting
<https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2021.md#april-28th-2021>.

I have vague recollections of discussions about requiring that locale
dependent translations be provided in the literal encoding when it is a
UTF one, but I haven't been able to identify any such recorded
discussion. I don't see anything in the current WP that would require this.

Based on the above, I think that, at a minimum, the "L" option should be
reserved.

>
>
>
> >
> > And then we just fall back to std::cout << "...", where the "..."
> > is in the narrow literal encoding and hence we assume works, more
> > or less.
> Correct.
> >
> > And we don't want to make std::cout << u8"..." do that, because it
> > can, in principle, do better?
> Not because it can do better, but because there is more uncertainty
> about what the user might expect. If the user writes std::cout <<
> std::format(...), then that is an explicit opt in to the behavior
> that std::format() exhibits. But they might also want to just write
> UTF-8 bytes unmodified regardless of what the ordinary literal
> encoding is. Or they might expect implicit transcoding to either the
> current locale or the environment locale or even the terminal
> locale. By not providing a default behavior, we give the programmer
> the opportunity to think about what they are actually trying to do.
>
>
> I don't quite buy this argument.
> When cout << 42.0; outputs "42,0", the text nature, locale and
> encodings were made for us.
> If the programmer wants to be creative, one can consider io manipulators.

Consider printing of other localized names as in the example above.

#include <chrono>
#include <format>
#include <iostream>
#include <iomanip>
#include <locale>
int main() {
   std::cout << "Default locale: '" << std::cout.getloc().name() << "'\n";
   std::cout << std::chrono::August << "\n";
   std::cout.imbue(std::locale(""));
   std::cout << "Environment locale: '" << std::cout.getloc().name()
             << "'\n";
   std::cout << std::chrono::August << "\n";
   std::cout.imbue(std::locale("ja_JP.utf8"));
   std::cout << "Explicit locale: '" << std::cout.getloc().name() << "'\n";
   std::cout << std::chrono::August << "\n";
}

I get the following output running that locally with LANG=ja_JP.ujis.
Note the mojibake and corresponding substitution of replacement characters.

Default locale: 'C'
Aug
Environment locale: ''
  8��
Explicit locale: 'ja_JP.utf8'
  8月

The (well recognized) problem with iostreams is the implicit use of the
imbued locale. The consistent behavior for iostreams would be that
inserters and extractors for charN_t would transcode to the encoding of
the imbued locale. But that doesn't work well at all in the common case
where no locale has been explicitly imbued.

Making a choice for std::format() is simpler because the programmer
chooses the locale behavior on a per-argument basis; there is a good
default.

> >
> > But let me get back to your list.
> >
> >> 1. std::format() and std::print() are not implicitly locale
> >> dependent; that rules out selection of a locale dependent
> >> execution encoding.
> > What is in a locale-dependent execution encoding in std::cout <<
> > u8"..."?
> iostreams implicitly consults either an imbued locale facet or the
> global locale for formatting operations. Think about std::cout <<
> std::chrono::Sunday. Depending on the current locale, this might
> print "Sun" or a localized weekday name in a locale dependent
> encoding.
>
>
> But again, the only thing we care about for u8 is the encoding.
> And I am not aware of std::locale ever impacting that.
I hope the above examples are motivating.
>
> >
> >> 2. std::format() returns a std::string; that rules out selection
> >> of an I/O dependent encoding.
> > Same question. Where is the I/O dependent encoding in std::cout
> > << u8"..." (that is not also present in std::cout <<
> > some_std_string)?
> In the latter case, we have to assume that some_std_string holds
> text in the encoding expected on the other end of the stream. We
> can't do that for u8"...", so we have to transcode to something (or
> have some other assurance that UTF-8 is intended and expected).
> >
> >> 3. std::print() writes to an I/O stream, but has special behavior
> >> for writes to a terminal; that rules out selection of a terminal
> >> encoding (as unnecessary, at least in important cases).
>
> > This doesn't apply here, because we're using std::format.
>
>
> Right, this is one of the reasons I feel less compelled to pursue
> iostream surgery.
> Output behavior is suboptimal on windows, and unlikely to be fixed.

I am likewise not compelled to pursue iostream support.

Agreed with later remarks below.

Tom.

> >> 5. std::format() and std::print() should have the same behavior
> >> (other than that std::print(...) may produce a better result than
> >> std::cout << std::format(...) when the output is directed to a
> >> terminal).
> > OK... but this isn't relevant.
> The above two are relevant because we wouldn't want to differentiate
> behavior for formatting a u8"..." argument for std::format() vs
> std::print(). The latter helps to constrain the reasonable options
> for the former.
>
> Right, print just does format and output the result
>
> >
> >> 6. std::format() and std::print() have additional guarantees when
> >> the ordinary/wide literal encoding is a UTF encoding.
> > What additional guarantees, and how do they help here?
>
> We specify additional constraints for fill characters, display width
> (well, normative encouragement), and formatting of escaped strings.
> None of these are relevant for reflection purposes; they help to
> reinforce a choice to depend on the ordinary/wide literal encoding
> for behavior of these functions. We don't have such precedent for
> iostreams.
>
>
> And you know, the format string is parsed in the ordinary encoding and
> copied as-is
>
>
> Tom.
>
>

Received on 2024-05-02 21:25:55