Re: Follow up on SG16 review of P2996R2 (Reflection for C++26)

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 3 May 2024 17:22:35 -0400
On 4/30/24 6:17 AM, Peter Dimov via SG16 wrote:
> Jens Maurer wrote:
>> Hm... We currently specify that std::fstream considers the imbued locale's
>> encoding, but we seem to say nothing about std::cout. Even though one might
>> reasonably expect that it also considers the imbued locale to perform
>> transcoding to the output.
> It's a bit vague but std::cout is https://eel.is/c++draft/narrow.stream.objects#3
>
> "The object cout controls output to a stream buffer associated with the object
> stdout, declared in <cstdio>."
>
> which strongly implies a `filebuf` that writes to `stdout`, even though it's not
> required to be literally that and can be e.g. of type __stdout_streambuf.
>
> std::cout is actually the easy case. std::cout << x, for any x, must serialize x
> into a sequence of `char`, which it then passes to its streambuf; the streambuf
> uses codecvt::out to transcode, but codecvt<char, char, mbstate_t> is a no-op.
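
The no-op claim for codecvt<char, char, mbstate_t> can be checked directly. The following is a minimal sketch, assuming the classic "C" locale; the helper names narrow_codecvt_is_noop and narrow_out_reports_noconv are hypothetical and exist only for illustration.

```cpp
#include <cwchar>
#include <locale>

// Hypothetical helper: asks the codecvt<char, char, mbstate_t> facet of a
// locale whether it ever performs a conversion.
bool narrow_codecvt_is_noop(const std::locale& loc = std::locale::classic()) {
    using facet_t = std::codecvt<char, char, std::mbstate_t>;
    const facet_t& cvt = std::use_facet<facet_t>(loc);
    return cvt.always_noconv();
}

// Hypothetical helper: calls codecvt::out on a short char sequence and
// reports whether the facet answered "noconv" (nothing to transcode).
bool narrow_out_reports_noconv(const std::locale& loc = std::locale::classic()) {
    using facet_t = std::codecvt<char, char, std::mbstate_t>;
    const facet_t& cvt = std::use_facet<facet_t>(loc);

    const char in[] = "Hello";
    char out[sizeof in] = {};
    std::mbstate_t state{};
    const char* in_next = nullptr;
    char* out_next = nullptr;

    const std::codecvt_base::result r =
        cvt.out(state, in, in + 5, in_next, out, out + sizeof out, out_next);
    return r == std::codecvt_base::noconv;
}
```

Both helpers only show that the facet itself does not transcode; they say nothing about what encoding the bytes handed to the streambuf are in.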
>
> So in the "normal" case of nothing imbued, and from the fact that
>
> std::cout << "Hello, world!" << std::endl;
>
> is expected to work, we can deduce that characters in the literal encoding
> end up in the streambuf and then are written to stdout, with no translation.
>
> And since in
>
> std::cout << "Hello, " << u8"world!" << std::endl;
>
> the characters "Hello, " and the result of serialization of u8"world!" to char[]
> end up in the same char[] buffer, with no associated metadata to tell the
> streambuf which specific `char` is in what encoding, we can further deduce
> that the serialized u8"world!" has to consist of characters in the literal
> encoding (or a superset of it.)
>
> There's simply no other option.

As I replied elsewhere, I don't agree.

>
> std::wcout << L"Привет!" << std::endl (where the wide literal encoding is
> UTF-16, but the narrow literal encoding is ISO-8859-1) is the hard case. But
> I think we've given up on that.

We (or at least I) haven't given up on that yet. This is tracked by SG16
issue #33 <https://github.com/sg16-unicode/sg16/issues/33>.

I don't think ISO-8859-1 is problematic; none of the characters that it
can encode correspond to characters outside the BMP. The problematic
case is UTF-8. Interestingly, I have so far been unable to produce a
test case that behaves badly on any implementation. I'm not done
exploring that yet, but so far it appears that all of the
implementations have std::mbstate_t implementations that are adequate
to make this work (and they use them).
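
To illustrate why conversion state matters when the narrow encoding is UTF-8: a character outside the BMP arrives as two UTF-16 code units, so a converter that is fed one code unit at a time has to remember the first until the second shows up. The following is a minimal sketch of that idea only. Utf16ToUtf8State, append_utf8, and feed are hypothetical names, and the struct is a stand-in for whatever an implementation actually stores in std::mbstate_t; it is not taken from any implementation.

```cpp
#include <string>

// Hypothetical stand-in for the state an implementation might keep in
// std::mbstate_t between calls.
struct Utf16ToUtf8State {
    char16_t pending_high = 0;  // non-zero while a high surrogate awaits its partner
};

// Appends the UTF-8 encoding of one code point (assumed to be a valid scalar value).
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Feeds a single UTF-16 code unit. A high surrogate produces no output and is
// stored in the state; the following low surrogate completes the code point.
// Returns false if the sequence is ill-formed.
bool feed(Utf16ToUtf8State& st, char16_t unit, std::string& out) {
    const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
    const bool is_low = unit >= 0xDC00 && unit <= 0xDFFF;

    if (st.pending_high != 0) {
        if (!is_low) return false;  // high surrogate not followed by a low one
        const char32_t cp = 0x10000
            + ((static_cast<char32_t>(st.pending_high) - 0xD800) << 10)
            + (static_cast<char32_t>(unit) - 0xDC00);
        st.pending_high = 0;
        append_utf8(out, cp);
        return true;
    }
    if (is_high) {
        st.pending_high = unit;  // remember it; nothing can be emitted yet
        return true;
    }
    if (is_low) return false;  // lone low surrogate
    append_utf8(out, unit);
    return true;
}
```

With ISO-8859-1 as the target, the pending_high member would never be needed, since every character that encoding can represent is a single UTF-16 code unit.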

>
> The hypothetical u8cout (whose streambuf is basic_streambuf<char8_t>)
> will of course do the opposite, pass through u8"..." and transcode "...", but
> we can worry about that when we get it, which will be never.
>
> TL;DR
>
> std::cout << "prefix " << u8"..." << " suffix\n";
>
> and
>
> std::cout << std::format("prefix {} suffix\n", u8"...");
>
> are equivalent, and the same reasoning applies to both. In both cases,
> the narrow literals and the u8 literal are serialized to a single char[]
> buffer, in the narrow literal encoding.
>
> And once that is done, the imbued locale, if any, is applied to both in
> the exact same manner.

The imbued std::codecvt facet, yes. That isn't the concern I have. The
concern is localized text produced by std::format() and/or iostream
inserters that is not encoded in the ordinary literal encoding.

Tom.

Received on 2024-05-03 21:22:38