Re: Follow up on SG16 review of P2996R2 (Reflection for C++26)

From: Tom Honermann <tom_at_[hidden]>
Date: Fri, 3 May 2024 16:50:24 -0400
On 5/2/24 7:10 PM, Peter Dimov via SG16 wrote:
>> On Thu, May 2, 2024 at 11:25 PM Tom Honermann <tom_at_[hidden]> wrote:
>>
>> The (well recognized) problem with iostreams is the implicit use of the
>> imbued locale. The consistent behavior for iostreams would be that inserters
>> and extractors for charN_t would transcode to the encoding of the imbued
>> locale.
> The _streams_ do not transcode using the codecvt facet of the imbued
> locale. The inserter of `char const*`, for instance, doesn't transcode using
> codecvt. It passes the NTCS to the streambuf as-is. (*)
>
> https://eel.is/c++draft/ostream.inserters.character#4
>
> The _streambuf_ then transcodes using the codecvt facet of the imbued
> locale.
>
> https://eel.is/c++draft/filebuf#general-7
> https://eel.is/c++draft/filebuf#virtuals-10

I agree; that is how iostreams is intended to work.
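
As a minimal sketch of the first step (not taken from the thread), the hypothetical helper `bytes_seen_by_streambuf` below inserts a `char const*` into a stream backed by a `std::stringbuf` and returns what the buffer stored. A `std::stringbuf` performs no codecvt conversion, so the result should be the inserter's output before any streambuf-level transcoding.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (illustration only): returns the bytes that the
// `char const*` inserter hands to the stream buffer.
std::string bytes_seen_by_streambuf(const char* ntcs) {
    std::ostringstream os;  // backed by std::stringbuf, which does not use codecvt
    os << ntcs;             // the inserter passes the NTCS through unchanged
    return os.str();
}
```

For example, passing the two bytes `"\x8C\x8E"` (the Shift-JIS encoding of 月) should return the same two bytes; any conversion would have to happen later, in a buffer such as `std::filebuf`.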

The scenario that I've been discussing is independent of std::codecvt.
The transcoding that I'm arguing for would not use a std::codecvt facet;
it would use the encoding associated with the locale (which is known by
the implementation, but not exposed by either the C locale interface or
by std::locale).
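
For illustration only: on POSIX systems the codeset name of a locale can be queried with `nl_langinfo(CODESET)`, which is a POSIX extension rather than part of the C or C++ standard library. The helper name `locale_codeset` below is hypothetical, the returned names are platform-specific, and the sketch assumes a POSIX environment.

```cpp
#include <clocale>
#include <langinfo.h>  // POSIX only: nl_langinfo, CODESET
#include <string>

// Hypothetical helper: returns the codeset name of the named locale, or an
// empty string if that locale is not available. Not thread-safe, because it
// temporarily changes the process-wide C locale.
std::string locale_codeset(const char* name) {
    const char* prev = std::setlocale(LC_CTYPE, nullptr);
    const std::string saved = prev ? prev : "C";  // copy before it is overwritten
    if (std::setlocale(LC_CTYPE, name) == nullptr) {
        return {};
    }
    const std::string codeset = nl_langinfo(CODESET);
    std::setlocale(LC_CTYPE, saved.c_str());      // restore the previous locale
    return codeset;
}
```

Calling `locale_codeset("")` reports the codeset of the locale selected by the environment (for example via `LANG`); the exact spelling of the name varies between implementations.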

>
> From the fact that everyone expects inserting an NTCS in the literal encoding
> to work:
>
> std::cout << "Hello, world!" << std::endl;
>
> we can deduce that the streambuf takes a character sequence in the
> literal encoding, which it then transcodes using codecvt.

Unfortunately, no, that is demonstrably not correct.

    $ cat t.cpp
    #include <chrono>
    #include <iostream>
    #include <locale>
    int main() {
       std::cout.imbue(std::locale(""));
       std::cout << "In the month of " << std::chrono::August << "\n";
    }

    $ clang++ -std=c++23 -stdlib=libc++ t.cpp -o t

    $ LANG=ja_JP.sjis ./t
    In the month of 8��

    $ LANG=ja_JP.sjis ./t | iconv -f shift-jis -t utf-8
    In the month of 8月

The above produced valid Shift-JIS output (the mojibake is because my
terminal expected UTF-8; hence the second run with output filtered by
iconv).

I acknowledge the point about the streambuf performing transcoding using
the imbued std::codecvt facet. In my example, that transcoding was a
no-op (assuming the std::cout streambuf even uses it; as already discussed).

If I had been writing to a stream with a std::filebuf buffer and an
imbued std::codecvt facet that was not a no-op, that facet would have
had to expect a character sequence in the locale encoding for the
intended output to be reliably produced.

We can deduce the following:

 1. When the imbued locale is the "C" locale, the streambuf receives a
    character sequence in the ordinary literal encoding.
 2. When the imbued locale uses a different encoding, the streambuf
    receives a character sequence in the locale-dependent encoding.

The second case requires that literals written to the stream use only
characters that have consistent representation in the locale dependent
encoding in order to avoid mojibake.

>
> Therefore, the inserter needs to produce a character sequence in the literal
> encoding. It can't transcode to the final encoding using codecvt, because the
> streambuf will transcode a second time, ruining everything.
I agree it can't transcode to a final encoding.
>
> Therefore, the inserters of `char8_t const*`, `char16_t const*` and `char32_t
> const*` need to transcode from UTF-8, UTF-16 and UTF-32, respectively, to a
> character sequence in the literal encoding (or a superset of it), which is then
> fed to the streambuf (which will then transcode using codecvt).
Per above, I strongly disagree with that when the locale is not the "C"
locale.
>
> That's coincidentally exactly what inserting the result of std::format does
> after Corentin's proposed additions.

Not always. An expression like std::cout << std::format(...) <<
std::chrono::August consults multiple distinct locale objects. In a
program operating with a locale other than the "C" locale, the
programmer must know what they are doing to get a consistently encoded
result. This is motivation for not mixing std::format() output with use
of other ostream inserters; just use std::format().

Fortunately, the std::format() design puts control where it belongs: in
a place where the programmer can opt in to use of the locale in the
precise places they need it.

>
> (*) Well, technically it does "transcode" from char to the stream type using
> ctype::widen, but that's useless for multibyte encodings, so we can reasonably
> assume that widening a char to a char is the identity, or a literal encoding of
> UTF-8 would stand no chance of working.
>
> So:
>
> input -> inserter -> character sequence in literal encoding -> streambuf ->
> output in final encoding determined by locale codecvt

Unfortunately, no.

Tom.

Received on 2024-05-03 20:50:28