Date: Sat, 4 May 2024 00:41:43 -0400
On 5/3/24 8:42 PM, Peter Dimov wrote:
> Tom Honermann wrote:
>> We can deduce the following:
>>
>> 1. When the imbued locale is the "C" locale, the streambuf receives a
>> character sequence in the ordinary literal encoding.
>> 2. When the imbued locale is a different encoding, the streambuf receives
>> a character sequence in the locale dependent encoding.
>>
>> The second case requires that literals written to the stream use only characters
>> that have consistent representation in the locale dependent encoding in order
>> to avoid mojibake.
> I see what you are saying, but I don't think this is what we want to support
> going forward.
It isn't a matter of wanting to support it. I feel a strong obligation
to ensure existing locale dependent code continues to be maintainable.
There are multiple ecosystems that depend on it.
>
> You are saying that (assuming narrow literal encoding UTF-8) this doesn't work
>
> std::cout << std::chrono::August << "に" << std::endl;
>
> when LC_TIME=ja_JP.sjis, but we can hypothetically make this work
>
> std::cout << std::chrono::August << u8"に" << std::endl;
>
> by having the ostream transcode the UTF-8 literal into Shift-JIS.
We /could/ do that, yes. But...
>
> I don't think we should do that.
I'm not convinced we should either.
> I think that these two statements, when the
> narrow literal encoding is UTF-8, must do the exact same thing.
I very much appreciate why. I agree that it seems crazy that two string
literals, both UTF-8 encoded and with the same contents, would produce
different results.
The problem is that the first example does not reliably produce
consistently encoded output. If it did, then there would be no question
as to what the second example should do.
There isn't any way to make the first case "just work" since we have no
way to differentiate ordinary string literals in one encoding vs strings
in the locale encoding; iostreams is constrained to just copying the
bytes and hoping the programmer understood the consequences. However,
consistently encoded output that very likely matches the programmer's
intent is produced when the locale encoding is "C" or UTF-8 and it is
worth noting that the implicit transcoding behavior I described would
produce the same results in those cases. The only cases in which the
results would differ are exactly those where the first case produces
inconsistently encoded output. The reason we can "fix" the second case
is because the encoding ambiguities have been resolved.
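To make the inconsistency concrete, here is a minimal sketch of the bytes that would reach the streambuf in the first example. It assumes the ja_JP.sjis locale formats August as "8月" (an assumption; the actual text depends on the implementation's locale data) and that the ordinary literal encoding is UTF-8. The names is_valid_utf8, month_sjis and literal_utf8 are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Minimal UTF-8 well-formedness check. It rejects stray continuation
// bytes and truncated sequences; it does not check for overlong forms
// or surrogates, which is enough for this illustration.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len = c < 0x80          ? 1
                        : (c >> 5) == 0x06  ? 2
                        : (c >> 4) == 0x0E  ? 3
                        : (c >> 3) == 0x1E  ? 4
                                            : 0;
        if (len == 0 || i + len > s.size()) return false;
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char t = static_cast<unsigned char>(s[i + k]);
            if ((t & 0xC0) != 0x80) return false;
        }
        i += len;
    }
    return true;
}

// Assumed locale output for August: "8月" in Shift-JIS is 38 8C 8E.
const std::string month_sjis = "8\x8C\x8E";

// The ordinary literal "に" when the literal encoding is UTF-8: E3 81 AB.
const std::string literal_utf8 = "\xE3\x81\xAB";

// What the streambuf receives: both byte sequences, copied as-is.
const std::string stream_bytes = month_sjis + literal_utf8;
```

The literal on its own is valid UTF-8 and the month on its own is valid Shift-JIS, but the concatenation is neither consistently UTF-8 nor the Shift-JIS text the programmer intended (in Shift-JIS, "に" would be 82 C9).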
I think it is worth pointing out that the problems with the first
example are not specific to UTF-8. If the ordinary literal encoding was
Windows-1252, the string literal was "¤", and the locale encoding was
Windows-1255, then the "¤" character might be interpreted as "₪"
instead; likely contrary to the programmer's intent.
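As a small sketch of that single-byte case: the byte 0xA4 maps to U+00A4 (¤) in Windows-1252 and to U+20AA (₪) in Windows-1255. The CodePage enum and decode_byte function below are hypothetical helpers that model only this one byte plus ASCII; they are not a real transcoding API.

```cpp
#include <cassert>

// Hypothetical, deliberately tiny model of two code pages.
enum class CodePage { Windows1252, Windows1255 };

// Decode one byte to a Unicode code point. Only ASCII and 0xA4 are
// modelled; everything else yields U+FFFD (REPLACEMENT CHARACTER).
char32_t decode_byte(unsigned char b, CodePage cp) {
    if (b < 0x80) return static_cast<char32_t>(b);
    if (b == 0xA4) {
        return cp == CodePage::Windows1252 ? char32_t{0x00A4}   // ¤
                                           : char32_t{0x20AA};  // ₪
    }
    return char32_t{0xFFFD};
}
```

The same byte written by the program therefore reads as a different character depending on which encoding the consumer assumes, and nothing in the byte stream records which one was meant.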
>
> And so should these two:
>
> std::wcout << std::chrono::August << "に" << std::endl;
> std::wcout << std::chrono::August << u8"に" << std::endl;
>
> I don't believe using the locale encoding for the intermediate representation
> of the character sequences passed to the streambuf is sound, and I don't think
> trying to support this case will lead us anywhere useful.
This already is, and has been for a long time, the status quo as
demonstrated.
I demonstrated the problem using std::chrono, but the same encoding
confusion occurs with older iostream manipulators like std::put_time().
>
> The future we want is narrow literal encoding of UTF-8, with the streambuf
> receiving character sequences in UTF-8, with the final encoding produced by
> the codecvt facet in the streambuf.
That is already the case when the locale is "C" or uses UTF-8. When
neither of those is the case, this goal is already not achievable
exactly because iostreams implicitly uses the locale.
Applications on Windows that want to override the environment locale
encoding and always use UTF-8 can do so by calling
std::locale::global(std::locale(".utf8")) and imbuing that locale in
std::cout and std::cerr.
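A minimal sketch of that setup follows. The helper name use_locale is hypothetical. The ".utf8" locale name is the one mentioned above for Windows; on other platforms that name may not exist, in which case std::locale's constructor throws std::runtime_error, so the sketch reports failure rather than assuming success.

```cpp
#include <cassert>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

// Try to install the named locale globally and imbue it in the
// standard narrow output streams. Returns false if the platform
// does not recognise the locale name.
bool use_locale(const char* name = ".utf8") {
    try {
        std::locale loc(name);      // throws std::runtime_error if unknown
        std::locale::global(loc);   // affects streams created afterwards
        std::cout.imbue(loc);       // existing streams must be imbued explicitly
        std::cerr.imbue(loc);
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

Calling use_locale() early in main, before any output is written, keeps the streams from switching locale partway through.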
>
> The locale categories in that future determine the month names, but not
> their encoding.
We can't get there without breaking existing code.
>
> I don't quite know how we get there, but I'm pretty sure transcoding UTF-8
> to Shift-JIS in the inserters isn't how.
We have a good option for std::format(). For iostreams, the status quo
is that the programmer has to explicitly code their intent. If we want
to provide implicit behavior, we can't ignore the locale encoding
concerns without introducing inconsistency in the library.
Tom.
Received on 2024-05-04 04:41:49