ISOCPP sg16 List: Re: [isocpp-sg16] Follow up on SG16 review of P2996R2 (Reflection for C++26)

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Wed, 8 May 2024 09:48:30 -0700

> I have vague recollections of discussions about requiring that locale
> dependent translations be provided in the literal encoding when it is a UTF
> one, but I haven't been able to identify any such recorded discussion. I
> don't see anything in the current WP that would require this.

https://eel.is/c++draft/time.format#3 is the relevant part of the standard.
See also
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2419r2.html. The
locale output should be transcoded to the literal encoding (at least in the
common case of UTF-8).

HTH,
Victor

On Thu, May 2, 2024 at 2:25 PM Tom Honermann via SG16 <sg16_at_[hidden]>
wrote:

> On 4/30/24 2:31 AM, Corentin Jabot via SG16 wrote:
>
>
>
> On Tue, Apr 30, 2024 at 12:45 AM Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 4/29/24 4:11 PM, Peter Dimov via SG16 wrote:
>> > Tom Honermann wrote:
>> >> I'm not entirely sure that cout << std::format("{}", u8"...") is that much
>> >> easier to specify and support.
>> >>
>> >> But I'll be glad to be proven wrong, of course. :-)
>> >>
>> >> There is a relevant SO comment
>> >> <https://stackoverflow.com/questions/58878651/what-is-the-printf-formatting-character-for-char8-t/58895428#58895428> .
>> >>
>> >> std::format() and std::print(), to some extent, improve the likelihood that an
>> >> implementation selected encoding will be a good match for the programmer's
>> >> intent. This is because:
>> >>
>> >> 1. std::format() and std::print() are not implicitly locale dependent; that
>> >> rules out selection of a locale dependent execution encoding.
>> >> 2. std::format() returns a std::string; that rules out selection of an I/O
>> >> dependent encoding.
>> >> 3. std::print() writes to an I/O stream, but has special behavior for writes
>> >> to a terminal; that rules out selection of a terminal encoding (as unnecessary,
>> >> at least in important cases).
>> >> 4. std::format() and std::print() are both strongly associated with the
>> >> ordinary/wide literal encoding.
>> >> 5. std::format() and std::print() should have the same behavior (other than
>> >> that std::print(...) may produce a better result than std::cout <<
>> >> std::format(...) when the output is directed to a terminal).
>> >> 6. std::format() and std::print() have additional guarantees when the
>> >> ordinary/wide literal encoding is a UTF encoding.
>> >>
>> >>
>> >> Due to those characteristics, we have good motivation for implicit use of the
>> >> ordinary/wide literal encoding as the target for transcoding for std::format()
>> >> and std::print().
>> > I'm afraid that I don't quite understand.
>> >
>> > What does std::format( "{}", u8"..." ) actually do? I suppose it transcodes
>> > the UTF-8 input into the narrow literal encoding (replacing unrepresentable
>> > characters with '?' instead of throwing, I presume, or it would not be very
>> > usable)?
>>
>> We'll have to see what Corentin proposes :)
>>
>> But yes, something very much like that.
>>
>> Note that we could also support std::format("{:L}", u8"...") to enable a
>> programmer to explicitly request transcoding to a locale dependent
>> encoding (either now or at some future point).
>>
>> (Corentin, at a minimum, we should reserve the L option in your paper).
>>
>
> We have an opportunity to not conflate locale and encodings here.
>
> As much as I would like that to be the case, I don't think it is.
>
> u8"" is a known quantity here: it's UTF-8.
> But the target is also a known quantity; we very clearly decided it to be
> the literal encoding, because we need to parse it, and
> we wisely decided to assume a literal encoding. So the target encoding is
> also a known quantity.
>
> Unfortunately, that isn't the case when a programmer opts in to use of a
> locale. Consider the following when the literal encoding is any ASCII
> derived encoding and the global locale encoding is EUC-JP (ujis).
>
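> #include <chrono>
> #include <format>
> #include <iostream>
> #include <locale>
>
> int main() {
>     // Adopt the environment's locale as the global locale.
>     std::locale::global(std::locale(""));
>     // "L" requests the locale-specific month name.
>     std::cout << std::format("{:L}\n", std::chrono::August);
> }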
>
> The resulting string will be formed from the literal encoding (for the
> '\n' character) and the name of the month provided by the *formatting
> locale <http://eel.is/c++draft/time.format#2>*. Nothing ensures that the
> latter is converted to the literal encoding. Further, a validly encoded
> string is produced so long as the characters used in the format string are
> from the basic literal character set.
>
> In my environment (Linux, using a pre-release build of Clang 19 and
> libc++), compiling the above with the default literal encoding (UTF-8) and
> running it with LANG=ja_JP.ujis produces output in EUC-JP as expected;
> note the iconv invocation.
>
> $ clang++ -std=c++23 -stdlib=libc++ t.cpp -o t
> $ LANG=ja_JP.ujis ./t | iconv -f ujis -t utf-8
> 8月
>
> (yes, that is the right output; it is conventional for some translations of
> month names to include the month number before the localized name).
>
> Long time SG16 participants will recall P2372R3 (Fixing locale handling
> in chrono formatters) <https://wg21.link/p2372r3> and LWG 3547
> <https://wg21.link/lwg3547>. There was relevant discussion during the 2021-04-28
> SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2021.md#april-28th-2021>
> .
>
> I have vague recollections of discussions about requiring that locale
> dependent translations be provided in the literal encoding when it is a UTF
> one, but I haven't been able to identify any such recorded discussion. I
> don't see anything in the current WP that would require this.
>
> Based on the above, I think that, at a minimum, the "L" option should be
> reserved.
>
>
>
>
>
>>
>> >
>> > And then we just fall back to std::cout << "...", where the "..." is in the
>> > narrow literal encoding and hence we assume works, more or less.
>> Correct.
>> >
>> > And we don't want to make std::cout << u8"..." do that, because it can,
>> > in principle, do better?
>> Not because it can do better, but because there is more uncertainty
>> about what the user might expect. If the user writes std::cout <<
>> std::format(...), then that is an explicit opt in to the behavior that
>> std::format() exhibits. But they might also want to just write UTF-8
>> bytes unmodified regardless of what the ordinary literal encoding is. Or
>> they might expect implicit transcoding to either the current locale or
>> the environment locale or even the terminal locale. By not providing a
>> default behavior, we give the programmer the opportunity to think about
>> what they are actually trying to do.
>>
>
> I don't quite buy this argument.
> When cout << 42.0; outputs "42,0", the choices of text nature, locale, and
> encoding were made for us.
> If the programmer wants to be creative, they can consider I/O manipulators.
>
> Consider printing of other localized names as in the example above.
>
> #include <chrono>
> #include <format>
> #include <iostream>
> #include <iomanip>
> #include <locale>
>
> int main() {
>     // The stream initially uses the default locale.
>     std::cout << "Default locale: '" << std::cout.getloc().name() << "'\n";
>     std::cout << std::chrono::August << "\n";
>     // Imbue the locale selected by the environment.
>     std::cout.imbue(std::locale(""));
>     std::cout << "Environment locale: '" << std::cout.getloc().name() << "'\n";
>     std::cout << std::chrono::August << "\n";
>     // Imbue an explicitly named UTF-8 locale.
>     std::cout.imbue(std::locale("ja_JP.utf8"));
>     std::cout << "Explicit locale: '" << std::cout.getloc().name() << "'\n";
>     std::cout << std::chrono::August << "\n";
> }
>
> I get the following output running that locally with LANG=ja_JP.ujis.
> Note the mojibake and corresponding substitution of replacement characters.
>
> Default locale: 'C'
> Aug
> Environment locale: ''
> 8��
> Explicit locale: 'ja_JP.utf8'
> 8月
>
> The (well recognized) problem with iostreams is the implicit use of the
> imbued locale. The consistent behavior for iostreams would be that
> inserters and extractors for charN_t would transcode to the encoding of
> the imbued locale. But that doesn't work well at all in the common case
> where no locale has been explicitly imbued.
>
> Making a choice for std::format() is simpler because the programmer
> chooses the locale behavior on a per-argument basis; there is a good
> default.
>
>
>
>> >
>> > But let me get back to your list.
>> >
>> >> 1. std::format() and std::print() are not implicitly locale dependent; that
>> >> rules out selection of a locale dependent execution encoding.
>> > What is in a locale-dependent execution encoding in std::cout << u8"..."?
>> iostreams implicitly consults either an imbued locale facet or the
>> global locale for formatting operations. Think about std::cout <<
>> std::chrono::Sunday. Depending on the current locale, this might print
>> "Sun" or a localized weekday name in a locale dependent encoding.
>>
>
> But again, the only thing we care about for u8 is the encoding.
> And I am not aware of std::locale ever impacting that.
>
> I hope the above examples are motivating.
>
>
>
>> >
>> >> 2. std::format() returns a std::string; that rules out selection of an I/O
>> >> dependent encoding.
>> > Same question. Where is the I/O dependent encoding in std::cout << u8"..."
>> > (that is not also present in std::cout << some_std_string)?
>> In the latter case, we have to assume that some_std_string holds text in
>> the encoding expected on the other end of the stream. We can't do that
>> for u8"...", so we have to transcode to something (or have some other
>> assurance that UTF-8 is intended and expected).
>> >
>> >> 3. std::print() writes to an I/O stream, but has special behavior for writes
>> >> to a terminal; that rules out selection of a terminal encoding (as unnecessary,
>> >> at least in important cases).
>
>> > This doesn't apply here, because we're using std::format.
>>
>
> Right, this is one of the reasons I feel less compelled to pursue iostream
> surgery.
> Output behavior is suboptimal on Windows, and unlikely to be fixed.
>
> I am likewise not compelled to pursue iostream support.
>
> Agreed with later remarks below.
>
> Tom.
>
>
>
>> >> 5. std::format() and std::print() should have the same behavior (other than
>> >> that std::print(...) may produce a better result than std::cout <<
>> >> std::format(...) when the output is directed to a terminal).
>> > OK... but this isn't relevant.
>> The above two are relevant because we wouldn't want to differentiate
>> behavior for formatting a u8"..." argument for std::format() vs
>> std::print(). The latter helps to constrain the reasonable options for
>> the former.
>
>
> Right, print just does format and outputs the result.
>
>
>> >
>> >> 6. std::format() and std::print() have additional guarantees when the
>> >> ordinary/wide literal encoding is a UTF encoding.
>> > What additional guarantees, and how do they help here?
>>
>> We specify additional constraints for fill characters, display width
>> (well, normative encouragement), and formatting of escaped strings. None
>> of these are relevant for reflection purposes; they help to reinforce a
>> choice to depend on the ordinary/wide literal encoding for behavior of
>> these functions. We don't have such precedent for iostreams.
>>
>
> And you know, the format string is parsed in the ordinary encoding and
> copied as-is.
>
>
>>
>> Tom.
>>
>>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2024-05-08 16:48:44