ISOCPP sg16 List: Re: [isocpp-sg16] Follow up on SG16 review of P2996R2 (Reflection for C++26)

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Wed, 8 May 2024 09:54:11 -0700

> The ASCII and EBCDIC code page based locale programming model used on
> POSIX and Windows systems is not broken.

It is actually broken on Windows for reasons explained in
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2093r14.html.

- Victor

On Fri, May 3, 2024 at 1:11 PM Tom Honermann via SG16 <sg16_at_[hidden]>
wrote:

> On 5/2/24 6:35 PM, Corentin Jabot via SG16 wrote:
>
>
>
> On Thu, May 2, 2024 at 11:25 PM Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 4/30/24 2:31 AM, Corentin Jabot via SG16 wrote:
>>
>>
>>
>> On Tue, Apr 30, 2024 at 12:45 AM Tom Honermann <tom_at_[hidden]> wrote:
>>
>>> On 4/29/24 4:11 PM, Peter Dimov via SG16 wrote:
>>> > Tom Honermann wrote:
>>> >> I'm not entirely sure that cout << std::format("{}", u8"...") is
>>> that much
>>> >> easier
>>> >> to specify and support.
>>> >>
>>> >> But I'll be glad to be proven wrong, of course. :-)
>>> >>
>>> >> There is a relevant SO comment
>>> >> <https://stackoverflow.com/questions/58878651/what-is-the-printf-
>>> >> formatting-character-for-char8-t/58895428#58895428> .
>>> >>
>>> >> std::format() and std::print(), to some extent, improve the
>>> likelihood that an
>>> >> implementation selected encoding will be a good match for the
>>> programmer's
>>> >> intent. This is because:
>>> >>
>>> >> 1. std::format() and std::print() are not implicitly locale
>>> dependent; that
>>> >> rules out selection of a locale dependent execution encoding.
>>> >> 2. std::format() returns a std::string; that rules out selection of
>>> an I/O
>>> >> dependent encoding.
>>> >> 3. std::print() writes to an I/O stream, but has special behavior
>>> for writes
>>> >> to a terminal; that rules out selection of a terminal encoding (as
>>> unnecessary,
>>> >> at least in important cases).
>>> >> 4. std::format() and std::print() are both strongly associated with
>>> the
>>> >> ordinary/wide literal encoding.
>>> >> 5. std::format() and std::print() should have the same behavior
>>> (other than
>>> >> that std::print(...) may produce a better result than std::cout <<
>>> >> std::format(...) when the output is directed to a terminal).
>>> >> 6. std::format() and std::print() have additional guarantees when
>>> the
>>> >> ordinary/wide literal encoding is a UTF encoding.
>>> >>
>>> >>
>>> >> Due to those characteristics, we have good motivation for implicit
>>> use of the
>>> >> ordinary/wide literal encoding as the target for transcoding for
>>> std::format()
>>> >> and std::print().
>>> > I'm afraid that I don't quite understand.
>>> >
>>> > What does std::format( "{}", u8"..." ) actually do? I suppose it
>>> transcodes
>>> > the UTF-8 input into the narrow literal encoding (replacing
>>> irrepresentable
>>> > characters with '?' instead of throwing, I presume, or it would be not
>>> very
>>> > usable)?
>>>
>>> We'll have to see what Corentin proposes :)
>>>
>>> But yes, something very much like that.
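A minimal sketch of the behavior Peter describes, assuming the narrow literal encoding is plain 7-bit ASCII: every code point the target cannot represent becomes '?' rather than raising an error. The function name utf8_to_ascii_lossy is hypothetical, not a proposed or existing library API, and the input is taken as a std::string of UTF-8 bytes rather than char8_t so that the sketch does not depend on C++20.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a standard library function): converts UTF-8 bytes
// to 7-bit ASCII, writing `replacement` once for every code point (or
// malformed sequence) that ASCII cannot represent.
std::string utf8_to_ascii_lossy(const std::string& utf8, char replacement = '?') {
    // The replacement itself has to be representable in the target encoding.
    assert(static_cast<unsigned char>(replacement) < 0x80);

    std::string out;
    out.reserve(utf8.size());
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80) {  // single-byte sequence: already ASCII
            out.push_back(static_cast<char>(lead));
            ++i;
            continue;
        }
        // Expected sequence length, judged from the lead byte alone.
        std::size_t len = 1;  // stray continuation byte or invalid lead byte
        if ((lead & 0xE0) == 0xC0) len = 2;
        else if ((lead & 0xF0) == 0xE0) len = 3;
        else if ((lead & 0xF8) == 0xF0) len = 4;

        // Consume the lead byte plus the continuation bytes (10xxxxxx) that
        // actually follow it, stopping early on truncated input.
        std::size_t consumed = 1;
        while (consumed < len && i + consumed < utf8.size() &&
               (static_cast<unsigned char>(utf8[i + consumed]) & 0xC0) == 0x80) {
            ++consumed;
        }
        out.push_back(replacement);
        i += consumed;
    }
    return out;
}
```

Under these assumptions, the UTF-8 bytes for "8月" would come out as "8?". A real facility would also have to decide how to report or count substitutions, which this sketch ignores.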
>>>
>>> Note that we could also support std::format("{:L}", u8"...") to enable a
>>> programmer to explicitly request transcoding to a locale dependent
>>> encoding (either now or at some future point).
>>>
>>> (Corentin, at a minimum, we should reserve the L option in your paper).
>>>
>>
>> We have an opportunity to not conflate locale and encodings here.
>>
>> As much as I would like that to be the case, I don't think it is.
>>
>> u8"" is a known quantity here, it's utf-8.
>> But the target is also a known quantity, we very clearly decided it to be
>> the literal encoding, because we need to parse it, and
>> we wisely decided to assume a literal encoding. So the target encoding is
>> also a known quantity
>>
>> Unfortunately, that isn't the case when a programmer opts in to use of a
>> locale. Consider the following when the literal encoding is any ASCII
>> derived encoding and the global locale encoding is EUC-JP (ujis).
>>
>> #include <chrono>
>> #include <format>
>> #include <iostream>
>> #include <locale>
>> int main() {
>> std::locale::global(std::locale(""));
>> std::cout << std::format("{:L}\n", std::chrono::August);
>> }
>>
>> The resulting string will be formed from the literal encoding (for the
>> '\n' character) and the name of the month provided by the *formatting
>> locale <http://eel.is/c++draft/time.format#2>*. Nothing ensures that the
>> latter is converted to the literal encoding. Further, a validly encoded
>> string is produced so long as the characters used in the format string are
>> from the basic literal character set.
>>
>> In my environment (Linux, using a pre-release build of Clang 19 and
>> libc++), compiling the above with the default literal encoding (UTF-8) and
>> running it with LANG=ja_JP.ujis produces output in EUC-JP as expected;
>> note the iconv invocation.
>>
>> $ clang++ -std=c++23 -stdlib=libc++ t.cpp -o t
>> $ LANG=ja_JP.ujis ./t | iconv -f ujis -t utf-8
>> 8月
>>
>> (yes, that is the right output; it is a convention in some translations of
>> month names to include the month number before the localized name).
>>
>> Long time SG16 participants will recall P2372R3 (Fixing locale handling
>> in chrono formatters) <https://wg21.link/p2372r3> and LWG 3547
>> <https://wg21.link/lwg3547>. There was relevant discussion during the 2021-04-28
>> SG16 meeting
>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2021.md#april-28th-2021>
>> .
>>
>> I have vague recollections of discussions about requiring that locale
>> dependent translations be provided in the literal encoding when it is a UTF
>> one, but I haven't been able to identify any such recorded discussion. I
>> don't see anything in the current WP that would require this.
>>
>> Based on the above, I think that, at a minimum, the "L" option should be
>> reserved.
>>
>
> I'm not sure what you are arguing about because "L" can only be applied to
> things that can be "localized" (i.e. mangled horribly by POSIX).
>
> You are right that I wasn't very clear about what I'm suggesting. I'll try
> to clarify.
>
> The ASCII and EBCDIC code page based locale programming model used on
> POSIX and Windows systems is not broken. It does have sharp edges. Unicode
> and its associated encodings have enabled a new programming model with
> fewer constraints and pitfalls, but that has not completely displaced the
> code page based programming model nor do I think it ever will. The code
> page based programming model requires the following:
>
> 1. Since C and C++ programs start with the global locale set to "C",
> it is necessary to opt-in to locale dependent behavior by calling
> std::locale::global() and/or std::setlocale().
> 2. Such programs, in order to avoid mojibake, must constrain the use
> of compile-time selected characters encoded in the ordinary literal
> encoding to those that have an invariant representation in all supported
> locale dependent encodings
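The second requirement can be approximated with a simple check. The sketch below is an illustration rather than a definitive test: the allowlist is an assumption, built from letters, digits, and punctuation of the C++ basic character set, and a few of those characters (for example the backslash and tilde) are known to differ in some code pages, so a real project would trim the list to the locales it supports. The function name uses_only_portable_chars is hypothetical.

```cpp
#include <string>

// Hypothetical check: returns true if every character of `s` comes from a
// conservative allowlist of characters that ASCII-derived code pages usually
// encode identically. The allowlist is an assumption, not a normative set.
bool uses_only_portable_chars(const std::string& s) {
    static const std::string allowed =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        " \t\n_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";
    for (char c : s) {
        if (allowed.find(c) == std::string::npos) {
            return false;  // byte outside the allowlist, e.g. part of a UTF-8 sequence
        }
    }
    return true;
}
```

A format string such as "In the month of {:L}\n" passes this check, while a literal containing the UTF-8 bytes of "8月" does not.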
>
> There has been a lot of code written over the last 40 or so years that
> adheres to this model. Many such programs are effectively locale agnostic
> though full localization requires translations provided by message catalogs
> (that themselves rely on locale; GNU gettext
> <https://www.gnu.org/software/gettext/> and POSIX catopen
> <https://pubs.opengroup.org/onlinepubs/9699919799/functions/catopen.html>
> are relevant). In my opinion, these programs should continue to work and
> continue to benefit from C++ standard library enhancements.
>
> Let's look at that example from above again:
>
> std::cout << std::format("{:L}\n", std::chrono::August);
>
> Regardless of what the ordinary literal encoding is, if LANG is ja_JP.sjis,
> then valid Shift-JIS output will be produced. Likewise, if it is
> ja_JP.utf8, zh_CN.gb18030, or zh_TW.big5, valid output will be produced
> in those encodings. This is portable code that works on all platforms (with
> the right platform dependent locale names; those sadly are not portable).
>
> Let's now assume a hypothetical message catalog of translated strings that
> works similarly to gettext, but that provides UTF-8 encoded translations in
> char8_t.
>
> std::cout << std::format("{} {:L}\n", u8msg("In the month of"),
> std::chrono::August);
>
> If we unconditionally require the char8_t argument to be transcoded to
> the ordinary literal encoding, then mojibake will be produced unless the
> ordinary literal encoding happens to match the locale encoding.
>
> I strongly agree that, for std::format(), the default behavior should be
> that char8_t strings are transcoded to the ordinary literal encoding.
>
> What I am arguing for is that there should also be an option for the
> programmer to opt-in to locale based transcoding of arguments that
> potentially require transcoding. Thus:
>
> std::cout << std::format("{:L} {:L}\n", u8msg("In the month of"),
> std::chrono::August);
>
> would portably produce correct locale dependent output (and transcoding
> would be reduced to a byte copy when the locale encoding is UTF-8).
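The byte-copy remark can be sketched as a small dispatch step. Everything named here is hypothetical: names_utf8, to_locale_encoding, and the Transcoder callback are illustrations, not proposed interfaces. The codeset name is passed in explicitly; on POSIX systems it could plausibly be obtained from nl_langinfo(CODESET) after the locale has been set, and the callback could wrap a facility such as iconv, but both of those choices are assumptions outside what the thread specifies.

```cpp
#include <cctype>
#include <functional>
#include <string>

// Hypothetical: true if a codeset name such as "UTF-8", "utf8" or "UTF_8"
// denotes UTF-8. Codeset spellings vary across platforms, so hyphens,
// underscores and letter case are ignored.
bool names_utf8(const std::string& codeset) {
    std::string folded;
    for (char c : codeset) {
        if (c == '-' || c == '_') continue;
        folded.push_back(static_cast<char>(std::tolower(static_cast<unsigned char>(c))));
    }
    return folded == "utf8";
}

// Hypothetical callback type standing in for a real transcoding facility.
using Transcoder =
    std::function<std::string(const std::string& utf8, const std::string& target)>;

// Returns `utf8` converted to `codeset`. When the target is UTF-8 the bytes
// are copied unchanged and the transcoder is never invoked.
std::string to_locale_encoding(const std::string& utf8,
                               const std::string& codeset,
                               const Transcoder& transcode) {
    if (names_utf8(codeset)) {
        return utf8;  // fast path: plain byte copy
    }
    return transcode(utf8, codeset);
}
```

The design point is only that the locale-dependent path costs nothing extra when the locale encoding already matches the source encoding.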
>
> For the short term however, I'm content to just reserve the 'L' option;
> actually doing the work to support this can await further motivation and
> standard transcoding facilities.
>
> std::format("{:L}\n", ""); is ill-formed, so would be std::format("{:L}\n",
> u8"");
> https://eel.is/c++draft/format#string.std-17 (it's also used in chrono)
> https://compiler-explorer.com/z/58bsTaf3o
> Besides, reservation is not necessary, users cannot write formatters for
> types that do not depend on user-defined types (or, if you prefer, it's
> already reserved)
>
> Victor can correct me if I'm mistaken, but my understanding has been that
> changes to std::formatter specializations might cause (sometimes?) an ABI
> break. The following is recorded in the 2023-11-29 SG16 meeting
> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2023.md#november-29th-2023>
> notes from discussion of P3045R0 (Quantities and units library)
> <https://wg21.link/p3045r0>.
>
> "Victor recommended reserving an 'L' option specifier in the format
> specification that would render the code ill-formed for now so as to allow
> extension later without an ABI break."
>
>
> But the issue with the whole scenario you are describing is that:
>
> 1. We keep trying to give meaning to programs where the execution encoding is
> not a superset of the literal encoding even though the encoding is
> generally not part of the type system
> So for 2 arbitrary strings a and b, concatenating them might not produce a
> good result, and we can't solve it.
>
> Such programs already have a meaning and have for the last 40 years or so.
> I agree there are sharp edges here that we can't fix.
>
> You will note that format makes the assumption that everything is in the
> literal encoding and it's working wonderfully well.
>
> std::format() does not ensure that the output produced is in any
> particular encoding. We spent a lot of time talking about whether
> std::format() produces text and eventually concluded that it is not
> required to do so.
>
> I am not arguing for a change in direction; in fact, I'm arguing for
> preserving consistent behavior with regard to its existing locale dependent
> behavior so that there is an option for *not* producing mojibake.
>
> It's certainly not perfect - i.e. we taught people to compile with /utf-8
> on Windows but the system is still not defaulting to UTF-8, but it's as
> good as we can reasonably get.
>
> Agreed. And for those that are able to use /utf-8, that is great. I have
> no data, but I would bet a good deal of cash that the vast majority of code
> that is compiled with MSVC is not compiled with /utf-8.
>
>
> 2. When you ask for the name of August in Japanese, as a user, you probably
> don't expect part of your program to be encoded in some weird encoding that
> is different to the rest of the program.
> We try to patch that in format/chrono, but it's certainly not perfect
> https://eel.is/c++draft/time#format-3.sentence-3
>
> Thank you! That is the wording I was looking for with regard to my "vague
> recollections of discussions" statement above.
>
>
> Anyway, I'm not sure how that is relevant to the u8 discussion, L affects
> individual arguments, not the formatting string (the literal encoding is
> the ground truth for encoding as far as format is concerned)
>
> I hope the above better explains the relevance.
>
> Tom.
>
>
>
>
>
>
>
>>
>>
>>
>>>
>>> >
>>> > And then we just fall back to std::cout << "...", where the "..." is
>>> in the
>>> > narrow literal encoding and hence we assume works, more or less.
>>> Correct.
>>> >
>>> > And we don't want to make std::cout << u8"..." do that, because it can,
>>> > in principle, do better?
>>> Not because it can do better, but because there is more uncertainty
>>> about what the user might expect. If the user writes std::cout <<
>>> std::format(...), then that is an explicit opt in to the behavior that
>>> std::format() exhibits. But they might also want to just write UTF-8
>>> bytes unmodified regardless of what the ordinary literal encoding is. Or
>>> they might expect implicit transcoding to either the current locale or
>>> the environment locale or even the terminal locale. By not providing a
>>> default behavior, we give the programmer the opportunity to think about
>>> what they are actually trying to do.
>>>
>>
>> I don't quite buy this argument.
>> When cout << 42.0; outputs "42,0", the choices of text nature, locale and
>> encoding were made for us.
>> If the programmer wants to be creative, one can consider io manipulators.
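To illustrate how an imbued locale silently changes numeric output, here is a minimal sketch that does not depend on any locale being installed on the system. The comma_decimal facet and the format_fixed helper are hypothetical names introduced for illustration; the facet stands in for what a named locale with a decimal comma would supply.

```cpp
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Illustrative facet (hypothetical, not from the thread): a numpunct that
// uses ',' as the decimal point, as many European locales do.
struct comma_decimal : std::numpunct<char> {
protected:
    char do_decimal_point() const override { return ','; }
};

// Formats `value` with one digit after the decimal point using whatever
// locale is imbued into the stream; the caller never names an encoding.
std::string format_fixed(double value, const std::locale& loc) {
    std::ostringstream os;
    os.imbue(loc);
    os << std::fixed << std::setprecision(1) << value;
    return os.str();
}
```

Calling format_fixed(42.0, std::locale::classic()) gives "42.0", while passing std::locale(std::locale::classic(), new comma_decimal) gives "42,0". The insertion code is identical in both cases; only the imbued locale differs.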
>>
>> Consider printing of other localized names as in the example above.
>>
>> #include <chrono>
>> #include <format>
>> #include <iostream>
>> #include <iomanip>
>> #include <locale>
>> int main() {
>> std::cout << "Default locale: '" << std::cout.getloc().name() << "'\n";
>> std::cout << std::chrono::August << "\n";
>> std::cout.imbue(std::locale(""));
>> std::cout << "Environment locale: '" << std::cout.getloc().name() <<
>> "'\n";
>> std::cout << std::chrono::August << "\n";
>> std::cout.imbue(std::locale("ja_JP.utf8"));
>> std::cout << "Explicit locale: '" << std::cout.getloc().name() << "'\n";
>> std::cout << std::chrono::August << "\n";
>> }
>>
>> I get the following output running that locally with LANG=ja_JP.ujis.
>> Note the mojibake and corresponding substitution of replacement characters.
>>
>> Default locale: 'C'
>> Aug
>> Environment locale: ''
>> 8��
>> Explicit locale: 'ja_JP.utf8'
>> 8月
>>
>> The (well recognized) problem with iostreams is the implicit use of the
>> imbued locale. The consistent behavior for iostreams would be that
>> inserters and extractors for charN_t would transcode to the encoding of
>> the imbued locale. But that doesn't work well at all in the common case
>> where no locale has been explicitly imbued.
>>
>> Making a choice for std::format() is simpler because the programmer
>> chooses the locale behavior on a per-argument basis; there is a good
>> default.
>>
>>
>>
>>> >
>>> > But let me get back to your list.
>>> >
>>> >> 1. std::format() and std::print() are not implicitly locale
>>> dependent; that
>>> >> rules out selection of a locale dependent execution encoding.
>>> > Where is the locale-dependent execution encoding in std::cout <<
>>> u8"..."?
>>> iostreams implicitly consults either an imbued locale facet or the
>>> global locale for formatting operations. Think about std::cout <<
>>> std::chrono::Sunday. Depending on the current locale, this might print
>>> "Sun" or a localized weekday name in a locale dependent encoding.
>>>
>>
>> But again, the only thing we care about for u8 is the encoding.
>> And I am not aware of std::locale ever impacting that.
>>
>> I hope the above examples are motivating.
>>
>>
>>
>>> >
>>> >> 2. std::format() returns a std::string; that rules out selection of
>>> an I/O
>>> >> dependent encoding.
>>> > Same question. Where is the I/O dependent encoding in std::cout <<
>>> u8"..."
>>> > (that is not also present in std::cout << some_std_string)?
>>> In the latter case, we have to assume that some_std_string holds text in
>>> the encoding expected on the other end of the stream. We can't do that
>>> for u8"...", so we have to transcode to something (or have some other
>>> assurance that UTF-8 is intended and expected).
>>> >
>>> >> 3. std::print() writes to an I/O stream, but has special behavior
>>> for writes
>>> >> to a terminal; that rules out selection of a terminal encoding (as
>>> unnecessary,
>>> >> at least in important cases).
>>
>> > This doesn't apply here, because we're using std::format.
>>>
>>
>> Right, this is one of the reasons I feel less compelled to pursue
>> iostream surgery.
>> Output behavior is suboptimal on Windows, and unlikely to be fixed.
>>
>> I am likewise not compelled to pursue iostream support.
>>
>> Agreed with later remarks below.
>>
>> Tom.
>>
>>
>>
>>> >> 5. std::format() and std::print() should have the same behavior
>>> (other than
>>> >> that std::print(...) may produce a better result than std::cout <<
>>> >> std::format(...) when the output is directed to a terminal).
>>> > OK... but this isn't relevant.
>>> The above two are relevant because we wouldn't want to differentiate
>>> behavior for formatting a u8"..." argument for std::format() vs
>>> std::print(). The latter helps to constrain the reasonable options for
>>> the former.
>>
>>
>> Right, print just does format and outputs the result
>>
>>
>>> >
>>> >> 6. std::format() and std::print() have additional guarantees when
>>> the
>>> >> ordinary/wide literal encoding is a UTF encoding.
>>> > What additional guarantees, and how do they help here?
>>>
>>> We specify additional constraints for fill characters, display width
>>> (well, normative encouragement), and formatting of escaped strings. None
>>> of these are relevant for reflection purposes; they help to reinforce a
>>> choice to depend on the ordinary/wide literal encoding for behavior of
>>> these functions. We don't have such precedent for iostreams.
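A small sketch of why fill and width handling need to know the encoding: in UTF-8, the number of bytes and the number of code points differ as soon as non-ASCII text appears. The count_code_points helper is hypothetical and deliberately simplified. The width estimation in the working draft is defined in terms of grapheme clusters and per-code-point display widths, which this sketch does not attempt to model.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in well-formed UTF-8 by counting
// every byte that is not a continuation byte (10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (char c : utf8) {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For the UTF-8 bytes of "8月", std::string::size() reports 4 while count_code_points reports 2, so padding computed from the byte count would be wrong.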
>>>
>>
>> And you know, the format string is parsed in the ordinary encoding and
>> copied as-is
>>
>>
>>>
>>> Tom.
>>>
>>>
>>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2024-05-08 16:54:28