Re: [isocpp-sg16] Follow up on SG16 review of P2996R2 (Reflection for C++26)

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Wed, 8 May 2024 11:11:28 -0700
What you call a "code page programming model" is broken because there is no
single code page on Windows; there is a collection of incompatible code
pages that can change at runtime. The example in the paper shows this
clearly: you get mojibake even in the ideal case where all code pages are
static and correspond to a single localization, not just in the imaginary
case you give, where a user changed the terminal encoding in an incompatible
way. This is why modern output facilities such as the one in Rust and
std::print in C++23 avoid code pages completely. It is impossible in
principle to make this work.

You can ignore the non-Cyrillic part of the message; it is not relevant to
the problem. It has nothing to do with UTF-8 specifically; the same is true
for legacy encodings.
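
To make the mojibake concrete, here is a minimal sketch, assuming a Russian-localized Windows setup where text is encoded in Windows-1251 and the console decodes it as CP866. The decoders are deliberately partial (ASCII, the Cyrillic letters, and the three extra CP866 code points this example needs), and the helper names are made up for illustration.

```cpp
#include <string>
#include <string_view>

// Partial Windows-1251 decoder: ASCII plus the Cyrillic block 0xC0..0xFF.
// Anything else maps to U+FFFD in this sketch.
char32_t decode_cp1251(unsigned char b) {
    if (b < 0x80) return b;
    if (b >= 0xC0) return 0x0410 + (b - 0xC0);
    return 0xFFFD;
}

// Partial CP866 decoder: ASCII, the two Cyrillic ranges, and the three
// extra code points used below. Anything else maps to U+FFFD.
char32_t decode_cp866(unsigned char b) {
    if (b < 0x80) return b;
    if (b <= 0xAF) return 0x0410 + (b - 0x80);
    if (b >= 0xE0 && b <= 0xEF) return 0x0440 + (b - 0xE0);
    switch (b) {
        case 0xCF: return 0x2567;  // box drawing character
        case 0xF0: return 0x0401;  // CYRILLIC CAPITAL LETTER IO
        case 0xF2: return 0x0404;  // CYRILLIC CAPITAL LETTER UKRAINIAN IE
        default:   return 0xFFFD;
    }
}

// Decode a byte string with the given single-byte decoder.
std::u32string decode(std::string_view bytes,
                      char32_t (*decode_byte)(unsigned char)) {
    std::u32string out;
    for (char c : bytes) {
        out.push_back(decode_byte(static_cast<unsigned char>(c)));
    }
    return out;
}
```

With the Windows-1251 bytes CF F0 E8 E2 E5 F2, decode(..., decode_cp1251) yields "Привет", while decode(..., decode_cp866) yields "╧ЁштхЄ". The bytes are identical; only the code page used to interpret them differs.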

- Victor

On Wed, May 8, 2024 at 10:38 AM Tom Honermann <tom_at_[hidden]> wrote:

> On 5/8/24 12:54 PM, Victor Zverovich wrote:
>
> > The ASCII and EBCDIC code page based locale programming model used on
> POSIX and Windows systems is not broken.
>
> It is actually broken on Windows for reasons explained in
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2093r14.html.
>
> I'm not sure what you are referring to as broken. If you are referring to
> characters not being displayed correctly in the Windows console due to the
> console using a code page that is not aligned with the locale/environment
> by default (because of backward compatibility with old DOS applications),
> then yes, that is broken, but it is broken due to the inconsistent encoding
> selection, not due to the code page based programming model. The same
> behavior would be exhibited on Linux if the terminal encoding was changed
> to CP437.
>
> Note that the motivating example in section 9 of that paper, std::cout <<
> "Привет, κόσμος!";, violates the code page based programming model by
> using characters in the string literal that do not have an invariant
> representation in all supported locales. Further, since the example
> explicitly specifies the use of UTF-8, the example is limited to locales
> that use UTF-8 as their encoding. There is nothing wrong with that code of
> course, it just requires a UTF-8 based programming model.
>
> Tom.
>
>
> - Victor
>
> On Fri, May 3, 2024 at 1:11 PM Tom Honermann via SG16 <
> sg16_at_[hidden]> wrote:
>
>> On 5/2/24 6:35 PM, Corentin Jabot via SG16 wrote:
>>
>>
>>
>> On Thu, May 2, 2024 at 11:25 PM Tom Honermann <tom_at_[hidden]> wrote:
>>
>>> On 4/30/24 2:31 AM, Corentin Jabot via SG16 wrote:
>>>
>>>
>>>
>>> On Tue, Apr 30, 2024 at 12:45 AM Tom Honermann <tom_at_[hidden]>
>>> wrote:
>>>
>>>> On 4/29/24 4:11 PM, Peter Dimov via SG16 wrote:
>>>> > Tom Honermann wrote:
>>>> >> I'm not entirely sure that cout << std::format("{}", u8"...") is that
>>>> >> much easier to specify and support.
>>>> >>
>>>> >> But I'll be glad to be proven wrong, of course. :-)
>>>> >>
>>>> >> There is a relevant SO comment
>>>> >> <https://stackoverflow.com/questions/58878651/what-is-the-printf-formatting-character-for-char8-t/58895428#58895428> .
>>>> >>
>>>> >> std::format() and std::print(), to some extent, improve the likelihood
>>>> >> that an implementation selected encoding will be a good match for the
>>>> >> programmer's intent. This is because:
>>>> >>
>>>> >> 1. std::format() and std::print() are not implicitly locale dependent;
>>>> >> that rules out selection of a locale dependent execution encoding.
>>>> >> 2. std::format() returns a std::string; that rules out selection of an
>>>> >> I/O dependent encoding.
>>>> >> 3. std::print() writes to an I/O stream, but has special behavior for
>>>> >> writes to a terminal; that rules out selection of a terminal encoding
>>>> >> (as unnecessary, at least in important cases).
>>>> >> 4. std::format() and std::print() are both strongly associated with the
>>>> >> ordinary/wide literal encoding.
>>>> >> 5. std::format() and std::print() should have the same behavior (other
>>>> >> than that std::print(...) may produce a better result than std::cout <<
>>>> >> std::format(...) when the output is directed to a terminal).
>>>> >> 6. std::format() and std::print() have additional guarantees when the
>>>> >> ordinary/wide literal encoding is a UTF encoding.
>>>> >>
>>>> >>
>>>> >> Due to those characteristics, we have good motivation for implicit use
>>>> >> of the ordinary/wide literal encoding as the target for transcoding for
>>>> >> std::format() and std::print().
>>>> > I'm afraid that I don't quite understand.
>>>> >
>>>> > What does std::format( "{}", u8"..." ) actually do? I suppose it
>>>> > transcodes the UTF-8 input into the narrow literal encoding (replacing
>>>> > irrepresentable characters with '?' instead of throwing, I presume, or
>>>> > it would be not very usable)?
>>>>
>>>> We'll have to see what Corentin proposes :)
>>>>
>>>> But yes, something very much like that.
>>>>
>>>> Note that we could also support std::format("{:L}", u8"...") to enable a
>>>> programmer to explicitly request transcoding to a locale dependent
>>>> encoding (either now or at some future point).
>>>>
>>>> (Corentin, at a minimum, we should reserve the L option in your paper).
>>>>
>>>
>>> We have an opportunity to not conflate locale and encodings here.
>>>
>>> As much as I would like that to be the case, I don't think it is.
>>>
>>> u8"" is a known quantity here, it's utf-8.
>>> But the target is also a known quantity, we very clearly decided it to
>>> be the literal encoding, because we need to parse it, and
>>> we wisely decided to assume a literal encoding. So the target encoding
>>> is also a known quantity
>>>
>>> Unfortunately, that isn't the case when a programmer opts in to use of a
>>> locale. Consider the following when the literal encoding is any ASCII
>>> derived encoding and the global locale encoding is EUC-JP (ujis).
>>>
>>> #include <chrono>
>>> #include <format>
>>> #include <iostream>
>>> #include <locale>
>>> int main() {
>>> std::locale::global(std::locale(""));
>>> std::cout << std::format("{:L}\n", std::chrono::August);
>>> }
>>>
>>> The resulting string will be formed from the literal encoding (for the
>>> '\n' character) and the name of the month provided by the *formatting
>>> locale <http://eel.is/c++draft/time.format#2>*. Nothing ensures that
>>> the latter is converted to the literal encoding. Further, a validly encoded
>>> string is produced so long as the characters used in the format string are
>>> from the basic literal character set.
>>>
>>> In my environment (Linux, using a pre-release build of Clang 19 and
>>> libc++), compiling the above with the default literal encoding (UTF-8) and
>>> running it with LANG=ja_JP.ujis produces output in EUC-JP as expected;
>>> note the iconv invocation.
>>>
>>> $ clang++ -std=c++23 -stdlib=libc++ t.cpp -o t
>>> $ LANG=ja_JP.ujis ./t | iconv -f ujis -t utf-8
>>> 8月
>>>
>>> (yes, that is the right output; it is a convention for some translations
>>> of month names to include the month number before the localized name).
>>>
>>> Long time SG16 participants will recall P2372R3 (Fixing locale handling
>>> in chrono formatters) <https://wg21.link/p2372r3> and LWG 3547
>>> <https://wg21.link/lwg3547>. There was relevant discussion during the 2021-04-28
>>> SG16 meeting
>>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2021.md#april-28th-2021>
>>> .
>>>
>>> I have vague recollections of discussions about requiring that locale
>>> dependent translations be provided in the literal encoding when it is a UTF
>>> one, but I haven't been able to identify any such recorded discussion. I
>>> don't see anything in the current WP that would require this.
>>>
>>> Based on the above, I think that, at a minimum, the "L" option should be
>>> reserved.
>>>
>>
>> I'm not sure what you are arguing about because "L" can only be applied
>> to things that can be "localized" (i.e. mangled horribly by POSIX).
>>
>> You are right that I wasn't very clear about what I'm suggesting. I'll
>> try to clarify.
>>
>> The ASCII and EBCDIC code page based locale programming model used on
>> POSIX and Windows systems is not broken. It does have sharp edges. Unicode
>> and its associated encodings have enabled a new programming model with
>> fewer constraints and pitfalls, but that has not completely displaced the
>> code page based programming model nor do I think it ever will. The code
>> page based programming model requires the following:
>>
>> 1. Since C and C++ programs start with the global locale set to "C",
>> it is necessary to opt-in to locale dependent behavior by calling
>> std::locale::global() and/or std::setlocale().
>> 2. Such programs, in order to avoid mojibake, must constrain the use
>> of compile-time selected characters encoded in the ordinary literal
>> encoding to those that have an invariant representation in all supported
>> locale dependent encodings.
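
The two requirements above can be sketched as follows. This is only an approximation: it assumes ASCII-derived encodings and treats printable 7-bit ASCII plus tab and newline as the "invariant" set, which is narrower or wider than what a given set of supported locales actually guarantees. Both function names are hypothetical.

```cpp
#include <locale>
#include <string_view>

// Requirement 1: before any call to std::locale::global() or
// std::setlocale(), the global locale is the "C" locale.
bool global_locale_is_c() {
    return std::locale().name() == "C";
}

// Requirement 2 (approximation, see assumptions in the text): accept only
// printable 7-bit ASCII plus '\t' and '\n'. Any byte >= 0x80 is rejected
// because its meaning depends on the locale encoding.
bool uses_only_invariant_chars(std::string_view s) {
    for (unsigned char c : s) {
        const bool printable_ascii = c >= 0x20 && c <= 0x7E;
        if (!printable_ascii && c != '\t' && c != '\n') {
            return false;
        }
    }
    return true;
}
```

A format string such as "{:L}\n" passes this check, while a literal containing UTF-8 encoded Cyrillic does not, which is the distinction the second requirement draws.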
>>
>> There has been a lot of code written over the last 40 or so years that
>> adheres to this model. Many such programs are effectively locale agnostic
>> though full localization requires translations provided by message catalogs
>> (that themselves rely on locale; GNU gettext
>> <https://www.gnu.org/software/gettext/> and POSIX catopen
>> <https://pubs.opengroup.org/onlinepubs/9699919799/functions/catopen.html>
>> are relevant). In my opinion, these programs should continue to work and
>> continue to benefit from C++ standard library enhancements.
>>
>> Let's look at that example from above again:
>>
>> std::cout << std::format("{:L}\n", std::chrono::August);
>>
>> Regardless of what the ordinary literal encoding is, if LANG is
>> ja_JP.sjis, then valid Shift-JIS output will be produced. Likewise, if
>> it is ja_JP.utf8, zh_CN.gb18030, or zh_TW.big5, valid output will be
>> produced in those encodings. This is portable code that works on all
>> platforms (with the right platform dependent locale names; those sadly are
>> not portable).
>>
>> Let's now assume a hypothetical message catalog of translated strings
>> that works similarly to gettext, but that provides UTF-8 encoded
>> translations in char8_t.
>>
>> std::cout << std::format("{} {:L}\n", u8msg("In the month of"),
>> std::chrono::August);
>>
>> If we unconditionally require the char8_t argument to be transcoded to
>> the ordinary literal encoding, then mojibake will be produced unless the
>> ordinary literal encoding happens to match the locale encoding.
>>
>> I strongly agree that, for std::format(), the default behavior should be
>> that char8_t strings are transcoded to the ordinary literal encoding.
>>
>> What I am arguing for is that there should also be an option for the
>> programmer to opt-in to locale based transcoding of arguments that
>> potentially require transcoding. Thus:
>>
>> std::cout << std::format("{:L} {:L}\n", u8msg("In the month of"),
>> std::chrono::August);
>>
>> would portably produce correct locale dependent output (and transcoding
>> would be reduced to a byte copy when the locale encoding is UTF-8).
>>
>> For the short term however, I'm content to just reserve the 'L' option;
>> actually doing the work to support this can await further motivation and
>> standard transcoding facilities.
>>
>> std::format("{:L}\n", ""); is ill-formed, so would be std::format("{:L}\n",
>> u8"");
>> https://eel.is/c++draft/format#string.std-17 (it's also used in chrono)
>> https://compiler-explorer.com/z/58bsTaf3o
>> Besides, reservation is not necessary, users cannot write formatters for
>> types that do not depend on user-defined types (or, if you prefer, it's
>> already reserved)
>>
>> Victor can correct me if I'm mistaken, but my understanding has been that
>> changes to std::formatter specializations might cause (sometimes?) an
>> ABI break. The following is recorded in the 2023-11-29 SG16 meeting
>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2023.md#november-29th-2023>
>> notes from discussion of P3045R0 (Quantities and units library)
>> <https://wg21.link/p3045r0>.
>>
>> "Victor recommended reserving an 'L' option specifier in the format
>> specification that would render the code ill-formed for now so as to allow
>> extension later without an ABI break."
>>
>>
>> But the issue with the whole scenario you are describing is that:
>>
>> 1.
>> We keep trying to give meaning to programs where the execution encoding
>> is not a superset of the literal encoding even though the encoding is
>> generally not part of the type system
>> So for 2 arbitrary strings a and b, concatenating them might not produce
>> a good result, and we can't solve it.
>>
>> Such programs already have a meaning and have for the last 40 years or
>> so. I agree there are sharp edges here that we can't fix.
>>
>> You will note that format makes the assumption that everything is in the
>> literal encoding and it's working wonderfully well.
>>
>> std::format() does not ensure that the output produced is in any
>> particular encoding. We spent a lot of time talking about whether
>> std::format() produces text and eventually concluded that it is not
>> required to do so.
>>
>> I am not arguing for a change in direction; in fact, I'm arguing for
>> preserving consistent behavior with regard to its existing locale dependent
>> behavior so that there is an option for *not* producing mojibake.
>>
>> It's certainly not perfect - i.e. we taught people to compile with /utf-8
>> on Windows but the system is still not defaulting to UTF-8, but it's as
>> good as we can reasonably get.
>>
>> Agreed. And for those that are able to use /utf-8, that is great. I have
>> no data, but I would bet a good deal of cash that the vast majority of code
>> that is compiled with MSVC is not compiled with /utf-8.
>>
>>
>> 2.
>> When you ask for the name of August in Japanese, as a user, you probably
>> don't expect part of your program to be encoded in some weird encoding that
>> is different to the rest of the program.
>> We try to patch that in format/chrono, but it's certainly not perfect
>> https://eel.is/c++draft/time#format-3.sentence-3
>>
>> Thank you! That is the wording I was looking for with regard to my "vague
>> recollections of discussions" statement above.
>>
>>
>> Anyway, I'm not sure how that is relevant to the u8 discussion, L affects
>> individual arguments, not the formatting string (the literal encoding is
>> the ground truth for encoding as far as format is concerned)
>>
>> I hope the above better explains the relevance.
>>
>> Tom.
>>
>>
>>
>>
>>
>>
>>
>>>
>>>
>>>
>>>>
>>>> >
>>>> > And then we just fall back to std::cout << "...", where the "..." is in
>>>> > the narrow literal encoding and hence we assume works, more or less.
>>>> Correct.
>>>> >
>>>> > And we don't want to make std::cout << u8"..." do that, because it can,
>>>> > in principle, do better?
>>>> Not because it can do better, but because there is more uncertainty
>>>> about what the user might expect. If the user writes std::cout <<
>>>> std::format(...), then that is an explicit opt in to the behavior that
>>>> std::format() exhibits. But they might also want to just write UTF-8
>>>> bytes unmodified regardless of what the ordinary literal encoding is. Or
>>>> they might expect implicit transcoding to either the current locale or
>>>> the environment locale or even the terminal locale. By not providing a
>>>> default behavior, we give the programmer the opportunity to think about
>>>> what they are actually trying to do.
>>>>
>>>
>>> I don't quite buy this argument.
>>> When cout << 42.0; outputs "42,0", the choices of text nature, locale, and
>>> encoding were made for us.
>>> If the programmer wants to be creative, one can consider io manipulators.
>>>
>>> Consider printing of other localized names as in the example above.
>>>
>>> #include <chrono>
>>> #include <format>
>>> #include <iostream>
>>> #include <iomanip>
>>> #include <locale>
>>> int main() {
>>> std::cout << "Default locale: '" << std::cout.getloc().name() << "'\n";
>>> std::cout << std::chrono::August << "\n";
>>> std::cout.imbue(std::locale(""));
>>> std::cout << "Environment locale: '" << std::cout.getloc().name() <<
>>> "'\n";
>>> std::cout << std::chrono::August << "\n";
>>> std::cout.imbue(std::locale("ja_JP.utf8"));
>>> std::cout << "Explicit locale: '" << std::cout.getloc().name() <<
>>> "'\n";
>>> std::cout << std::chrono::August << "\n";
>>> }
>>>
>>> I get the following output running that locally with LANG=ja_JP.ujis.
>>> Note the mojibake and corresponding substitution of replacement characters.
>>>
>>> Default locale: 'C'
>>> Aug
>>> Environment locale: ''
>>> 8��
>>> Explicit locale: 'ja_JP.utf8'
>>> 8月
>>>
>>> The (well recognized) problem with iostreams is the implicit use of the
>>> imbued locale. The consistent behavior for iostreams would be that
>>> inserters and extractors for charN_t would transcode to the encoding of
>>> the imbued locale. But that doesn't work well at all in the common case
>>> where no locale has been explicitly imbued.
>>>
>>> Making a choice for std::format() is simpler because the programmer
>>> chooses the locale behavior on a per-argument basis; there is a good
>>> default.
>>>
>>>
>>>
>>>> >
>>>> > But let me get back to your list.
>>>> >
>>>> >> 1. std::format() and std::print() are not implicitly locale dependent;
>>>> >> that rules out selection of a locale dependent execution encoding.
>>>> > Where is the locale-dependent execution encoding in std::cout << u8"..."?
>>>> iostreams implicitly consults either an imbued locale facet or the
>>>> global locale for formatting operations. Think about std::cout <<
>>>> std::chrono::Sunday. Depending on the current locale, this might print
>>>> "Sun" or a localized weekday name in a locale dependent encoding.
>>>>
>>>
>>> But again, the only thing we care about for u8 is the encoding.
>>> And I am not aware of std::locale ever impacting that.
>>>
>>> I hope the above examples are motivating.
>>>
>>>
>>>
>>>> >
>>>> >> 2. std::format() returns a std::string; that rules out selection of an
>>>> >> I/O dependent encoding.
>>>> > Same question. Where is the I/O dependent encoding in std::cout << u8"..."
>>>> > (that is not also present in std::cout << some_std_string)?
>>>> In the latter case, we have to assume that some_std_string holds text in
>>>> the encoding expected on the other end of the stream. We can't do that
>>>> for u8"...", so we have to transcode to something (or have some other
>>>> assurance that UTF-8 is intended and expected).
>>>> >
>>>> >> 3. std::print() writes to an I/O stream, but has special behavior for
>>>> >> writes to a terminal; that rules out selection of a terminal encoding
>>>> >> (as unnecessary, at least in important cases).
>>>
>>> > This doesn't apply here, because we're using std::format.
>>>>
>>>
>>> Right, this is one of the reasons I feel less compelled to pursue
>>> iostream surgery.
>>> Output behavior is suboptimal on Windows, and unlikely to be fixed.
>>>
>>> I am likewise not compelled to pursue iostream support.
>>>
>>> Agreed with later remarks below.
>>>
>>> Tom.
>>>
>>>
>>>
>>>> >> 5. std::format() and std::print() should have the same behavior (other
>>>> >> than that std::print(...) may produce a better result than std::cout <<
>>>> >> std::format(...) when the output is directed to a terminal).
>>>> > OK... but this isn't relevant.
>>>> The above two are relevant because we wouldn't want to differentiate
>>>> behavior for formatting a u8"..." argument for std::format() vs
>>>> std::print(). The latter helps to constrain the reasonable options for
>>>> the former.
>>>
>>>
>>> Right, print just formats and outputs the result
>>>
>>>
>>>> >
>>>> >> 6. std::format() and std::print() have additional guarantees when the
>>>> >> ordinary/wide literal encoding is a UTF encoding.
>>>> > What additional guarantees, and how do they help here?
>>>>
>>>> We specify additional constraints for fill characters, display width
>>>> (well, normative encouragement), and formatting of escaped strings. None
>>>> of these are relevant for reflection purposes; they help to reinforce a
>>>> choice to depend on the ordinary/wide literal encoding for behavior of
>>>> these functions. We don't have such precedent for iostreams.
>>>>
>>>
>>> And you know, the format string is parsed in the ordinary encoding and
>>> copied as-is
>>>
>>>
>>>>
>>>> Tom.
>>>>
>>>>
>>>
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>

Received on 2024-05-08 18:11:44