Date: Wed, 8 May 2024 20:53:02 +0200
On Wed, May 8, 2024 at 8:31 PM Tom Honermann via SG16 <sg16_at_[hidden]>
wrote:
> I think we have different ideas about what is meant by the "code page
> programming model". What I am referring to is exactly the programming model
> that is used to facilitate support for many code pages where one of them is
> selected at run-time based on the current locale. This is the programming
> model that has been in use since the introduction of C and C++. A new
> programming model was introduced by wchar_t and UTF-16/UTF-32. Much of
> the work that we have been doing (and which std::print() helps to
> facilitate) has been focused on supporting another new programming model;
> the locale independent UTF-8 one.
>
> If you take the example from the paper, compile it with the ordinary
> literal encoding set to Windows-1251, run it on a Windows machine with a
> region setting that uses Windows-1251, and then run chcp 1251 to "fix"
> the console encoding, then the correct output will be displayed. Those
> limitations exist because the example explicitly uses characters that
> limits its application to a specific legacy encoding (in the code pages
> world).
>
> If you take the example and replace the sequence of Cyrillic characters
> with a call to a message catalog that produces an appropriate translation
> based on the current locale, then the example can be compiled with the
> ordinary literal encoding set to any code page and it will run as expected
> on any Windows machine regardless of region setting. Depending on the
> characters present in the translation provided by the message catalog,
> users might still have to run chcp 12XX to "fix" the console encoding
> though since it is not set consistently with the region encoding. Windows
> resource strings, the POSIX message catalog, GNU gettext, etc... have all
> supported this programming model for the last 40 years or so. This history
> is why I emphatically assert that there isn't anything broken with this
> programming model. It has limitations, it has sharp edges, and we can all
> rejoice that support for other programming models has emerged, but that
> doesn't change the fact that there is a lot of code written to this model
> that is still being maintained.
>
I think we are all on the same (code) page regarding terminology.
That code pages have been and still are widely used is undeniable. They
might even be used with some amount of success in the western world and run
happily in millions of devices right now.
That this model is unfit to satisfy our needs and those of our users can
also, independently, be true. (The main novelty being that our users found
this thing called the Internet and got addicted to it, so we can no longer
pretend that Cyrillic text and English text will never intermingle.)
The code page model mathematically represents less information; it is
fundamentally incompatible with our goals.
From there, I hope we can find the right line between not breaking existing
code, which we should definitely avoid doing willy nilly, and being victims
of sunk cost fallacies.
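
To make the console mismatch discussed in this thread concrete, here is a minimal sketch of how one byte is read under two Cyrillic code pages. It assumes the program's literals are encoded in Windows-1251 and the console is using CP866 (the OEM code page commonly paired with Russian region settings). The function names are hypothetical, and the decoders deliberately model only ASCII and the Cyrillic letters.

```cpp
#include <cassert>

// Partial decoders for illustration only: ASCII plus the Cyrillic letters.
// Bytes outside those ranges map to U+FFFD in this sketch, which is NOT what
// the real code pages do (they assign other characters to those bytes).
constexpr char32_t decode_cp1251(unsigned char b) {
    if (b < 0x80) return b;                                            // ASCII
    if (b >= 0xC0) return static_cast<char32_t>(0x0410 + (b - 0xC0));  // U+0410..U+044F
    return 0xFFFD;                                                     // not modelled
}

constexpr char32_t decode_cp866(unsigned char b) {
    if (b < 0x80) return b;                                            // ASCII
    if (b <= 0xAF) return static_cast<char32_t>(0x0410 + (b - 0x80));  // U+0410..U+043F
    if (b >= 0xE0 && b <= 0xEF)
        return static_cast<char32_t>(0x0440 + (b - 0xE0));             // U+0440..U+044F
    return 0xFFFD;                                                     // not modelled
}

void demo_code_page_mismatch() {
    // 0xE8 is the byte a Windows-1251 literal stores for U+0438 (и) ...
    assert(decode_cp1251(0xE8) == 0x0438);
    // ... but a CP866 console displays that same byte as U+0448 (ш).
    assert(decode_cp866(0xE8) == 0x0448);
}
```

The bytes never change; only the table used to interpret them does, which is why running chcp 1251 makes the output readable again.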
> Tom.
> On 5/8/24 2:11 PM, Victor Zverovich wrote:
>
> What you call a "code page programming model" is broken because there is
> no single code page on Windows, there is a collection of incompatible code
> pages that can change at runtime. The example demonstrated in the paper
> shows this clearly: you get mojibake even in the ideal case where all code
> pages are static and correspond to a single localization, not some
> imaginary case where a user changed terminal encoding in an incompatible
> way that you give. This is the reason why modern output facilities such as the
> one in Rust and std::print in C++23 avoid code pages completely. It's just
> impossible to make work in principle.
>
> You can ignore the non-Cyrillic part of the message, it is not relevant to
> the problem. It has nothing to do with UTF-8 specifically, the same is true
> for legacy encodings.
>
> - Victor
>
> On Wed, May 8, 2024 at 10:38 AM Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 5/8/24 12:54 PM, Victor Zverovich wrote:
>>
>> > The ASCII and EBCDIC code page based locale programming model used on
>> POSIX and Windows systems is not broken.
>>
>> It is actually broken on Windows for reasons explained in
>> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2093r14.html.
>>
>> I'm not sure what you are referring to as broken. If you are referring to
>> characters not being displayed correctly in the Windows console due to the
>> console using a code page that is not aligned with the locale/environment
>> by default (because of backward compatibility with old DOS applications),
>> then yes, that is broken, but it is broken due to the inconsistent encoding
>> selection, not due to the code page based programming model. The same
>> behavior would be exhibited on Linux if the terminal encoding was changed
>> to CP437.
>>
>> Note that the motivating example in section 9 of that paper, std::cout
>> << "Привет, κόσμος!";, violates the code page based programming model by
>> using characters in the string literal that do not have an invariant
>> representation in all supported locales. Further, since the example
>> explicitly specifies the use of UTF-8, the example is limited to locales
>> that use UTF-8 as their encoding. There is nothing wrong with that code of
>> course, it just requires a UTF-8 based programming model.
>>
>> Tom.
>>
>>
>> - Victor
>>
>> On Fri, May 3, 2024 at 1:11 PM Tom Honermann via SG16 <
>> sg16_at_[hidden]> wrote:
>>
>>> On 5/2/24 6:35 PM, Corentin Jabot via SG16 wrote:
>>>
>>>
>>>
>>> On Thu, May 2, 2024 at 11:25 PM Tom Honermann <tom_at_[hidden]> wrote:
>>>
>>>> On 4/30/24 2:31 AM, Corentin Jabot via SG16 wrote:
>>>>
>>>>
>>>>
>>>> On Tue, Apr 30, 2024 at 12:45 AM Tom Honermann <tom_at_[hidden]>
>>>> wrote:
>>>>
>>>>> On 4/29/24 4:11 PM, Peter Dimov via SG16 wrote:
>>>>> > Tom Honermann wrote:
>>>>> >> I'm not entirely sure that cout << std::format("{}", u8"...")
>>>>> is that much
>>>>> >> easier
>>>>> >> to specify and support.
>>>>> >>
>>>>> >> But I'll be glad to be proven wrong, of course. :-)
>>>>> >>
>>>>> >> There is a relevant SO comment
>>>>> >> <https://stackoverflow.com/questions/58878651/what-is-the-printf-formatting-character-for-char8-t/58895428#58895428>.
>>>>> >>
>>>>> >> std::format() and std::print(), to some extent, improve the
>>>>> likelihood that an
>>>>> >> implementation selected encoding will be a good match for the
>>>>> programmer's
>>>>> >> intent. This is because:
>>>>> >>
>>>>> >> 1. std::format() and std::print() are not implicitly locale
>>>>> dependent; that
>>>>> >> rules out selection of a locale dependent execution encoding.
>>>>> >> 2. std::format() returns a std::string; that rules out selection
>>>>> of an I/O
>>>>> >> dependent encoding.
>>>>> >> 3. std::print() writes to an I/O stream, but has special behavior
>>>>> for writes
>>>>> >> to a terminal; that rules out selection of a terminal encoding (as
>>>>> unnecessary,
>>>>> >> at least in important cases).
>>>>> >> 4. std::format() and std::print() are both strongly associated
>>>>> with the
>>>>> >> ordinary/wide literal encoding.
>>>>> >> 5. std::format() and std::print() should have the same behavior
>>>>> (other than
>>>>> >> that std::print(...) may produce a better result than std::cout <<
>>>>> >> std::format(...) when the output is directed to a terminal).
>>>>> >> 6. std::format() and std::print() have additional guarantees when
>>>>> the
>>>>> >> ordinary/wide literal encoding is a UTF encoding.
>>>>> >>
>>>>> >>
>>>>> >> Due to those characteristics, we have good motivation for implicit
>>>>> use of the
>>>>> >> ordinary/wide literal encoding as the target for transcoding for
>>>>> std::format()
>>>>> >> and std::print().
>>>>> > I'm afraid that I don't quite understand.
>>>>> >
>>>>> > What does std::format( "{}", u8"..." ) actually do? I suppose it
>>>>> transcodes
>>>>> > the UTF-8 input into the narrow literal encoding (replacing
>>>>> irrepresentable
>>>>> > characters with '?' instead of throwing, I presume, or it would be
>>>>> not very
>>>>> > usable)?
>>>>>
>>>>> We'll have to see what Corentin proposes :)
>>>>>
>>>>> But yes, something very much like that.
>>>>>
>>>>> Note that we could also support std::format("{:L}", u8"...") to enable
>>>>> a
>>>>> programmer to explicitly request transcoding to a locale dependent
>>>>> encoding (either now or at some future point).
>>>>>
>>>>> (Corentin, at a minimum, we should reserve the L option in your paper).
>>>>>
>>>>
>>>> We have an opportunity to not conflate locale and encodings here.
>>>>
>>>> As much as I would like that to be the case, I don't think it is.
>>>>
>>>> u8"" is a known quantity here, it's utf-8.
>>>> But the target is also a known quantity, we very clearly decided it to
>>>> be the literal encoding, because we need to parse it, and
>>>> we wisely decided to assume a literal encoding. So the target encoding
>>>> is also a known quantity
>>>>
>>>> Unfortunately, that isn't the case when a programmer opts in to use of
>>>> a locale. Consider the following when the literal encoding is any ASCII
>>>> derived encoding and the global locale encoding is EUC-JP (ujis).
>>>>
>>>> #include <chrono>
>>>> #include <format>
>>>> #include <iostream>
>>>> #include <locale>
>>>> int main() {
>>>> std::locale::global(std::locale(""));
>>>> std::cout << std::format("{:L}\n", std::chrono::August);
>>>> }
>>>>
>>>> The resulting string will be formed from the literal encoding (for the
>>>> '\n' character) and the name of the month provided by the *formatting
>>>> locale <http://eel.is/c++draft/time.format#2>*. Nothing ensures that
>>>> the latter is converted to the literal encoding. Further, a validly encoded
>>>> string is produced so long as the characters used in the format string are
>>>> from the basic literal character set.
>>>>
>>>> In my environment (Linux, using a pre-release build of Clang 19 and
>>>> libc++), compiling the above with the default literal encoding (UTF-8) and
>>>> running it with LANG=ja_JP.ujis produces output in EUC-JP as expected;
>>>> note the iconv invocation.
>>>>
>>>> $ clang++ -std=c++23 -stdlib=libc++ t.cpp -o t
>>>> $ LANG=ja_JP.ujis ./t | iconv -f ujis -t utf-8
>>>> 8月
>>>>
>>>> (yes, that is the right output; it is convention for some translations
>>>> of month names to include the month number before the localized name).
>>>>
>>>> Long time SG16 participants will recall P2372R3 (Fixing locale
>>>> handling in chrono formatters) <https://wg21.link/p2372r3> and LWG 3547
>>>> <https://wg21.link/lwg3547>. There was relevant discussion during the 2021-04-28
>>>> SG16 meeting
>>>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2021.md#april-28th-2021>
>>>> .
>>>>
>>>> I have vague recollections of discussions about requiring that locale
>>>> dependent translations be provided in the literal encoding when it is a UTF
>>>> one, but I haven't been able to identify any such recorded discussion. I
>>>> don't see anything in the current WP that would require this.
>>>>
>>>> Based on the above, I think that, at a minimum, the "L" option should
>>>> be reserved.
>>>>
>>>
>>> I'm not sure what you are arguing about because "L" can only be applied
>>> to things that can be "localized" (i.e. mangled horribly by POSIX).
>>>
>>> You are right that I wasn't very clear about what I'm suggesting. I'll
>>> try to clarify.
>>>
>>> The ASCII and EBCDIC code page based locale programming model used on
>>> POSIX and Windows systems is not broken. It does have sharp edges. Unicode
>>> and its associated encodings have enabled a new programming model with
>>> fewer constraints and pitfalls, but that has not completely displaced the
>>> code page based programming model nor do I think it ever will. The code
>>> page based programming model requires the following:
>>>
>>> 1. Since C and C++ programs start with the global locale set to "C",
>>> it is necessary to opt-in to locale dependent behavior by calling
>>> std::locale::global() and/or std::setlocale().
>>> 2. Such programs, in order to avoid mojibake, must constrain the use
>>> of compile-time selected characters encoded in the ordinary literal
>>> encoding to those that have an invariant representation in all supported
>>> locale dependent encodings
>>>
>>> There has been a lot of code written over the last 40 or so years that
>>> adheres to this model. Many such programs are effectively locale agnostic
>>> though full localization requires translations provided by message catalogs
>>> (that themselves rely on locale; GNU gettext
>>> <https://www.gnu.org/software/gettext/> and POSIX catopen
>>> <https://pubs.opengroup.org/onlinepubs/9699919799/functions/catopen.html>
>>> are relevant). In my opinion, these programs should continue to work and
>>> continue to benefit from C++ standard library enhancements.
>>>
>>> Let's look at that example from above again:
>>>
>>> std::cout << std::format("{:L}\n", std::chrono::August);
>>>
>>> Regardless of what the ordinary literal encoding is, if LANG is
>>> ja_JP.sjis, then valid Shift-JIS output will be produced. Likewise, if
>>> it is ja_JP.utf8, zh_CN.gb18030, or zh_TW.big5, valid output will be
>>> produced in those encodings. This is portable code that works on all
>>> platforms (with the right platform dependent locale names; those sadly are
>>> not portable).
>>>
>>> Let's now assume a hypothetical message catalog of translated strings
>>> that works similarly to gettext, but that provides UTF-8 encoded
>>> translations in char8_t.
>>>
>>> std::cout << std::format("{} {:L}\n", u8msg("In the month of"),
>>> std::chrono::August);
>>>
>>> If we unconditionally require the char8_t argument to be transcoded to
>>> the ordinary literal encoding, then mojibake will be produced unless the
>>> ordinary literal encoding happens to match the locale encoding.
>>>
>>> I strongly agree that, for std::format(), the default behavior should
>>> be that char8_t strings are transcoded to the ordinary literal encoding.
>>>
>>> What I am arguing for is that there should also be an option for the
>>> programmer to opt-in to locale based transcoding of arguments that
>>> potentially require transcoding. Thus:
>>>
>>> std::cout << std::format("{:L} {:L}\n", u8msg("In the month of"),
>>> std::chrono::August);
>>>
>>> would portably produce correct locale dependent output (and transcoding
>>> would be reduced to a byte copy when the locale encoding is UTF-8).
>>>
>>> For the short term however, I'm content to just reserve the 'L' option;
>>> actually doing the work to support this can await further motivation and
>>> standard transcoding facilities.
>>>
>>> std::format("{:L}\n", ""); is ill-formed, so would be std::format("{:L}\n",
>>> u8"");
>>> https://eel.is/c++draft/format#string.std-17 (it's also used in chrono)
>>> https://compiler-explorer.com/z/58bsTaf3o
>>> Besides, reservation is not necessary, users cannot write formatters for
>>> types that do not depend on user-defined types (or, if you prefer, it's
>>> already reserved)
>>>
>>> Victor can correct me if I'm mistaken, but my understanding has been
>>> that changes to std::formatter specializations might cause (sometimes?)
>>> an ABI break. The following is recorded in the 2023-11-29 SG16 meeting
>>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2023.md#november-29th-2023>
>>> notes from discussion of - P3045R0 (Quantities and units library)
>>> <https://wg21.link/p3045r0>.
>>>
>>> "Victor recommended reserving an 'L' option specifier in the format
>>> specification that would render the code ill-formed for now so as to allow
>>> extension later without an ABI break."
>>>
>>>
>>> But the issue with the whole scenario you are describing is that:
>>>
>>> 1.
>>> We keep trying to give meaning to programs where the execution encoding
>>> is not a superset of the literal encoding even though the encoding is
>>> generally not part of the type system
>>> So for 2 arbitrary strings a and b, concatenating them might not produce
>>> a good result, and we can't solve it.
>>>
>>> Such programs already have a meaning and have for the last 40 years or
>>> so. I agree there are sharp edges here that we can't fix.
>>>
>>> You will note that format makes the assumption that everything is in the
>>> literal encoding and it's working wonderfully well.
>>>
>>> std::format() does not ensure that the output produced is in any
>>> particular encoding. We spent a lot of time talking about whether
>>> std::format() produces text and eventually concluded that it is not
>>> required to do so.
>>>
>>> I am not arguing for a change in direction; in fact, I'm arguing for
>>> preserving consistent behavior with regard to its existing locale dependent
>>> behavior so that there is an option for *not* producing mojibake.
>>>
>>> It's certainly not perfect - i.e. we taught people to compile with /utf-8
>>> on Windows but the system is still not defaulting to UTF-8, but it's as
>>> good as we can reasonably get.
>>>
>>> Agreed. And for those that are able to use /utf-8, that is great. I have
>>> no data, but I would bet a good deal of cash that the vast majority of code
>>> that is compiled with MSVC is not compiled with /utf-8.
>>>
>>>
>>> 2.
>>> When you ask for the name of August in Japanese, as a user, you probably
>>> don't expect part of your program to be encoded in some weird encoding that
>>> is different to the rest of the program.
>>> We try to patch that in format/chrono, but it's certainly not perfect
>>> https://eel.is/c++draft/time#format-3.sentence-3
>>>
>>> Thank you! That is the wording I was looking for with regard to my
>>> "vague recollections of discussions" statement above.
>>>
>>>
>>> Anyway, I'm not sure how that is relevant to the u8 discussion, L
>>> affects individual arguments, not the formatting string (the literal
>>> encoding is the ground truth for encoding as far as format is concerned)
>>>
>>> I hope the above better explains the relevance.
>>>
>>> Tom.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> >
>>>>> > And then we just fall back to std::cout << "...", where the "..." is
>>>>> in the
>>>>> > narrow literal encoding and hence we assume works, more or less.
>>>>> Correct.
>>>>> >
>>>>> > And we don't want to make std::cout << u8"..." do that, because it
>>>>> can,
>>>>> > in principle, do better?
>>>>> Not because it can do better, but because there is more uncertainty
>>>>> about what the user might expect. If the user writes std::cout <<
>>>>> std::format(...), then that is an explicit opt in to the behavior that
>>>>> std::format() exhibits. But they might also want to just write UTF-8
>>>>> bytes unmodified regardless of what the ordinary literal encoding is.
>>>>> Or
>>>>> they might expect implicit transcoding to either the current locale or
>>>>> the environment locale or even the terminal locale. By not providing a
>>>>> default behavior, we give the programmer the opportunity to think
>>>>> about
>>>>> what they are actually trying to do.
>>>>>
>>>>
>>>> I don't quite buy this argument.
>>>> When cout << 42.0; outputs "42,0", the decisions about text nature, locale
>>>> and encoding were made for us.
>>>> If the programmer wants to be creative, one can consider io
>>>> manipulators.
>>>>
>>>> Consider printing of other localized names as in the example above.
>>>>
>>>> #include <chrono>
>>>> #include <format>
>>>> #include <iostream>
>>>> #include <iomanip>
>>>> #include <locale>
>>>> int main() {
>>>> std::cout << "Default locale: '" << std::cout.getloc().name() <<
>>>> "'\n";
>>>> std::cout << std::chrono::August << "\n";
>>>> std::cout.imbue(std::locale(""));
>>>> std::cout << "Environment locale: '" << std::cout.getloc().name() <<
>>>> "'\n";
>>>> std::cout << std::chrono::August << "\n";
>>>> std::cout.imbue(std::locale("ja_JP.utf8"));
>>>> std::cout << "Explicit locale: '" << std::cout.getloc().name() <<
>>>> "'\n";
>>>> std::cout << std::chrono::August << "\n";
>>>> }
>>>>
>>>> I get the following output running that locally with LANG=ja_JP.ujis.
>>>> Note the mojibake and corresponding substitution of replacement characters.
>>>>
>>>> Default locale: 'C'
>>>> Aug
>>>> Environment locale: ''
>>>> 8��
>>>> Explicit locale: 'ja_JP.utf8'
>>>> 8月
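
The replacement characters in the output above come from EUC-JP bytes reaching a consumer that expects UTF-8. A minimal structural check makes that visible. This is a sketch only: `looks_like_utf8` is a hypothetical helper that tests lead/continuation byte structure and does not reject overlong forms or surrogates, and the EUC-JP byte pair used for the month character (0xB7 0xEE) is an assumption for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: checks only that lead bytes are followed by the
// right number of continuation bytes. Not a full UTF-8 validator.
bool looks_like_utf8(std::string_view bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;
        if (b < 0x80)                extra = 0;  // ASCII
        else if ((b & 0xE0) == 0xC0) extra = 1;  // 110xxxxx
        else if ((b & 0xF0) == 0xE0) extra = 2;  // 1110xxxx
        else if ((b & 0xF8) == 0xF0) extra = 3;  // 11110xxx
        else return false;           // stray continuation or invalid lead byte
        if (bytes.size() - i - 1 < extra) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            if ((static_cast<unsigned char>(bytes[i + k]) & 0xC0) != 0x80)
                return false;        // expected a continuation byte
        }
        i += extra + 1;
    }
    return true;
}

void demo_mixed_encodings() {
    // "8" followed by U+6708 encoded as UTF-8: structurally valid.
    assert(looks_like_utf8("8\xE6\x9C\x88"));
    // "8" followed by the assumed EUC-JP bytes: rejected, since 0xB7 is a
    // continuation byte with no preceding lead byte.
    assert(!looks_like_utf8("8\xB7\xEE"));
}
```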
>>>>
>>>> The (well recognized) problem with iostreams is the implicit use of the
>>>> imbued locale. The consistent behavior for iostreams would be that
>>>> inserters and extractors for charN_t would transcode to the encoding
>>>> of the imbued locale. But that doesn't work well at all in the common case
>>>> where no locale has been explicitly imbued.
>>>>
>>>> Making a choice for std::format() is simpler because the programmer
>>>> chooses the locale behavior on a per-argument basis; there is a good
>>>> default.
>>>>
>>>>
>>>>
>>>>> >
>>>>> > But let me get back to your list.
>>>>> >
>>>>> >> 1. std::format() and std::print() are not implicitly locale
>>>>> dependent; that
>>>>> >> rules out selection of a locale dependent execution encoding.
>>>>> > What is in a locale-dependent execution encoding in std::cout <<
>>>>> u8"..."?
>>>>> iostreams implicitly consults either an imbued locale facet or the
>>>>> global locale for formatting operations. Think about std::cout <<
>>>>> std::chrono::Sunday. Depending on the current locale, this might print
>>>>> "Sun" or a localized weekday name in a locale dependent encoding.
>>>>>
>>>>
>>>> But again, the only thing we care about for u8 is the encoding.
>>>> And I am not aware of std::locale ever impacting that.
>>>>
>>>> I hope the above examples are motivating.
>>>>
>>>>
>>>>
>>>>> >
>>>>> >> 2. std::format() returns a std::string; that rules out selection
>>>>> of an I/O
>>>>> >> dependent encoding.
>>>>> > Same question. Where is the I/O dependent encoding in std::cout <<
>>>>> u8"..."
>>>>> > (that is not also present in std::cout << some_std_string)?
>>>>> In the latter case, we have to assume that some_std_string holds text
>>>>> in
>>>>> the encoding expected on the other end of the stream. We can't do that
>>>>> for u8"...", so we have to transcode to something (or have some other
>>>>> assurance that UTF-8 is intended and expected).
>>>>> >
>>>>> >> 3. std::print() writes to an I/O stream, but has special behavior
>>>>> for writes
>>>>> >> to a terminal; that rules out selection of a terminal encoding (as
>>>>> unnecessary,
>>>>> >> at least in important cases).
>>>>
>>>> > This doesn't apply here, because we're using std::format.
>>>>>
>>>>
>>>> Right, this is one of the reasons I feel less compelled to pursue
>>>> iostream surgery.
>>>> Output behavior is suboptimal on windows, and unlikely to be fixed.
>>>>
>>>> I am likewise not compelled to pursue iostream support.
>>>>
>>>> Agreed with later remarks below.
>>>>
>>>> Tom.
>>>>
>>>>
>>>>
>>>>> >> 5. std::format() and std::print() should have the same behavior
>>>>> (other than
>>>>> >> that std::print(...) may produce a better result than std::cout <<
>>>>> >> std::format(...) when the output is directed to a terminal).
>>>>> > OK... but this isn't relevant.
>>>>> The above two are relevant because we wouldn't want to differentiate
>>>>> behavior for formatting a u8"..." argument for std::format() vs
>>>>> std::print(). The latter helps to constrain the reasonable options for
>>>>> the former.
>>>>
>>>>
>>>> Right, print just does format and output the result
>>>>
>>>>
>>>>> >
>>>>> >> 6. std::format() and std::print() have additional guarantees when
>>>>> the
>>>>> >> ordinary/wide literal encoding is a UTF encoding.
>>>>> > What additional guarantees, and how do they help here?
>>>>>
>>>>> We specify additional constraints for fill characters, display width
>>>>> (well, normative encouragement), and formatting of escaped strings.
>>>>> None
>>>>> of these are relevant for reflection purposes; they help to reinforce
>>>>> a
>>>>> choice to depend on the ordinary/wide literal encoding for behavior of
>>>>> these functions. We don't have such precedent for iostreams.
>>>>>
>>>>
>>>> And you know, the format string is parsed in the ordinary encoding and
>>>> copied as-is
>>>>
>>>>
>>>>>
>>>>> Tom.
>>>>>
>>>>>
>>>>
>>> --
>>> SG16 mailing list
>>> SG16_at_[hidden]
>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>
wrote:
> I think we have different ideas about what is meant by the "code page
> programming model". What I am referring to is exactly the programming model
> that is used to facilitate support for many code pages where one of them is
> selected at run-time based on the current locale. This is the programming
> model that has been in use since the introduction of C and C++. A new
> programming model was introduced by wchar_t and UTF-16/UTF-32. Much of
> the work that we have been doing (and which std::print() helps to
> facilitate) has been focused on supporting another new programming model;
> the locale independent UTF-8 one.
>
> If you take the example from the paper, compile it with the ordinary
> literal encoding set to Windows-1251, run it on a Windows machine with a
> region setting that uses Windows-1251, and then run chcp 1251 to "fix"
> the console encoding, then the correct output will be displayed. Those
> limitations exist because the example explicitly uses characters that
> limits its application to a specific legacy encoding (in the code pages
> world).
>
> If you take the example and replace the sequence of Cyrillic characters
> with a call to a message catalog that produces an appropriate translation
> based on the current locale, then the example can be compiled with the
> ordinary literal encoding set to any code page and it will run as expected
> on any Windows machine regardless of region setting. Depending on the
> characters present in the translation provided by the message catalog,
> users might still have to run chcp 12XX to "fix" the console encoding
> though since it is not set consistently with the region encoding. Windows
> resource strings, the POSIX message catalog, GNU gettext, etc... have all
> supported this programming model for the last 40 years or so. This history
> is why I emphatically assert that there isn't anything broken with this
> programming model. It has limitations, it has sharp edges, and we can all
> rejoice that support for other programming models has emerged, but that
> doesn't change the fact that there is a lot of code written to this model
> that is still being maintained.
>
I think we are all on the same (code) page regarding terminology.
That code pages have been and still are widely used is undeniable. They
might even be used with some amount of success in the western world and run
happily in millions of devices right now.
That this model is unfit to satisfy our needs and that of our users can
also, independently be true. (The main novelty being that our users found
this thing called the Internet and got addicted to it, so we can no longer
pretend that Cyrillic text and English text will never intermingle.
The code page model mathematically represents less information, it is
fundamentally incompatible with our goals.
>From there, I hope we can find the right line between not breaking existing
code, which we should definitely avoid doing willy nilly, and being victims
of sunk cost fallacies.
> Tom.
> On 5/8/24 2:11 PM, Victor Zverovich wrote:
>
> What you call a "code page programming model" is broken because there is
> no single code page on Windows, there is a collection of incompatible code
> pages that can change at runtime. The example demonstrated in the paper
> shows this clearly: you get mojibake even in the ideal case where all code
> pages are static and correspond to a single localization, not some
> imaginary case where a user changed terminal encoding in an incompatible
> way that you give. This is the reason why modern output facilities such the
> one in Rust and std::print in C++23 avoid code pages completely. It's just
> impossible to make work in principle.
>
> You can ignore the non-Cyrillic part of the message, it is not relevant to
> the problem. It has nothing to do with UTF-8 specifically, the same is true
> for legacy encodings.
>
> - Victor
>
> On Wed, May 8, 2024 at 10:38 AM Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 5/8/24 12:54 PM, Victor Zverovich wrote:
>>
>> > The ASCII and EBCDIC code page based locale programming model used on
>> POSIX and Windows systems is not broken.
>>
>> It is actually broken on Windows for reasons explained in
>> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2093r14.html.
>>
>> I'm not sure what you are referring to as broken. If you are referring to
>> characters not being displayed correctly in the Windows console due to the
>> console using a code page that is not aligned with the locale/environment
>> by default (because of backward compatibility with old DOS applications),
>> then yes, that is broken, but it is broken due to the inconsistent encoding
>> selection, not due to the code page based programming model. The same
>> behavior would be exhibited on Linux if the terminal encoding was changed
>> to CP437.
>>
>> Note that the motivating example in section 9 of that paper, std::cout
>> << "Привет, κόσμος!";, violates the code page paged programming model by
>> using characters in the string literal that do not have an invariant
>> representation in all supported locales. Further, since the example
>> explicitly specifies the use of UTF-8, the example is limited to locales
>> that use UTF-8 as their encoding. There is nothing wrong with that code of
>> course, it just requires a UTF-8 based programming model.
>>
>> Tom.
>>
>>
>> - Vicrtor
>>
>> On Fri, May 3, 2024 at 1:11 PM Tom Honermann via SG16 <
>> sg16_at_[hidden]> wrote:
>>
>>> On 5/2/24 6:35 PM, Corentin Jabot via SG16 wrote:
>>>
>>>
>>>
>>> On Thu, May 2, 2024 at 11:25 PM Tom Honermann <tom_at_[hidden]> wrote:
>>>
>>>> On 4/30/24 2:31 AM, Corentin Jabot via SG16 wrote:
>>>>
>>>>
>>>>
>>>> On Tue, Apr 30, 2024 at 12:45 AM Tom Honermann <tom_at_[hidden]>
>>>> wrote:
>>>>
>>>>> On 4/29/24 4:11 PM, Peter Dimov via SG16 wrote:
>>>>> > Tom Honermann wrote:
>>>>> >> I'm not entirely sure that cout << std::format("{}", u8"...")
>>>>> is that much
>>>>> >> easier
>>>>> >> to specify and support.
>>>>> >>
>>>>> >> But I'll be glad to be proven wrong, of course. :-)
>>>>> >>
>>>>> >> There is a relevant SO comment
>>>>> >> <https://stackoverflow.com/questions/58878651/what-is-the-printf-
>>>>> >> formatting-character-for-char8-t/58895428#58895428> .
>>>>> >>
>>>>> >> std::format() and std::print(), to some extent, improve the
>>>>> likelihood that an
>>>>> >> implementation selected encoding will be a good match for the
>>>>> programmer's
>>>>> >> intent. This is because:
>>>>> >>
>>>>> >> 1. std::format() and std::print() are not implicitly locale
>>>>> dependent; that
>>>>> >> rules out selection of a locale dependent execution encoding.
>>>>> >> 2. std::format() returns a std::string; that rules out selection
>>>>> of an I/O
>>>>> >> dependent encoding.
>>>>> >> 3. std::print() writes to an I/O stream, but has special behavior
>>>>> for writes
>>>>> >> to a terminal; that rules out selection of a terminal encoding (as
>>>>> unnecessary,
>>>>> >> at least in important cases).
>>>>> >> 4. std::format() and std::print() are both strongly associated
>>>>> with the
>>>>> >> ordinary/wide literal encoding.
>>>>> >> 5. std::format() and std::print() should have the same behavior
>>>>> (other than
>>>>> >> that std::print(...) may produce a better result than std::cout <<
>>>>> >> std::format(...) when the output is directed to a terminal).
>>>>> >> 6. std::format() and std::print() have additional guarantees when
>>>>> the
>>>>> >> ordinary/wide literal encoding is a UTF encoding.
>>>>> >>
>>>>> >>
>>>>> >> Due to those characteristics, we have good motivation for implicit
>>>>> use of the
>>>>> >> ordinary/wide literal encoding as the target for transcoding for
>>>>> std::format()
>>>>> >> and std::print().
>>>>> > I'm afraid that I don't quite understand.
>>>>> >
>>>>> > What does std::format( "{}", u8"..." ) actually do? I suppose it
>>>>> transcodes
>>>>> > the UTF-8 input into the narrow literal encoding (replacing
>>>>> irrepresentable
>>>>> > characters with '?' instead of throwing, I presume, or it would be
>>>>> not very
>>>>> > usable)?
>>>>>
>>>>> We'll have to see what Corentin proposes :)
>>>>>
>>>>> But yes, something very much like that.
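For illustration only, the '?' substitution described above can be sketched as follows. ASCII stands in for the narrow literal encoding, the input is taken as raw UTF-8 bytes in a std::string_view so the sketch also builds as C++17, and the name utf8_to_ascii_lossy is hypothetical. It is not an existing or proposed standard facility, and nothing here reflects what a paper would actually specify.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper, not a standard facility: transcode UTF-8 to ASCII,
// writing one '?' for every code point that ASCII cannot represent.
// ASCII is an assumed stand-in for the narrow literal encoding.
// Malformed input is handled leniently: a stray or truncated byte
// sequence also becomes a single '?'.
std::string utf8_to_ascii_lossy(std::string_view in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        if (lead < 0x80) {  // ASCII: copy through unchanged
            out.push_back(static_cast<char>(lead));
            ++i;
            continue;
        }
        // Expected sequence length, derived from the lead byte.
        std::size_t len = 1;
        if ((lead & 0xE0) == 0xC0) len = 2;
        else if ((lead & 0xF0) == 0xE0) len = 3;
        else if ((lead & 0xF8) == 0xF0) len = 4;
        // Skip the continuation bytes that are actually present.
        std::size_t j = i + 1;
        while (j < in.size() && j < i + len &&
               (static_cast<unsigned char>(in[j]) & 0xC0) == 0x80) {
            ++j;
        }
        out.push_back('?');  // one substitute per code point
        i = j;
    }
    return out;
}
```

Substituting rather than throwing keeps the call usable for diagnostics, at the cost of silently losing information; which of the two is the right default is exactly the open design question.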
>>>>>
>>>>> Note that we could also support std::format("{:L}", u8"...") to enable
>>>>> a
>>>>> programmer to explicitly request transcoding to a locale dependent
>>>>> encoding (either now or at some future point).
>>>>>
>>>>> (Corentin, at a minimum, we should reserve the L option in your paper).
>>>>>
>>>>
>>>> We have an opportunity to not conflate locale and encodings here.
>>>>
>>>> As much as I would like that to be the case, I don't think it is.
>>>>
>>>> u8"" is a known quantity here, it's utf-8.
>>>> But the target is also a known quantity, we very clearly decided it to
>>>> be the literal encoding, because we need to parse it, and
>>>> we wisely decided to assume a literal encoding. So the target encoding
>>>> is also a known quantity
>>>>
>>>> Unfortunately, that isn't the case when a programmer opts in to use of
>>>> a locale. Consider the following when the literal encoding is any ASCII
>>>> derived encoding and the global locale encoding is EUC-JP (ujis).
>>>>
>>>> #include <chrono>
>>>> #include <format>
>>>> #include <iostream>
>>>> #include <locale>
>>>> int main() {
>>>>   std::locale::global(std::locale(""));
>>>>   std::cout << std::format("{:L}\n", std::chrono::August);
>>>> }
>>>>
>>>> The resulting string will be formed from the literal encoding (for the
>>>> '\n' character) and the name of the month provided by the *formatting
>>>> locale <http://eel.is/c++draft/time.format#2>*. Nothing ensures that
>>>> the latter is converted to the literal encoding. Further, a validly encoded
>>>> string is produced so long as the characters used in the format string are
>>>> from the basic literal character set.
>>>>
>>>> In my environment (Linux, using a pre-release build of Clang 19 and
>>>> libc++), compiling the above with the default literal encoding (UTF-8) and
>>>> running it with LANG=ja_JP.ujis produces output in EUC-JP as expected;
>>>> note the iconv invocation.
>>>>
>>>> $ clang++ -std=c++23 -stdlib=libc++ t.cpp -o t
>>>> $ LANG=ja_JP.ujis ./t | iconv -f ujis -t utf-8
>>>> 8月
>>>>
>>>> (yes, that is the right output; by convention, some translations of
>>>> month names include the month number before the localized name).
>>>>
>>>> Long-time SG16 participants will recall P2372R3 (Fixing locale
>>>> handling in chrono formatters) <https://wg21.link/p2372r3> and LWG 3547
>>>> <https://wg21.link/lwg3547>. There was relevant discussion during the 2021-04-28
>>>> SG16 meeting
>>>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2021.md#april-28th-2021>
>>>> .
>>>>
>>>> I have vague recollections of discussions about requiring that locale
>>>> dependent translations be provided in the literal encoding when it is a UTF
>>>> one, but I haven't been able to identify any such recorded discussion. I
>>>> don't see anything in the current WP that would require this.
>>>>
>>>> Based on the above, I think that, at a minimum, the "L" option should
>>>> be reserved.
>>>>
>>>
>>> I'm not sure what you are arguing about because "L" can only be applied
>>> to things that can be "localized" (i.e. mangled horribly by POSIX).
>>>
>>> You are right that I wasn't very clear about what I'm suggesting. I'll
>>> try to clarify.
>>>
>>> The ASCII and EBCDIC code page based locale programming model used on
>>> POSIX and Windows systems is not broken. It does have sharp edges. Unicode
>>> and its associated encodings have enabled a new programming model with
>>> fewer constraints and pitfalls, but that has not completely displaced the
>>> code page based programming model nor do I think it ever will. The code
>>> page based programming model requires the following:
>>>
>>> 1. Since C and C++ programs start with the global locale set to "C",
>>> it is necessary to opt-in to locale dependent behavior by calling
>>> std::locale::global() and/or std::setlocale().
>>> 2. Such programs, in order to avoid mojibake, must constrain the use
>>> of compile-time selected characters encoded in the ordinary literal
>>> encoding to those that have an invariant representation in all supported
>>> locale dependent encodings.
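The second requirement can be approximated by a mechanical check. The sketch below assumes that every supported locale encoding is an ASCII superset, so that 7-bit bytes are treated as invariant. That assumption does not hold for EBCDIC and is only approximately true for encodings such as Shift-JIS, where byte 0x5C is displayed as a yen sign. The name is_locale_invariant is hypothetical.

```cpp
#include <string_view>

// Hypothetical check, assuming all target code pages are ASCII supersets:
// returns true when every byte of the text is in the 7-bit ASCII range,
// i.e. when the text has the same byte representation in each of them.
constexpr bool is_locale_invariant(std::string_view text) {
    for (const char c : text) {
        if (static_cast<unsigned char>(c) >= 0x80) {
            return false;  // byte meaning depends on the code page
        }
    }
    return true;
}
```

Strings that fail such a check are the ones that would need to come from a message catalog rather than from a literal.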
>>>
>>> There has been a lot of code written over the last 40 or so years that
>>> adheres to this model. Many such programs are effectively locale agnostic
>>> though full localization requires translations provided by message catalogs
>>> (that themselves rely on locale; GNU gettext
>>> <https://www.gnu.org/software/gettext/> and POSIX catopen
>>> <https://pubs.opengroup.org/onlinepubs/9699919799/functions/catopen.html>
>>> are relevant). In my opinion, these programs should continue to work and
>>> continue to benefit from C++ standard library enhancements.
>>>
>>> Let's look at that example from above again:
>>>
>>> std::cout << std::format("{:L}\n", std::chrono::August);
>>>
>>> Regardless of what the ordinary literal encoding is, if LANG is
>>> ja_JP.sjis, then valid Shift-JIS output will be produced. Likewise, if
>>> it is ja_JP.utf8, zh_CN.gb18030, or zh_TW.big5, valid output will be
>>> produced in those encodings. This is portable code that works on all
>>> platforms (with the right platform dependent locale names; those sadly are
>>> not portable).
>>>
>>> Let's now assume a hypothetical message catalog of translated strings
>>> that works similarly to gettext, but that provides UTF-8 encoded
>>> translations in char8_t.
>>>
>>> std::cout << std::format("{} {:L}\n", u8msg("In the month of"),
>>> std::chrono::August);
>>>
>>> If we unconditionally require the char8_t argument to be transcoded to
>>> the ordinary literal encoding, then mojibake will be produced unless the
>>> ordinary literal encoding happens to match the locale encoding.
>>>
>>> I strongly agree that, for std::format(), the default behavior should
>>> be that char8_t strings are transcoded to the ordinary literal encoding.
>>>
>>> What I am arguing for is that there should also be an option for the
>>> programmer to opt-in to locale based transcoding of arguments that
>>> potentially require transcoding. Thus:
>>>
>>> std::cout << std::format("{:L} {:L}\n", _u8("In the month of"),
>>> std::chrono::August);
>>>
>>> would portably produce correct locale dependent output (and transcoding
>>> would be reduced to a byte copy when the locale encoding is UTF-8).
>>>
>>> For the short term however, I'm content to just reserve the 'L' option;
>>> actually doing the work to support this can await further motivation and
>>> standard transcoding facilities.
>>>
>>> std::format("{:L}\n", ""); is ill-formed, so would be std::format("{:L}\n",
>>> u8"");
>>> https://eel.is/c++draft/format#string.std-17 (it's also used in chrono)
>>> https://compiler-explorer.com/z/58bsTaf3o
>>> Besides, reservation is not necessary: users cannot write formatters for
>>> types that do not depend on user-defined types (or, if you prefer, it's
>>> already reserved)
>>>
>>> Victor can correct me if I'm mistaken, but my understanding has been
>>> that changes to std::formatter specializations might cause (sometimes?)
>>> an ABI break. The following is recorded in the 2023-11-29 SG16 meeting
>>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README-2023.md#november-29th-2023>
>>> notes from discussion of P3045R0 (Quantities and units library)
>>> <https://wg21.link/p3045r0>.
>>>
>>> "Victor recommended reserving an 'L' option specifier in the format
>>> specification that would render the code ill-formed for now so as to allow
>>> extension later without an ABI break."
>>>
>>>
>>> But the issue with the whole scenario you are describing is that:
>>>
>>> 1.
>>> We keep trying to give meaning to programs where the execution encoding
>>> is not a superset of the literal encoding even though the encoding is
>>> generally not part of the type system
>>> So for 2 arbitrary strings a and b, concatenating them might not produce
>>> a good result, and we can't solve it.
>>>
>>> Such programs already have a meaning and have for the last 40 years or
>>> so. I agree there are sharp edges here that we can't fix.
>>>
>>> You will note that format makes the assumption that everything is in the
>>> literal encoding and it's working wonderfully well.
>>>
>>> std::format() does not ensure that the output produced is in any
>>> particular encoding. We spent a lot of time talking about whether
>>> std::format() produces text and eventually concluded that it is not
>>> required to do so.
>>>
>>> I am not arguing for a change in direction; in fact, I'm arguing for
>>> preserving consistent behavior with regard to its existing locale dependent
>>> behavior so that there is an option for *not* producing mojibake.
>>>
>>> It's certainly not perfect - i.e. we taught people to compile with /utf-8
>>> on Windows but the system is still not defaulting to UTF-8, but it's as
>>> good as we can reasonably get.
>>>
>>> Agreed. And for those that are able to use /utf-8, that is great. I have
>>> no data, but I would bet a good deal of cash that the vast majority of code
>>> that is compiled with MSVC is not compiled with /utf-8.
>>>
>>>
>>> 2.
>>> When you ask for the name of August in Japanese, as a user, you probably
>>> don't expect part of your program to be encoded in some weird encoding that
>>> is different to the rest of the program.
>>> We try to patch that in format/chrono, but it's certainly not perfect
>>> https://eel.is/c++draft/time#format-3.sentence-3
>>>
>>> Thank you! That is the wording I was looking for with regard to my
>>> "vague recollections of discussions" statement above.
>>>
>>>
>>> Anyway, I'm not sure how that is relevant to the u8 discussion, L
>>> affects individual arguments, not the formatting string (the literal
>>> encoding is the ground truth for encoding as far as format is concerned)
>>>
>>> I hope the above better explains the relevance.
>>>
>>> Tom.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> >
>>>>> > And then we just fall back to std::cout << "...", where the "..." is
>>>>> in the
>>>>> > narrow literal encoding and hence we assume works, more or less.
>>>>> Correct.
>>>>> >
>>>>> > And we don't want to make std::cout << u8"..." do that, because it
>>>>> can,
>>>>> > in principle, do better?
>>>>> Not because it can do better, but because there is more uncertainty
>>>>> about what the user might expect. If the user writes std::cout <<
>>>>> std::format(...), then that is an explicit opt in to the behavior that
>>>>> std::format() exhibits. But they might also want to just write UTF-8
>>>>> bytes unmodified regardless of what the ordinary literal encoding is.
>>>>> Or
>>>>> they might expect implicit transcoding to either the current locale or
>>>>> the environment locale or even the terminal locale. By not providing a
>>>>> default behavior, we give the programmer the opportunity to think
>>>>> about
>>>>> what they are actually trying to do.
>>>>>
>>>>
>>>> I don't quite buy this argument.
>>>> When cout << 42.0; outputs "42,0", the choices about text nature, locale
>>>> and encoding were made for us.
>>>> If the programmer wants to be creative, one can consider io
>>>> manipulators.
>>>>
>>>> Consider printing of other localized names as in the example above.
>>>>
>>>> #include <chrono>
>>>> #include <format>
>>>> #include <iostream>
>>>> #include <iomanip>
>>>> #include <locale>
>>>> int main() {
>>>>   std::cout << "Default locale: '" << std::cout.getloc().name() << "'\n";
>>>>   std::cout << std::chrono::August << "\n";
>>>>   std::cout.imbue(std::locale(""));
>>>>   std::cout << "Environment locale: '" << std::cout.getloc().name() << "'\n";
>>>>   std::cout << std::chrono::August << "\n";
>>>>   std::cout.imbue(std::locale("ja_JP.utf8"));
>>>>   std::cout << "Explicit locale: '" << std::cout.getloc().name() << "'\n";
>>>>   std::cout << std::chrono::August << "\n";
>>>> }
>>>>
>>>> I get the following output running that locally with LANG=ja_JP.ujis.
>>>> Note the mojibake and corresponding substitution of replacement characters.
>>>>
>>>> Default locale: 'C'
>>>> Aug
>>>> Environment locale: ''
>>>> 8��
>>>> Explicit locale: 'ja_JP.utf8'
>>>> 8月
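The replacement characters in the second month name are what a UTF-8 terminal displays for bytes that are not valid UTF-8. A minimal sketch of that validity check follows; is_valid_utf8 is a hypothetical helper, not a standard function, and the byte pair B7 EE used in the usage note is an assumed EUC-JP encoding of 月 that serves only as sample input.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns true when the bytes form well-formed UTF-8
// (no stray continuation bytes, truncated sequences, overlong forms,
// surrogates, or values above U+10FFFF).
bool is_valid_utf8(std::string_view s) {
    // Smallest code point that may legitimately use a sequence of each length.
    static constexpr char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;  // stray continuation byte or invalid lead byte
        if (i + len > s.size()) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp[len]) return false;              // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // out of range
        i += len;
    }
    return true;
}
```

With this helper, is_valid_utf8("8\xE6\x9C\x88") holds for the UTF-8 form of the month name, while is_valid_utf8("8\xB7\xEE") does not, because 0xB7 is a continuation byte with no lead byte before it.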
>>>>
>>>> The (well recognized) problem with iostreams is the implicit use of the
>>>> imbued locale. The consistent behavior for iostreams would be that
>>>> inserters and extractors for charN_t would transcode to the encoding
>>>> of the imbued locale. But that doesn't work well at all in the common case
>>>> where no locale has been explicitly imbued.
>>>>
>>>> Making a choice for std::format() is simpler because the programmer
>>>> chooses the locale behavior on a per-argument basis; there is a good
>>>> default.
>>>>
>>>>
>>>>
>>>>> >
>>>>> > But let me get back to your list.
>>>>> >
>>>>> >> 1. std::format() and std::print() are not implicitly locale
>>>>> dependent; that
>>>>> >> rules out selection of a locale dependent execution encoding.
>>>>> > Where is the locale-dependent execution encoding in std::cout <<
>>>>> u8"..."?
>>>>> iostreams implicitly consults either an imbued locale facet or the
>>>>> global locale for formatting operations. Think about std::cout <<
>>>>> std::chrono::Sunday. Depending on the current locale, this might print
>>>>> "Sun" or a localized weekday name in a locale dependent encoding.
>>>>>
>>>>
>>>> But again, the only thing we care about for u8 is the encoding.
>>>> And I am not aware of std::locale ever impacting that.
>>>>
>>>> I hope the above examples are motivating.
>>>>
>>>>
>>>>
>>>>> >
>>>>> >> 2. std::format() returns a std::string; that rules out selection
>>>>> of an I/O
>>>>> >> dependent encoding.
>>>>> > Same question. Where is the I/O dependent encoding in std::cout <<
>>>>> u8"..."
>>>>> > (that is not also present in std::cout << some_std_string)?
>>>>> In the latter case, we have to assume that some_std_string holds text
>>>>> in
>>>>> the encoding expected on the other end of the stream. We can't do that
>>>>> for u8"...", so we have to transcode to something (or have some other
>>>>> assurance that UTF-8 is intended and expected).
>>>>> >
>>>>> >> 3. std::print() writes to an I/O stream, but has special behavior
>>>>> for writes
>>>>> >> to a terminal; that rules out selection of a terminal encoding (as
>>>>> unnecessary,
>>>>> >> at least in important cases).
>>>>
>>>> > This doesn't apply here, because we're using std::format.
>>>>>
>>>>
>>>> Right, this is one of the reasons I feel less compelled to pursue
>>>> iostream surgery.
>>>> Output behavior is suboptimal on Windows, and unlikely to be fixed.
>>>>
>>>> I am likewise not compelled to pursue iostream support.
>>>>
>>>> Agreed with later remarks below.
>>>>
>>>> Tom.
>>>>
>>>>
>>>>
>>>>> >> 5. std::format() and std::print() should have the same behavior
>>>>> (other than
>>>>> >> that std::print(...) may produce a better result than std::cout <<
>>>>> >> std::format(...) when the output is directed to a terminal).
>>>>> > OK... but this isn't relevant.
>>>>> The above two are relevant because we wouldn't want to differentiate
>>>>> behavior for formatting a u8"..." argument for std::format() vs
>>>>> std::print(). The latter helps to constrain the reasonable options for
>>>>> the former.
>>>>
>>>>
>>>> Right, print just formats and outputs the result
>>>>
>>>>
>>>>> >
>>>>> >> 6. std::format() and std::print() have additional guarantees when
>>>>> the
>>>>> >> ordinary/wide literal encoding is a UTF encoding.
>>>>> > What additional guarantees, and how do they help here?
>>>>>
>>>>> We specify additional constraints for fill characters, display width
>>>>> (well, normative encouragement), and formatting of escaped strings.
>>>>> None
>>>>> of these are relevant for reflection purposes; they help to reinforce
>>>>> a
>>>>> choice to depend on the ordinary/wide literal encoding for behavior of
>>>>> these functions. We don't have such precedent for iostreams.
>>>>>
>>>>
>>>> And you know, the format string is parsed in the ordinary encoding and
>>>> copied as-is
>>>>
>>>>
>>>>>
>>>>> Tom.
>>>>>
>>>>>
>>>>
>>> --
>>> SG16 mailing list
>>> SG16_at_[hidden]
>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>
Received on 2024-05-08 18:53:29