sg16: Re: [SG16] Agenda for the 2021-04-28 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 27 Apr 2021 11:42:25 -0400

On 4/27/21 2:34 AM, Corentin Jabot wrote:
>
>
> On Tue, Apr 27, 2021 at 5:57 AM Tom Honermann <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
>
> On 4/26/21 1:04 PM, Corentin Jabot via SG16 wrote:
>>
>>
>> On Mon, Apr 26, 2021 at 6:19 PM Tom Honermann via SG16
>> <sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
>>
>> On 4/19/21 10:58 AM, Tom Honermann via SG16 wrote:
>>>
>>> SG16 will hold a telecon on Wednesday, April 28th at 19:30
>>> UTC (timezone conversion
>>> <https://www.timeanddate.com/worldclock/converter.html?iso=20210428T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>).
>>>
>>> The agenda is:
>>>
>>> * P2093R5: Formatted output <https://wg21.link/p2093r5>
>>> * P2348R0: Whitespaces Wording Revamp
>>> <https://isocpp.org/files/papers/P2348R0.pdf>
>>>
>>> LEWG discussed P2093R5 at their 2021-04-06 telecon and
>>> decided to refer the paper back to SG16 for further
>>> discussion. LEWG meeting minutes are available here
>>> <https://wiki.edg.com/bin/view/Wg21telecons2021/P2093#Library-Evolution-2021-04-06>;
>>> please review them prior to the telecon. LEWG reviewed the
>>> list of prior SG16 deferred questions posted to them here
>>> <http://lists.isocpp.org/lib-ext/2021/03/18189.php>. Of
>>> those, they established consensus on an answer for #2 (they
>>> agreed not to block std::print() on a proposal for
>>> underlying terminal facilities), but referred the rest back
>>> to us. My interpretation of their actions is that LEWG
>>> would like a revision of the paper to address these concerns
>>> based on SG16 input (e.g., discuss design options and SG16
>>> consensus or lack thereof). We'll therefore focus on these
>>> questions at this telecon.
>>>
>>> Hubert provided the following very interesting example usage.
>>>
>>> std::print("{:%r}\n",
>>> std::chrono::system_clock::now().time_since_epoch());
>>>
>>> At issue is the encoding used by locale sensitive chrono
>>> formatters. Search [time.format]
>>> <http://eel.is/c++draft/time.format> for "locale" to find
>>> example chrono format specifiers that are locale dependent.
>>> The example above contains the %r specifier and is locale
>>> sensitive because AM/PM designations may be localized. In a
>>> Chinese locale the desired translation of "PM" is "下午", but
>>> the locale will provide the translation in the locale
>>> encoding. As specified in P2093R5, if the execution
>>> (literal) encoding is UTF-8, than std::print() will expect
>>> the translation to be provided in UTF-8, but if the locale
>>> is not UTF-8-based (e.g., Big5; perhaps Shift-JIS for the
>>> Japanese 午後 translation), then the result is mojibake. This
>>> is a good example of how locale conflates translation and
>>> character encoding.
>>>
>>> Addressing the above will be our first order of business.
>>> Please reserve some time to independently think about this
>>> problem (ignore responses to this message for a few days if
>>> you need to). I am explicitly not listing possible
>>> approaches to address this concern in this message so as to
>>> avoid adding (further) bias in any specific direction. I
>>> suspect the answers to the previously deferred SG16
>>> questions will be easier to answer once this concern is
>>> resolved.
>>>
>> Now that we've all had some time to think about this issue,
>> here are some possible directions we can pursue to resolve
>> it. These are presented in no particular order.
>>
>> * Specialize std::locale facets
>> <https://en.cppreference.com/w/cpp/locale/locale> and
>> related I/O manipulators like std::put_time()
>> <https://en.cppreference.com/w/cpp/io/manip/put_time> for
>> char8_t. This would allow std::print() to, when the
>> literal encoding is UTF-8, opt-in to use of the
>> UTF-8/char8_t facets and I/O manipulators.
>> * When the literal encoding is UTF-8, stipulate that
>> running the program in a non-UTF-8 based locale is
>> non-conforming. This would effectively require MSVC
>> programmers to, when building code with the /utf-8
>> option, to also force selection of a UTF-8 code page via
>> a manifest
>> <https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page>
>> and require use of Windows 10 build 1903 or later.
>> * When the literal encoding is UTF-8, specify that
>> non-UTF-8 based locale dependent translations be
>> implicitly transcoded (such transcoding should never
>> result in errors except perhaps for memory allocation
>> failures).
>> * Drop the special case handling for the literal encoding
>> being UTF-8 and specify that, when bypassing a stream to
>> write directly to the console, that the output be
>> implicitly transcoded from the current locale dependent
>> encoding (whatever it is) to the console encoding (UTF-8).
>>
>>
>> We have 2 things to explain to LEWG for print. And we do not need
>> to operate change to the design, just to explain things to them
>> in a terms they can understand (and they want to rely on our
>> expertise which
>> implies consensus among ourselves)
>>
>> 1. It is always non-sense to interpret a string in encoding X
>> when it is in fact not.
>> 2. From there, if a string literal is in UTF-8, we HAVE to assume
>> the execution encoding is also utf-8. Why rely on the literal
>> encoding and not execution? it is resilient to call to setlocale
>> and more efficient. Also, format strings are likely to be literals.
>> 3. From there if that string is displayed on a
>> terminal/console/screen/tty, it is text. So it has to be rendered
>> correctly. On a specific system (windows) there is a way to
>> enforce that. Because windows has a separate mechanism for
>> unicode display and console handling that exists independently of
>> the C++ execution encoding.
>> 4. "we have to assume" in 2. implies a precondition. That is true
>> REGARDLESS of utf-8 or not. in all cases the format string has to
>> be interpreted as text, which assumes it is valid in the
>> execution encoding. CF the Microsoft STL issue for braces in
>> shift JS.
>> 5. This means that converting to UTF-16 on windows for the
>> purpose of console display is always valid (no ""transcosding""
>> error) within the contract of the function, and as such does not
>> have to be specified. Preconditions violations are UB within the
>> standard library and we should keep doing that. In practice the
>> implementation (which is here the terminal, not the stl) will do
>> character replacement the best it can, or render something horrible.
>
> I agree with all of that, but I don't see how it relates to the
> problematic example above. The issue with the example is that the
> "%r" field specifier may cause non-UTF-8 content supplied by the
> locale to be written.
>
> I see two problems here.
> One is that this should not be locale dependent by default - has that
> been discussed? It seems to run amok of fmt design.
Agreed; other email threads are now addressing that.
>
> The other is that, if print("xxx{}", foo) assumes that xxx is utf8,
> and the formated result is displayed onto a terminal, then the entire
> thing _has to_ be utf-8. note that this is because of
> a precondition on the act if displaying on the terminal which has
> nothing to do with formatting it's a 2 step process format -> print on
> terminal both of which have different preconditions (formating puts a
> requirement on the format string, to parse it, print additionally puts
> preconditions that the resulting thing will be utf8 such that
> individual arguments have to be to.

I think we are agreed here, but perhaps looking at the problem from
different perspectives.

It sounds like your position is, if the locale uses a non-UTF-8 encoding
(when literal encoding is UTF-8), that a precondition violation occurs
and we get UB (effectively, the 2nd possible direction I listed). I
think that is a valid perspective.

Some of the other options that I listed are intended to avoid the
precondition by having std::print() (and std::format()) just do the
right thing by transcoding the locale sensitive data requested by the
format field specifier from the locale encoding to UTF-8.

>>
>> The locale in there is a red herring. Changing the execution
>> encoding is always dicey - all strings that were correctly
>> interpreted correctly before the locale change are potentially no
>> longer
>> correctly interpreted because their encoding no longer matches
>> the new execution encoding.
>> The existence of a setlocale function doesn't imply that calling
>> it leads to sensible results if the locale change also changes
>> the encoding :)
> The example doesn't assume a locale change, at least not beyond an
> initial std::setlocale(LC_ALL, "") during program startup.
>>
>>
>> > Specialize std::locale facets
>> <https://en.cppreference.com/w/cpp/locale/locale> and related I/O
>> manipulators like std::put_time()
>> <https://en.cppreference.com/w/cpp/io/manip/put_time> for
>> char8_t. This would allow std::print() to, when the literal
>> encoding is UTF-8, opt-in to use of the UTF-8/char8_t facets and
>> I/O manipulators.
>>
>> This is a different issue, one Peter and I have discussed. we
>> should not try to shove char into char8_t. Both char8_t and utf-8
>> char are valid use cases. Also, the whole point of fmt::print is
>> to avoid all of that :)
>
> I think this is strongly related, or we are misunderstanding each
> other. I see the point of std::print() being to bypass the
> implicit (wrong) console transcoding.
>
> fmt::print just dumps the bytes in the general case, similarly to
> printf, that is then interpreted incorrectly by the windows console. I
> don't see where there might be transcoding
> in the program (I expect the console to do interesting things, but
> that's outside of C++).
>
> C++ thinks a string is Utf-8
> System (incorrectly) disagrees
> System has a method that allows it to agree
> Do we use that method?

I think we've been focusing on different things here. The issue I'm
trying to discuss is independent of use of the
write-directly-to-the-console method. This discussion is about having
std::print() (and std::format()) internally ensure that that format
arguments provided by the locale are transcoded to match the encoding of
the format string. This happens before anything is written to the
console; this is the step where the formatting is done and the intent is
to ensure that well-formed text is produced *before* it is transcoded to
the native console encoding (whether that be UTF-8, UTF-16, whatever).
Transcoding requires well-formed input of course.

Does this help to get us on the same page?

>
> I strongly agree that char8_t and UTF-8 char are valid use cases.
>
>>
>> > When the literal encoding is UTF-8, stipulate that running the
>> program in a non-UTF-8 based locale is non-conforming. This
>> would effectively require MSVC programmers to, when building code
>> with the /utf-8 option, to also force selection of a UTF-8 code
>> page via a manifest
>> <https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page>
>> and require use of Windows 10 build 1903 or later.
>>
>> If you program contains literals that are not correctly
>> interpreted by the execution encoding, the behavior of your
>> program cannot be correct <insert scary U word>. So they should
>> probably do that but it seems out of scope.
>> The literalS encoding and the execution encoding should be
>> consistent (each string literal should be correctly interpreted).
>>
>> > When the literal encoding is UTF-8, specify that non-UTF-8
>> based locale dependent translations be implicitly transcoded
>> Sorry, can you detail what you mean? I do not understand, sorry
> In the example above, the "%r" field specifier indicates that a
> locale dependent 12-hour clock time be formatted. The AM/PM
> designator to be formatted is locale dependent. If the locale is
> not UTF-8 based, then mojibake is produced (if the literal
> encoding is UTF-8). This suggestion addresses the problem by
> implicitly transcoding the locale dependent AM/PM designator from
> the locale encoding to UTF-8 when formatting the output.
>
>
> Think about cases in which that can happen
> There is a non-utf8 locale and a utf8 string literal mixed together.
Yes, exactly, that is the issue. This discussion is about what we do
about it. We can call it UB (though I don't find that particularly
reasonable) or we can specify that locale provided strings be implicitly
transcoded (within std::print() / std::format()) to UTF-8 (to match the
encoding of the format string).
>
>>
>> > Drop the special case handling for the literal encoding being
>> UTF-8 and specify that, when bypassing a stream to write directly
>> to the console, that the output be implicitly transcoded from the
>> current locale dependent encoding (whatever it is) to the console
>> encoding (UTF-8).
>>
>> Dropping the special case seems more difficult in terms of wording.
> I think it is simpler actually; we would just have to say that the
> implicit transcoding is from the locale encoding to the console
> encoding.
>
>
> It's really hard to know what the console encoding is (it is a very
> microsoft specific thing), and the windows console basically have a
> wide (utf16) and narrow encoding (not sure it works exactly like that
> but it's a good enough model)
> Transcoding in the general case might be worse.

I think we're talking about different things here again. I meant the
native console encoding; e.g., the encoding that Microsoft's
WriteConsoleW() expects (UTF-16). I don't mean the broken ANSI console
encoding.

Tom.

> A wording that encourages vendors to... encourage utf8 content to not
> be misinterpreted as something else might help but good luck wording that!
> Especially as it needs to handle file redirection, etc
>
>> If everything else fails, Microsoft could do the sensible thing
>> as a matter of QOL.
>
> Agreed.
>
> Tom.
>
>>
>> Please feel free to comment on these, or additional,
>> approaches before our meeting on Wednesday.
>>
>> I think it would benefit LEWG if a revision of the paper
>> presented each of these possibilities, the consequences, and
>> the rationale (and hopefully SG16 consensus) for the proposed
>> direction.
>>
>> Tom.
>>
>>> I do not intend to time limit discussion of P2093R5 as I
>>> believe this is an important matter to resolve. If we are
>>> able to complete discussion of P2093R5, then we'll discuss
>>> P2348R0.
>>>
>>> Tom.
>>>
>>>
>>
>> --
>> SG16 mailing list
>> SG16_at_[hidden] <mailto:SG16_at_[hidden]>
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>>
>

Received on 2021-04-27 10:42:29