sg16: Re: [SG16] Agenda for the 2021-05-12 SG16 telecon

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Wed, 19 May 2021 18:15:00 -0400

On Wed, May 19, 2021 at 2:17 PM Victor Zverovich <victor.zverovich_at_[hidden]>
wrote:

> > I'm concerned that deployment experience might be limited to specific
> environments.
>
> The great thing about the current implementation (and proposal) is that it
> is consistent with printf's behavior on common C standard library
> implementations on non-Windows platforms, so we have all the deployment
> experience in the world.
>

By virtue of having wording around Unicode-native output devices
(especially ones requiring transcoding from UTF-8) that is only known to
exist for Windows at this time. But that's not the point: Do the users of
the current implementation deploy their applications to a diverse range of
environments? How much Internationalization, Localization, and
Globalization do they do? How old is their code base?

> This is also why I am reluctant to innovate in this area. There has been a
> lot of usage experience on Windows as well and there is much less variation
> there.
>
> > I think the non-Unicode function is awkwardly named.
>
> Naming suggestions are welcome!
>

std::iostreams::print for `print` that never calls the _unicode versions.

>
> Cheers,
> Victor
>
> On Wed, May 12, 2021 at 12:51 PM Hubert Tong <
> hubert.reinterpretcast_at_[hidden]> wrote:
>
>> On Wed, May 12, 2021 at 3:14 PM Victor Zverovich <
>> victor.zverovich_at_[hidden]> wrote:
>>
>>> Hi Hubert,
>>>
>>> Thanks for the suggestions, I'll try incorporating them in the next
>>> iteration of the paper.
>>>
>>> > I think it would help if the point was stated more explicitly ...
>>>
>>> Good idea, will clarify this.
>>>
>>> > The paper can at least acknowledge that "polyglot" string literals
>>> exist ...
>>>
>>> Sure.
>>>
>>> > we'll end up with cases where the literal encoding is UTF-8 but the
>>> user won't want the UTF-8 std::print behaviour to potentially kick in.
>>>
>>> I am a bit skeptical because I haven't seen any reports about cases like
>>> this from the extensive usage experience of this feature. We can't fix
>>> clearly broken things and be bug-to-bug compatible with legacy APIs at the
>>> same time.
>>>
>>> > At least two cases come to mind.
>>>
>>> I don't think we can do much if users decide to lie about the encoding.
>>> We should make the common case work rather than try making everyone happy
>>> and support theoretical use cases not backed by actual implementation and
>>> usage experience.
>>>
>>
>> I'm concerned that deployment experience might be limited to specific
>> environments. I expect the conditions for the second scenario are met very
>> easily on *nix and also very difficult to test for (requires some sort of
>> special test environment/harness).
>>
>>
>>> That said, they can always use the nonunicode function or continue using
>>> their legacy APIs in those cases.
>>>
>>
>> I think the non-Unicode function is awkwardly named.
>>
>>
>>>
>>> Cheers,
>>> Victor
>>>
>>>
>>>
>>>
>>> On Tue, May 11, 2021 at 8:23 PM Hubert Tong <
>>> hubert.reinterpretcast_at_[hidden]> wrote:
>>>
>>>> On Tue, May 11, 2021 at 8:41 PM Victor Zverovich via SG16 <
>>>> sg16_at_[hidden]> wrote:
>>>>
>>>>> Dear Unicoders,
>>>>>
>>>>> Here is a link to a new revision of P2093:
>>>>> https://isocpp.org/files/papers/D2093R6.html. It's essentially the
>>>>> same as R5 but addresses the latest LEWG feedback and adds a few
>>>>> clarifications. The only change to the wording is replacing <io> with
>>>>> <print>.
>>>>>
>>>>
>>>> Thanks Victor.
>>>>
>>>> With respect to the choices around transcoding, it took me a while to
>>>> catch on to the statement being made. I think it would help if the point
>>>> was stated more explicitly: the choice to perform replacement during
>>>> transcoding is made because that is consistent with the treatment of
>>>> malformed UTF-8 by UTF-8-native terminals, and the choice not to transcode
>>>> in the case where the terminal is UTF-8 native is made because we expect
>>>> the terminal to behave predictably, as if we did do the "transcoding".
>>>>
>>>> I'm still not entirely convinced about the argument surrounding the
>>>> choice of using the literal encoding though. The paper can at least
>>>> acknowledge that "polyglot" string literals exist and that this partially
>>>> obviates the insistence that the literal encoding being UTF-8 according to
>>>> the build system/build mode means that the involvement of non-UTF-8 strings
>>>> in the vicinity of std::print constitutes "mixing encodings".
>>>>
>>>> I really think that, just for predictability surrounding the display of
>>>> substitution text, we'll end up with cases where the literal encoding is
>>>> UTF-8 but the user won't want the UTF-8 std::print behaviour to potentially
>>>> kick in.
>>>>
>>>> At least two cases come to mind:
>>>> (1) Printing using both legacy interfaces and std::print where the
>>>> legacy interfaces are not using UTF-8 may appear fine on some terminals but
>>>> would result, on redirect, in output with mixed encoding.
>>>>
>>>> (2) std::print where the literal encoding is UTF-8 but the literals are
>>>> all "polyglot" and substitution strings that are not UTF-8 can appear to be
>>>> okay when redirecting or printing to non-Unicode terminals; however, once
>>>> deployed to a Unicode terminal, replacement characters show up (even if the
>>>> output is properly encoded for the underlying C output interface).
>>>>
>>>>
>>>>>
>>>>> Cheers,
>>>>> Victor
>>>>>
>>>>> On Tue, May 11, 2021 at 11:02 AM Tom Honermann via SG16 <
>>>>> sg16_at_[hidden]> wrote:
>>>>>
>>>>>> Reminder that this meeting is taking place tomorrow.
>>>>>>
>>>>>> Per suggestion by Peter, the agenda order is being changed to review
>>>>>> the updates in P2295R2 before D2372R1 and P2093R5 in the hopes that we can
>>>>>> forward P2295R2 to EWG. We'll try to limit that discussion to 30 minutes.
>>>>>> The updated agenda is below. Again, we are unlikely to get to P2348R0 at
>>>>>> all.
>>>>>>
>>>>>> - P2295R2: Support for UTF-8 as a portable source file encoding
>>>>>> <https://wg21.link/p2295r3>
>>>>>> - Review updates intended to address prior SG16 feedback.
>>>>>> - D2372R1: Fixing locale handling in chrono formatters
>>>>>> <https://isocpp.org/files/papers/D2372R1.html>
>>>>>> - Affirm or rebut LEWG's position.
>>>>>> - P2093R5: Formatted output <https://wg21.link/p2093r5>
>>>>>> - Discuss locale dependent character encoding concerns.
>>>>>> - P2348R0: Whitespaces Wording Revamp
>>>>>> <https://wg21.link/p2348r0>
>>>>>>
>>>>>> Tom.
>>>>>>
>>>>>> On 5/4/21 12:06 AM, Tom Honermann via SG16 wrote:
>>>>>>
>>>>>> SG16 will hold a telecon on Wednesday, May 12th at 19:30 UTC (timezone
>>>>>> conversion
>>>>>> <https://www.timeanddate.com/worldclock/converter.html?iso=20210512T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>
>>>>>> ).
>>>>>>
>>>>>> The agenda is:
>>>>>>
>>>>>> - D2372R1: Fixing locale handling in chrono formatters
>>>>>> <https://isocpp.org/files/papers/D2372R1.html>
>>>>>> - Affirm or rebut LEWG's position.
>>>>>> - P2093R5: Formatted output <https://wg21.link/p2093r5>
>>>>>> - Discuss locale dependent character encoding concerns.
>>>>>> - P2295R2: Support for UTF-8 as a portable source file encoding
>>>>>> <https://wg21.link/p2295r3>
>>>>>> - Review updates intended to address prior SG16 feedback.
>>>>>> - P2348R0: Whitespaces Wording Revamp
>>>>>> <https://wg21.link/p2348r0>
>>>>>>
>>>>>> Our last telecon was consumed by discussion
>>>>>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md#april-28th-2021>
>>>>>> of LWG3547 <https://cplusplus.github.io/LWG/issue3547> and possible
>>>>>> remedies. Though we did not reach consensus on a direction forward during
>>>>>> that telecon, Victor and Corentin, at the LEWG chair's request, drafted
>>>>>> D2372R0, presented it at the LEWG telecon held 2021-05-03
>>>>>> <https://wiki.edg.com/bin/view/Wg21telecons2021/P2372#2021-05-03>,
>>>>>> and LEWG reached strong consensus for it. The D2372R0 revision will be
>>>>>> submitted for the May mailing as P2372R0; and a D2372R1
>>>>>> <https://isocpp.org/files/papers/D2372R1.html> revision addressing
>>>>>> LEWG feedback will be submitted as P2372R1. Both revisions substantially
>>>>>> match the proposed resolution that SG16 discussed. Since SG16 did not
>>>>>> reach consensus on that direction, the LEWG chair has asked that we revisit
>>>>>> it to either affirm or rebut the LEWG consensus. We will therefore
>>>>>> (briefly) discuss and then poll that direction. Note that the poll taken
>>>>>> in SG16 differs from the poll taken in LEWG. In SG16, we polled applying
>>>>>> the proposed resolution to C++23 while LEWG polled applying the proposed
>>>>>> resolution (with amendments to not change behavior for iostream
>>>>>> manipulators) to C++23 *and* retroactively to C++20.
>>>>>>
>>>>>> Once we've dispatched D2372R1, we'll return to the original agenda
>>>>>> for our last telecon; discussion of P2093R5
>>>>>> <https://wg21.link/p2093r5> (Formatted output) and P2295R2
>>>>>> <https://wg21.link/p2295r3> (Support for UTF-8 as a portable source
>>>>>> file encoding). I've retained P2348R0 <https://wg21.link/p2348r0>
>>>>>> on the agenda, though I don't expect that we'll get to it.
>>>>>>
>>>>>> With regard to P2093R5 <https://wg21.link/p2093r5>, the current
>>>>>> status is that LEWG has referred the paper back to SG16 for further
>>>>>> discussion; please see the LEWG meeting minutes here
>>>>>> <https://wiki.edg.com/bin/view/Wg21telecons2021/P2093#Library-Evolution-2021-04-06>.
>>>>>> Specifically, LEWG would benefit from additional analysis of previously
>>>>>> deferred questions
>>>>>> <http://lists.isocpp.org/lib-ext/2021/03/18189.php> regarding
>>>>>> character encoding concerns, transcoding requirements (or the lack
>>>>>> thereof) and the ensuing consequences (or lack thereof).
>>>>>>
>>>>>> 1. How errors in transcoding should be handled. E.g., when
>>>>>> transcoding from UTF-8 to a UTF-16 based console interface and the UTF-8
>>>>>> input is not well-formed.
>>>>>> 2. The choice to base behavior on the compile-time choice of
>>>>>> literal encoding. An implication of the current proposal is that a program
>>>>>> that contains only ASCII characters in string literals will change behavior
>>>>>> depending on whether the literal encoding is UTF-8 vs ASCII (or some other
>>>>>> ASCII derived encoding).
>>>>>> 3. Whether transcoding to the console interface encoding should
>>>>>> be performed when the literal encoding is not UTF-8.
>>>>>> 4. What the implications are for future support of std::print("{}
>>>>>> {} {} {}", L"Wide text", u8"UTF-8 text", u"UTF-16 text", U"UTF-32
>>>>>> text").
>>>>>>
>>>>>> I think these concerns will be easier to resolve if we first reach
>>>>>> consensus regarding scenarios in which localized text may be provided in an
>>>>>> unexpected encoding. The following is a slightly modified example of code
>>>>>> Hubert previously provided. The example has been modified to explicitly
>>>>>> opt into localized chrono formatting.
>>>>>>
>>>>>> std::print("{:L%p}\n",
>>>>>> std::chrono::system_clock::now().time_since_epoch());
>>>>>>
>>>>>> At issue is the encoding used by locale sensitive chrono formatters.
>>>>>> The example above contains the %p specifier and is locale sensitive
>>>>>> because AM/PM designations may be localized. In a Chinese locale the
>>>>>> desired translation of "PM" is "下午", but the locale will provide the
>>>>>> translation in the locale encoding. As specified in P2093R5, if the
>>>>>> literal encoding is UTF-8, then std::print() will expect the
>>>>>> translation to be provided in UTF-8, but if the locale is not UTF-8-based
>>>>>> (e.g., Big5; perhaps Shift-JIS for the Japanese 午後 translation), then the
>>>>>> result is mojibake.
>>>>>>
>>>>>> I had previously suggested the following possible directions we can
>>>>>> investigate to resolve the encoding concerns.
>>>>>>
>>>>>> - Specialize std::locale facets
>>>>>> <https://en.cppreference.com/w/cpp/locale/locale> and related I/O
>>>>>> manipulators like std::put_time()
>>>>>> <https://en.cppreference.com/w/cpp/io/manip/put_time> for char8_t.
>>>>>> This would allow std::print() to, when the literal encoding is
>>>>>> UTF-8, opt-in to use of the UTF-8/char8_t facets and I/O
>>>>>> manipulators.
>>>>>> - When the literal encoding is UTF-8, stipulate that running the
>>>>>> program in a non-UTF-8 based locale is non-conforming. This would
>>>>>> effectively require MSVC programmers, when building code with the
>>>>>> /utf-8 option, to also force selection of a UTF-8 code page via a
>>>>>> manifest
>>>>>> <https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page>
>>>>>> and require use of Windows 10 build 1903 or later.
>>>>>> - When the literal encoding is UTF-8, specify that non-UTF-8
>>>>>> based locale dependent translations be implicitly transcoded (such
>>>>>> transcoding should never result in errors except perhaps for memory
>>>>>> allocation failures).
>>>>>> - Drop the special case handling for the literal encoding being
>>>>>> UTF-8 and specify that, when bypassing a stream to write directly to the
>>>>>> console, the output be implicitly transcoded from the current locale
>>>>>> dependent encoding (whatever it is) to the console encoding (UTF-8).
>>>>>>
>>>>>> If we get through all of that, we'll review Corentin's updates in
>>>>>> P2295R2 <https://wg21.link/p2295r3> to address prior SG16 feedback.
>>>>>> Thank you to everyone that already provided additional feedback on the
>>>>>> mailing list!
>>>>>>
>>>>>> Tom.
>>>>>>
>>>>>>
>>>>>> --
>>>>>> SG16 mailing list
>>>>>> SG16_at_[hidden]
>>>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>>>
>>>>>
>>>>
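
To make scenario (2) in the quoted text above concrete, here is a minimal
sketch of what happens when substitution text that is not UTF-8 is treated as
UTF-8 and malformed input is replaced. It is not taken from P2093 or from any
implementation: the names well_formed_length, replace_malformed and
check_examples are hypothetical, and emitting one U+FFFD per offending byte is
an assumption of the sketch (terminals and transcoders may group bytes
differently).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Length of the well-formed UTF-8 sequence starting at s[i] (i < s.size()),
// or 0 if the bytes there are not well-formed UTF-8.
std::size_t well_formed_length(std::string_view s, std::size_t i) {
    auto byte = [&](std::size_t k) -> unsigned {
        return static_cast<unsigned char>(s[k]);
    };
    auto cont = [&](std::size_t k) -> bool {  // continuation byte 80..BF?
        return k < s.size() && (byte(k) & 0xC0u) == 0x80u;
    };
    const unsigned b0 = byte(i);
    if (b0 < 0x80u) return 1;                          // ASCII
    if (b0 >= 0xC2u && b0 <= 0xDFu)                    // two-byte sequence
        return cont(i + 1) ? 2 : 0;
    if (b0 >= 0xE0u && b0 <= 0xEFu) {                  // three-byte sequence
        if (!cont(i + 1) || !cont(i + 2)) return 0;
        const unsigned b1 = byte(i + 1);
        if (b0 == 0xE0u && b1 < 0xA0u) return 0;       // overlong form
        if (b0 == 0xEDu && b1 > 0x9Fu) return 0;       // surrogate range
        return 3;
    }
    if (b0 >= 0xF0u && b0 <= 0xF4u) {                  // four-byte sequence
        if (!cont(i + 1) || !cont(i + 2) || !cont(i + 3)) return 0;
        const unsigned b1 = byte(i + 1);
        if (b0 == 0xF0u && b1 < 0x90u) return 0;       // overlong form
        if (b0 == 0xF4u && b1 > 0x8Fu) return 0;       // above U+10FFFF
        return 4;
    }
    return 0;  // stray continuation byte, C0/C1, or F5..FF
}

// Copy s, substituting U+FFFD (EF BF BD) for each byte that does not start
// a well-formed sequence. One replacement per byte is this sketch's choice.
std::string replace_malformed(std::string_view s) {
    std::string out;
    for (std::size_t i = 0; i < s.size();) {
        if (const std::size_t n = well_formed_length(s, i)) {
            out.append(s.data() + i, n);
            i += n;
        } else {
            out += "\xEF\xBF\xBD";
            ++i;  // resynchronize one byte at a time
        }
    }
    return out;
}

// Usage sketch: call from main() to exercise the functions above.
void check_examples() {
    // A "polyglot" (pure ASCII) literal is unaffected.
    assert(replace_malformed("PM") == "PM");
    // U+4E0B U+5348 encoded as UTF-8 passes through unchanged.
    assert(replace_malformed("\xE4\xB8\x8B\xE5\x8D\x88") ==
           "\xE4\xB8\x8B\xE5\x8D\x88");
    // ISO-8859-1 text: the byte E9 is not a complete UTF-8 sequence here.
    assert(replace_malformed("caf\xE9") == "caf\xEF\xBF\xBD");
}
```

The last check is the point: the bytes are correct for the encoding the C
output interface expects, and they survive a redirect untouched, but a path
that assumes UTF-8 shows a replacement character instead.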

Received on 2021-05-19 17:15:33