sg16: Re: [SG16] Agenda for the 2021-05-12 SG16 telecon

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Wed, 26 May 2021 05:49:12 -0700

> Do we have to publicly expose these?

Yes, they are useful for library writers in the same way vformat overloads
are.

- Victor

On Tue, May 25, 2021 at 11:59 AM Corentin Jabot via SG16 <
sg16_at_[hidden]> wrote:

>
>
> On Thu, May 20, 2021 at 12:34 AM Tom Honermann via SG16 <
> sg16_at_[hidden]> wrote:
>
>> On 5/19/21 2:17 PM, Victor Zverovich via SG16 wrote:
>>
>> > I'm concerned that deployment experience might be limited to specific
>> environments.
>>
>> The great thing about the current implementation (and proposal) is that
>> it is consistent with printf's behavior on common C standard library
>> implementations on non-Windows platforms, so we have all the deployment
>> experience in the world. This is also why I am reluctant to innovate in
>> this area. There has been a lot of usage experience on Windows as well and
>> there is much less variation there.
>>
>> That is a benefit, but I don't think that is strongly relevant to
>> Hubert's question.
>>
>> The intent of the proposal is to grant additional permissions for
>> implementations to alter behavior based on where the output is directed.
>> Implementation experience for that exists only on Windows, but the wording
>> is (intentionally) written to be agnostic to implementation. Thus, we
>> don't have implementation experience for (all of) this feature outside of
>> Windows at the moment. I understand and appreciate that the proposal is
>> strongly intended to work around a well known Windows deficiency, but it
>> does have applicability elsewhere.
>>
>>
>> > I think the non-Unicode function is awkwardly named.
>>
>> Naming suggestions are welcome!
>>
>>
>> - vprint_unicode()
>>   - => vprint_utf8()
>>     "unicode" is ambiguous, but the specification is clear that UTF-8
>>     is intended.
>>   - => u8vprint()
>>     => vu8print()
>>     I don't recommend these as they may imply char8_t association.
>> - vprint_nonunicode()
>>   - => vprint_mojibake()
>>     If we want to be honest.
>>   - => vprint_polyglot()
>>     This feels pretty accurate to me.
>>   - => vprint_narrow()
>>     This doesn't feel right to me since "narrow" within the standard
>>     includes UTF-8.
>>
>>
> Do we have to publicly expose these?
>
>
>>
>>
>> Tom.
>>
>>
>> Cheers,
>> Victor
>>
>> On Wed, May 12, 2021 at 12:51 PM Hubert Tong <
>> hubert.reinterpretcast_at_[hidden]> wrote:
>>
>>> On Wed, May 12, 2021 at 3:14 PM Victor Zverovich <
>>> victor.zverovich_at_[hidden]> wrote:
>>>
>>>> Hi Hubert,
>>>>
>>>> Thanks for the suggestions, I'll try incorporating them in the next
>>>> iteration of the paper.
>>>>
>>>> > I think it would help if the point was stated more explicitly ...
>>>>
>>>> Good idea, will clarify this.
>>>>
>>>> > The paper can at least acknowledge that "polyglot" string literals
>>>> exist ...
>>>>
>>>> Sure.
>>>>
>>>> > we'll end up with cases where the literal encoding is UTF-8 but the
>>>> user won't want the UTF-8 std::print behaviour to potentially kick in.
>>>>
>>>> I am a bit skeptical because I haven't seen any reports about cases
>>>> like this from the extensive usage experience of this feature. We can't fix
>>>> clearly broken things and be bug-to-bug compatible with legacy APIs at the
>>>> same time.
>>>>
>>>> > At least two cases come to mind.
>>>>
>>>> I don't think we can do much if users decide to lie about the encoding.
>>>> We should make the common case work rather than try making everyone happy
>>>> and support theoretical use cases not backed by actual implementation and
>>>> usage experience.
>>>>
>>>
>>> I'm concerned that deployment experience might be limited to specific
>>> environments. I expect the conditions for the second scenario are met very
>>> easily on *nix and are also very difficult to test for (requires some sort
>>> of special test environment/harness).
>>>
>>>
>>>> That said, they can always use the nonunicode function or continue using
>>>> their legacy APIs in those cases.
>>>>
>>>
>>> I think the non-Unicode function is awkwardly named.
>>>
>>>
>>>>
>>>> Cheers,
>>>> Victor
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, May 11, 2021 at 8:23 PM Hubert Tong <
>>>> hubert.reinterpretcast_at_[hidden]> wrote:
>>>>
>>>>> On Tue, May 11, 2021 at 8:41 PM Victor Zverovich via SG16 <
>>>>> sg16_at_[hidden]> wrote:
>>>>>
>>>>>> Dear Unicoders,
>>>>>>
>>>>>> Here is a link to a new revision of P2093:
>>>>>> https://isocpp.org/files/papers/D2093R6.html. It's essentially the
>>>>>> same as R5 but addresses the latest LEWG feedback and adds a few
>>>>>> clarifications. The only change to the wording is replacing <io> with
>>>>>> <print>.
>>>>>>
>>>>>
>>>>> Thanks Victor.
>>>>>
>>>>> With respect to the transcoding choices, it took me a while to catch
>>>>> on to the statement being made. I think it would help if the point was
>>>>> stated more explicitly that the choice to perform replacement during
>>>>> transcoding is because that is consistent with the treatment of malformed
>>>>> UTF-8 for UTF-8-native terminals and the choice not to transcode in the
>>>>> case where the terminal is UTF-8 native is because we expect the terminal
>>>>> to behave predictably as if we did do the "transcoding".
>>>>>
>>>>> I'm still not entirely convinced about the argument surrounding the
>>>>> choice of using the literal encoding though. The paper can at least
>>>>> acknowledge that "polyglot" string literals exist and that this partially
>>>>> obviates the insistence that the literal encoding being UTF-8 according to
>>>>> the build system/build mode means that the involvement of non-UTF-8 strings
>>>>> in the vicinity of std::print constitutes "mixing encodings".
>>>>>
>>>>> I really think that, just for predictability surrounding the display
>>>>> of substitution text, we'll end up with cases where the literal encoding is
>>>>> UTF-8 but the user won't want the UTF-8 std::print behaviour to potentially
>>>>> kick in.
>>>>>
>>>>> At least two cases come to mind:
>>>>> (1) Printing using both legacy interfaces and std::print where the
>>>>> legacy interfaces are not using UTF-8 may appear fine on some terminals but
>>>>> would result, on redirect, in output with mixed encoding.
>>>>>
>>>>> (2) std::print where the literal encoding is UTF-8 but the literals
>>>>> are all "polyglot" and substitution strings that are not UTF-8 can appear
>>>>> to be okay when redirecting or printing to non-Unicode terminals; however,
>>>>> once deployed to a Unicode terminal, replacement characters show up (even
>>>>> if the output is properly encoded for the underlying C output interface).
>>>>>
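To make the "polyglot" notion in case (2) concrete, here is a minimal sketch. It assumes that a "polyglot" literal is one made up only of 7-bit ASCII bytes, so that it reads the same under UTF-8 and under any ASCII-derived encoding. The helper name is_polyglot_ascii is hypothetical and does not appear in P2093.

```cpp
#include <string_view>

// Hypothetical helper (not from P2093): reports whether every byte is 7-bit
// ASCII. Assumption: such a string is "polyglot" in the sense that it decodes
// identically as UTF-8 and as any ASCII-derived encoding.
constexpr bool is_polyglot_ascii(std::string_view s) {
    for (const char c : s) {
        if (static_cast<unsigned char>(c) >= 0x80) {
            return false;
        }
    }
    return true;
}

// A plain format string is polyglot; the UTF-8 bytes of U+4E0B are not.
static_assert(is_polyglot_ascii("{} {}\n"));
static_assert(!is_polyglot_ascii("\xE4\xB8\x8B"));
```

A check like this only looks at the literal itself. It says nothing about the substitution strings supplied at run time, which is where the second scenario goes wrong.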
>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> Victor
>>>>>>
>>>>>> On Tue, May 11, 2021 at 11:02 AM Tom Honermann via SG16 <
>>>>>> sg16_at_[hidden]> wrote:
>>>>>>
>>>>>>> Reminder that this meeting is taking place tomorrow.
>>>>>>>
>>>>>>> Per suggestion by Peter, the agenda order is being changed to review
>>>>>>> the updates in P2295R2 before D2372R1 and P2093R5 in the hopes that we can
>>>>>>> forward P2295R2 to EWG. We'll try to limit that discussion to 30 minutes.
>>>>>>> The updated agenda is below. Again, we are unlikely to get to P2348R0 at
>>>>>>> all.
>>>>>>>
>>>>>>> - P2295R2: Support for UTF-8 as a portable source file encoding
>>>>>>>   <https://wg21.link/p2295r3>
>>>>>>>   - Review updates intended to address prior SG16 feedback.
>>>>>>> - D2372R1: Fixing locale handling in chrono formatters
>>>>>>>   <https://isocpp.org/files/papers/D2372R1.html>
>>>>>>>   - Affirm or rebut LEWG's position.
>>>>>>> - P2093R5: Formatted output <https://wg21.link/p2093r5>
>>>>>>>   - Discuss locale dependent character encoding concerns.
>>>>>>> - P2348R0: Whitespaces Wording Revamp
>>>>>>>   <https://wg21.link/p2348r0>
>>>>>>>
>>>>>>> Tom.
>>>>>>>
>>>>>>> On 5/4/21 12:06 AM, Tom Honermann via SG16 wrote:
>>>>>>>
>>>>>>> SG16 will hold a telecon on Wednesday, May 12th at 19:30 UTC (timezone
>>>>>>> conversion
>>>>>>> <https://www.timeanddate.com/worldclock/converter.html?iso=20210512T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>
>>>>>>> ).
>>>>>>>
>>>>>>> The agenda is:
>>>>>>>
>>>>>>> - D2372R1: Fixing locale handling in chrono formatters
>>>>>>>   <https://isocpp.org/files/papers/D2372R1.html>
>>>>>>>   - Affirm or rebut LEWG's position.
>>>>>>> - P2093R5: Formatted output <https://wg21.link/p2093r5>
>>>>>>>   - Discuss locale dependent character encoding concerns.
>>>>>>> - P2295R2: Support for UTF-8 as a portable source file
>>>>>>>   encoding <https://wg21.link/p2295r3>
>>>>>>>   - Review updates intended to address prior SG16 feedback.
>>>>>>> - P2348R0: Whitespaces Wording Revamp
>>>>>>>   <https://wg21.link/p2348r0>
>>>>>>>
>>>>>>> Our last telecon was consumed by discussion
>>>>>>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md#april-28th-2021>
>>>>>>> of LWG3547 <https://cplusplus.github.io/LWG/issue3547> and possible
>>>>>>> remedies. Though we did not reach consensus on a direction forward during
>>>>>>> that telecon, Victor and Corentin, at the LEWG chair's request, drafted
>>>>>>> D2372R0, presented it at the LEWG telecon held 2021-05-03
>>>>>>> <https://wiki.edg.com/bin/view/Wg21telecons2021/P2372#2021-05-03>,
>>>>>>> and LEWG reached strong consensus for it. The D2372R0 revision will be
>>>>>>> submitted for the May mailing as P2372R0; and a D2372R1
>>>>>>> <https://isocpp.org/files/papers/D2372R1.html> revision addressing
>>>>>>> LEWG feedback will be submitted as P2372R1. Both revisions substantially
>>>>>>> match the proposed resolution that SG16 discussed. Since SG16 did not
>>>>>>> reach consensus on that direction, the LEWG chair has asked that we revisit
>>>>>>> it to either affirm or rebut the LEWG consensus. We will therefore
>>>>>>> (briefly) discuss and then poll that direction. Note that the poll taken
>>>>>>> in SG16 differs from the poll taken in LEWG. In SG16, we polled applying
>>>>>>> the proposed resolution to C++23 while LEWG polled applying the proposed
>>>>>>> resolution (with amendments to not change behavior for iostream
>>>>>>> manipulators) to C++23 *and* retroactively to C++20.
>>>>>>>
>>>>>>> Once we've dispatched D2372R1, we'll return to the original agenda
>>>>>>> for our last telecon; discussion of P2093R5
>>>>>>> <https://wg21.link/p2093r5> (Formatted output) and P2295R2
>>>>>>> <https://wg21.link/p2295r3> (Support for UTF-8 as a portable source
>>>>>>> file encoding). I've retained P2348R0 <https://wg21.link/p2348r0>
>>>>>>> on the agenda, though I don't expect that we'll get to it.
>>>>>>>
>>>>>>> With regard to P2093R5 <https://wg21.link/p2093r5>, the current
>>>>>>> status is that LEWG has referred the paper back to SG16 for further
>>>>>>> discussion; please see the LEWG meeting minutes here
>>>>>>> <https://wiki.edg.com/bin/view/Wg21telecons2021/P2093#Library-Evolution-2021-04-06>.
>>>>>>> Specifically, LEWG would benefit from additional analysis of previously
>>>>>>> deferred questions
>>>>>>> <http://lists.isocpp.org/lib-ext/2021/03/18189.php> regarding
>>>>>>> character encoding concerns, transcoding requirements (or the lack
>>>>>>> thereof) and the ensuing consequences (or lack thereof).
>>>>>>>
>>>>>>> 1. How errors in transcoding should be handled. E.g., when
>>>>>>> transcoding from UTF-8 to a UTF-16 based console interface and the UTF-8
>>>>>>> input is not well-formed.
>>>>>>> 2. The choice to base behavior on the compile-time choice of
>>>>>>> literal encoding. An implication of the current proposal is that a program
>>>>>>> that contains only ASCII characters in string literals will change behavior
>>>>>>> depending on whether the literal encoding is UTF-8 vs ASCII (or some other
>>>>>>> ASCII derived encoding).
>>>>>>> 3. Whether transcoding to the console interface encoding should
>>>>>>> be performed when the literal encoding is not UTF-8.
>>>>>>> 4. What the implications are for future support of std::print("{}
>>>>>>> {} {} {}", L"Wide text", u8"UTF-8 text", u"UTF-16 text",
>>>>>>> U"UTF-32 text").
>>>>>>>
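For question 1, the following is a minimal sketch of one possible error-handling policy when transcoding UTF-8 to UTF-16: every byte that cannot be decoded is replaced by U+FFFD and decoding resumes at the next byte. The function name utf8_to_utf16_lossy and the one-replacement-per-byte policy are assumptions made for illustration; they are not the wording of P2093, and an implementation could equally substitute one U+FFFD per ill-formed subsequence.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical helper (not from P2093): transcodes UTF-8 to UTF-16 and emits
// U+FFFD for each byte that does not start a well-formed sequence.
std::u16string utf8_to_utf16_lossy(std::string_view in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const auto b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;      // decoded code point
        std::uint32_t min_cp = 0;  // smallest value allowed for this length
        std::size_t len = 0;       // 0 means "invalid lead byte"
        if (b0 < 0x80)                { cp = b0;        len = 1; min_cp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min_cp = 0x10000; }

        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const auto b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;        // not a continuation byte
            else cp = (cp << 6) | (b & 0x3Fu);
        }
        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        if (ok && (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))) {
            ok = false;
        }
        if (!ok) {
            out.push_back(static_cast<char16_t>(0xFFFD));
            ++i;  // resynchronize one byte at a time
            continue;
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}

// Usage sketch: a stray 0xFF byte becomes a single U+FFFD between 'a' and 'z'.
inline void example_lossy_transcoding() {
    const std::u16string out = utf8_to_utf16_lossy("a\xFFz");
    assert(out.size() == 3 && out[1] == 0xFFFD);
}
```

With this policy the transcoder never fails on malformed input, which keeps the output comparable to what a UTF-8-native terminal shows for malformed UTF-8.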
>>>>>>> I think these concerns will be easier to resolve if we first reach
>>>>>>> consensus regarding scenarios in which localized text may be provided in an
>>>>>>> unexpected encoding. The following is a slightly modified example of code
>>>>>>> Hubert previously provided. The example has been modified to explicitly
>>>>>>> opt into localized chrono formatting.
>>>>>>>
>>>>>>> std::print("{:L%p}\n",
>>>>>>> std::chrono::system_clock::now().time_since_epoch());
>>>>>>>
>>>>>>> At issue is the encoding used by locale sensitive chrono
>>>>>>> formatters. The example above contains the %p specifier and is
>>>>>>> locale sensitive because AM/PM designations may be localized. In a Chinese
>>>>>>> locale the desired translation of "PM" is "下午", but the locale will provide
>>>>>>> the translation in the locale encoding. As specified in P2093R5, if the
>>>>>>> literal encoding is UTF-8, then std::print() will expect the
>>>>>>> translation to be provided in UTF-8, but if the locale is not UTF-8-based
>>>>>>> (e.g., Big5; perhaps Shift-JIS for the Japanese 午後 translation), then the
>>>>>>> result is mojibake.
>>>>>>>
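The mismatch described above can be illustrated with a small well-formedness check. This is a sketch only: is_well_formed_utf8 is a hypothetical helper, and the two bytes 0xA4 0x55 are assumed to be the Big5 encoding of "下" (the exact values are illustrative; what matters is that a lead byte in that range is not a valid start of a UTF-8 sequence).

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper: true if s consists only of well-formed UTF-8 sequences.
constexpr bool is_well_formed_utf8(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const auto b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        std::uint32_t min_cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0;        min_cp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
        else return false;  // continuation byte or invalid lead byte
        if (i + len > s.size()) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (b & 0x3Fu);
        }
        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;
        }
        i += len;
    }
    return true;
}

// UTF-8 bytes of "下午" (U+4E0B U+5348): what std::print() would expect.
static_assert(is_well_formed_utf8("\xE4\xB8\x8B\xE5\x8D\x88"));
// Assumed Big5 bytes for "下": not well-formed UTF-8, hence mojibake.
static_assert(!is_well_formed_utf8("\xA4\x55"));
```

A locale-provided translation that fails a check like this would be displayed with replacement characters when the literal encoding is UTF-8 and the output goes to a Unicode console.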
>>>>>>> I had previously suggested the following possible directions we can
>>>>>>> investigate to resolve the encoding concerns.
>>>>>>>
>>>>>>> - Specialize std::locale facets
>>>>>>> <https://en.cppreference.com/w/cpp/locale/locale> and related
>>>>>>> I/O manipulators like std::put_time()
>>>>>>> <https://en.cppreference.com/w/cpp/io/manip/put_time> for char8_t.
>>>>>>> This would allow std::print() to, when the literal encoding is
>>>>>>> UTF-8, opt-in to use of the UTF-8/char8_t facets and I/O
>>>>>>> manipulators.
>>>>>>> - When the literal encoding is UTF-8, stipulate that running the
>>>>>>> program in a non-UTF-8 based locale is non-conforming. This would
>>>>>>> effectively require MSVC programmers, when building code with the
>>>>>>> /utf-8 option, to also force selection of a UTF-8 code page via
>>>>>>> a manifest
>>>>>>> <https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page>
>>>>>>> and require use of Windows 10 build 1903 or later.
>>>>>>> - When the literal encoding is UTF-8, specify that non-UTF-8
>>>>>>> based locale dependent translations be implicitly transcoded (such
>>>>>>> transcoding should never result in errors except perhaps for memory
>>>>>>> allocation failures).
>>>>>>> - Drop the special case handling for the literal encoding being
>>>>>>> UTF-8 and specify that, when bypassing a stream to write directly to the
>>>>>>> console, the output be implicitly transcoded from the current locale
>>>>>>> dependent encoding (whatever it is) to the console encoding (UTF-8).
>>>>>>>
>>>>>>> If we get through all of that, we'll review Corentin's updates in
>>>>>>> P2295R2 <https://wg21.link/p2295r3> to address prior SG16
>>>>>>> feedback. Thank you to everyone that already provided additional feedback
>>>>>>> on the mailing list!
>>>>>>>
>>>>>>> Tom.
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> SG16 mailing list
>>>>>>> SG16_at_[hidden]
>>>>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>>>>
>>>>>>
>>>>>
>>
>>
>

Received on 2021-05-26 07:49:27