Re: [SG16] Agenda for the 2021-05-12 SG16 telecon

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Wed, 26 May 2021 14:55:02 +0200
On Wed, May 26, 2021 at 2:49 PM Victor Zverovich <victor.zverovich_at_[hidden]>
wrote:

> > Do we have to publicly expose these?
>
> Yes, they are useful for library writers in the same way vformat overloads
> are.
>

What if there is a single public overload?

print
vprint
__vprint_unicode
__vprint

I don't know that users can meaningfully use vprint_(non)unicode.


>
> - Victor
>
> On Tue, May 25, 2021 at 11:59 AM Corentin Jabot via SG16 <
> sg16_at_[hidden]> wrote:
>
>>
>>
>> On Thu, May 20, 2021 at 12:34 AM Tom Honermann via SG16 <
>> sg16_at_[hidden]> wrote:
>>
>>> On 5/19/21 2:17 PM, Victor Zverovich via SG16 wrote:
>>>
>>> > I'm concerned that deployment experience might be limited to specific
>>> environments.
>>>
>>> The great thing about the current implementation (and proposal) is that
>>> it is consistent with printf's behavior on common C standard library
>>> implementations on non-Windows platforms, so we have all the deployment
>>> experience in the world. This is also why I am reluctant to innovate in
>>> this area. There has been a lot of usage experience on Windows as well and
>>> there is much less variation there.
>>>
>>> That is a benefit, but I don't think that is strongly relevant to
>>> Hubert's question.
>>>
>>> The intent of the proposal is to grant additional permissions for
>>> implementations to alter behavior based on where the output is directed.
>>> Implementation experience only exists for that for Windows, but the wording
>>> is (intentionally) written to be agnostic to implementation. Thus, we
>>> don't have implementation experience for (all of) this feature outside of
>>> Windows at the moment. I understand and appreciate that the proposal is
>>> strongly intended to work around a well known Windows deficiency, but it
>>> does have applicability elsewhere.
>>>
>>>
>>> > I think the non-Unicode function is awkwardly named.
>>>
>>> Naming suggestions are welcome!
>>>
>>>
>>> - vprint_unicode()
>>>   - => vprint_utf8()
>>>     "unicode" is ambiguous, but the specification is clear that UTF-8
>>>     is intended.
>>>   - => u8vprint()
>>>     => vu8print()
>>>     I don't recommend these as they may imply char8_t association.
>>> - vprint_nonunicode()
>>>   - => vprint_mojibake()
>>>     If we want to be honest.
>>>   - => vprint_polyglot()
>>>     This feels pretty accurate to me.
>>>   - => vprint_narrow()
>>>     This doesn't feel right to me since "narrow" within the standard
>>>     includes UTF-8.
>>>
>>>
>> Do we have to publicly expose these?
>>
>>
>>>
>>> Tom.
>>>
>>>
>>> Cheers,
>>> Victor
>>>
>>> On Wed, May 12, 2021 at 12:51 PM Hubert Tong <
>>> hubert.reinterpretcast_at_[hidden]> wrote:
>>>
>>>> On Wed, May 12, 2021 at 3:14 PM Victor Zverovich <
>>>> victor.zverovich_at_[hidden]> wrote:
>>>>
>>>>> Hi Hubert,
>>>>>
>>>>> Thanks for the suggestions, I'll try incorporating them in the next
>>>>> iteration of the paper.
>>>>>
>>>>> > I think it would help if the point was stated more explicitly ...
>>>>>
>>>>> Good idea, will clarify this.
>>>>>
>>>>> > The paper can at least acknowledge that "polyglot" string literals
>>>>> exist ...
>>>>>
>>>>> Sure.
>>>>>
>>>>> > we'll end up with cases where the literal encoding is UTF-8 but the
>>>>> user won't want the UTF-8 std::print behaviour to potentially kick in.
>>>>>
>>>>> I am a bit skeptical because I haven't seen any reports about cases
>>>>> like this from the extensive usage experience of this feature. We can't fix
>>>>> clearly broken things and be bug-to-bug compatible with legacy APIs at the
>>>>> same time.
>>>>>
>>>>> > At least two cases come to mind.
>>>>>
>>>>> I don't think we can do much if users decide to lie about the
>>>>> encoding. We should make the common case work rather than try making
>>>>> everyone happy and support theoretical use cases not backed by actual
>>>>> implementation and usage experience.
>>>>>
>>>>
>>>> I'm concerned that deployment experience might be limited to specific
>>>> environments. I expect the conditions for the second scenario are met very
>>>> easily on *nix and also very difficult to test for (requires some sort of
>>>> special test environment/harness).
>>>>
>>>>
>>>>> That said they can always use the nonunicode function or continue using
>>>>> their legacy APIs in those cases.
>>>>>
>>>>
>>>> I think the non-Unicode function is awkwardly named.
>>>>
>>>>
>>>>>
>>>>> Cheers,
>>>>> Victor
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, May 11, 2021 at 8:23 PM Hubert Tong <
>>>>> hubert.reinterpretcast_at_[hidden]> wrote:
>>>>>
>>>>>> On Tue, May 11, 2021 at 8:41 PM Victor Zverovich via SG16 <
>>>>>> sg16_at_[hidden]> wrote:
>>>>>>
>>>>>>> Dear Unicoders,
>>>>>>>
>>>>>>> Here is a link to a new revision of P2093:
>>>>>>> https://isocpp.org/files/papers/D2093R6.html. It's essentially the
>>>>>>> same as R5 but addresses the latest LEWG feedback and adds a few
>>>>>>> clarifications. The only change to the wording is replacing <io> with
>>>>>>> <print>.
>>>>>>>
>>>>>>
>>>>>> Thanks Victor.
>>>>>>
>>>>>> With respect to the transcoding choices, it took me a while to
>>>>>> catch on to the statement being made. I think it would help if the point
>>>>>> was stated more explicitly: the choice to perform replacement during
>>>>>> transcoding is because that is consistent with the treatment of malformed
>>>>>> UTF-8 on UTF-8-native terminals, and the choice not to transcode in the
>>>>>> case where the terminal is UTF-8 native is because we expect the terminal
>>>>>> to behave predictably, as if we did do the "transcoding".
>>>>>>
>>>>>> I'm still not entirely convinced about the argument surrounding the
>>>>>> choice of using the literal encoding though. The paper can at least
>>>>>> acknowledge that "polyglot" string literals exist and that their
>>>>>> existence partially obviates the insistence that the literal encoding
>>>>>> being UTF-8 according to the build system/build mode means that the
>>>>>> involvement of non-UTF-8 strings in the vicinity of std::print
>>>>>> constitutes "mixing encodings".
>>>>>>
>>>>>> I really think that, just for predictability surrounding the display
>>>>>> of substitution text, we'll end up with cases where the literal encoding is
>>>>>> UTF-8 but the user won't want the UTF-8 std::print behaviour to potentially
>>>>>> kick in.
>>>>>>
>>>>>> At least two cases come to mind:
>>>>>> (1) Printing using both legacy interfaces and std::print where the
>>>>>> legacy interfaces are not using UTF-8 may appear fine on some terminals but
>>>>>> would result, on redirect, in output with mixed encoding.
>>>>>>
>>>>>> (2) std::print where the literal encoding is UTF-8 but the literals
>>>>>> are all "polyglot" and substitution strings that are not UTF-8 can appear
>>>>>> to be okay when redirecting or printing to non-Unicode terminals; however,
>>>>>> once deployed to a Unicode terminal, replacement characters show up (even
>>>>>> if the output is properly encoded for the underlying C output interface).
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Victor
>>>>>>>
>>>>>>> On Tue, May 11, 2021 at 11:02 AM Tom Honermann via SG16 <
>>>>>>> sg16_at_[hidden]> wrote:
>>>>>>>
>>>>>>>> Reminder that this meeting is taking place tomorrow.
>>>>>>>>
>>>>>>>> Per suggestion by Peter, the agenda order is being changed to
>>>>>>>> review the updates in P2295R2 before D2372R1 and P2093R5 in the hopes that
>>>>>>>> we can forward P2295R2 to EWG. We'll try to limit that discussion to 30
>>>>>>>> minutes. The updated agenda is below. Again, we are unlikely to get to
>>>>>>>> P2348R0 at all.
>>>>>>>>
>>>>>>>> - P2295R2: Support for UTF-8 as a portable source file encoding
>>>>>>>>   <https://wg21.link/p2295r3>
>>>>>>>>   - Review updates intended to address prior SG16 feedback.
>>>>>>>> - D2372R1: Fixing locale handling in chrono formatters
>>>>>>>>   <https://isocpp.org/files/papers/D2372R1.html>
>>>>>>>>   - Affirm or rebut LEWG's position.
>>>>>>>> - P2093R5: Formatted output <https://wg21.link/p2093r5>
>>>>>>>>   - Discuss locale dependent character encoding concerns.
>>>>>>>> - P2348R0: Whitespaces Wording Revamp
>>>>>>>>   <https://wg21.link/p2348r0>
>>>>>>>>
>>>>>>>> Tom.
>>>>>>>>
>>>>>>>> On 5/4/21 12:06 AM, Tom Honermann via SG16 wrote:
>>>>>>>>
>>>>>>>> SG16 will hold a telecon on Wednesday, May 12th at 19:30 UTC (timezone
>>>>>>>> conversion
>>>>>>>> <https://www.timeanddate.com/worldclock/converter.html?iso=20210512T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>
>>>>>>>> ).
>>>>>>>>
>>>>>>>> The agenda is:
>>>>>>>>
>>>>>>>> - D2372R1: Fixing locale handling in chrono formatters
>>>>>>>>   <https://isocpp.org/files/papers/D2372R1.html>
>>>>>>>>   - Affirm or rebut LEWG's position.
>>>>>>>> - P2093R5: Formatted output <https://wg21.link/p2093r5>
>>>>>>>>   - Discuss locale dependent character encoding concerns.
>>>>>>>> - P2295R2: Support for UTF-8 as a portable source file
>>>>>>>>   encoding <https://wg21.link/p2295r3>
>>>>>>>>   - Review updates intended to address prior SG16 feedback.
>>>>>>>> - P2348R0: Whitespaces Wording Revamp
>>>>>>>>   <https://wg21.link/p2348r0>
>>>>>>>>
>>>>>>>> Our last telecon was consumed by discussion
>>>>>>>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md#april-28th-2021>
>>>>>>>> of LWG3547 <https://cplusplus.github.io/LWG/issue3547> and
>>>>>>>> possible remedies. Though we did not reach consensus on a direction
>>>>>>>> forward during that telecon, Victor and Corentin, at the LEWG chair's
>>>>>>>> request, drafted D2372R0, presented it at the LEWG telecon held
>>>>>>>> 2021-05-03
>>>>>>>> <https://wiki.edg.com/bin/view/Wg21telecons2021/P2372#2021-05-03>,
>>>>>>>> and LEWG reached strong consensus for it. The D2372R0 revision will be
>>>>>>>> submitted for the May mailing as P2372R0; and a D2372R1
>>>>>>>> <https://isocpp.org/files/papers/D2372R1.html> revision addressing
>>>>>>>> LEWG feedback will be submitted as P2372R1. Both revisions substantially
>>>>>>>> match the proposed resolution that SG16 discussed. Since SG16 did not
>>>>>>>> reach consensus on that direction, the LEWG chair has asked that we revisit
>>>>>>>> it to either affirm or rebut the LEWG consensus. We will therefore
>>>>>>>> (briefly) discuss and then poll that direction. Note that the poll taken
>>>>>>>> in SG16 differs from the poll taken in LEWG. In SG16, we polled applying
>>>>>>>> the proposed resolution to C++23 while LEWG polled applying the proposed
>>>>>>>> resolution (with amendments to not change behavior for iostream
>>>>>>>> manipulators) to C++23 *and* retroactively to C++20.
>>>>>>>>
>>>>>>>> Once we've dispatched D2372R1, we'll return to the original agenda
>>>>>>>> for our last telecon; discussion of P2093R5
>>>>>>>> <https://wg21.link/p2093r5> (Formatted output) and P2295R2
>>>>>>>> <https://wg21.link/p2295r3> (Support for UTF-8 as a portable
>>>>>>>> source file encoding). I've retained P2348R0
>>>>>>>> <https://wg21.link/p2348r0> on the agenda, though I don't expect
>>>>>>>> that we'll get to it.
>>>>>>>>
>>>>>>>> With regard to P2093R5 <https://wg21.link/p2093r5>, the current
>>>>>>>> status is that LEWG has referred the paper back to SG16 for further
>>>>>>>> discussion; please see the LEWG meeting minutes here
>>>>>>>> <https://wiki.edg.com/bin/view/Wg21telecons2021/P2093#Library-Evolution-2021-04-06>.
>>>>>>>> Specifically, LEWG would benefit from additional analysis of previously
>>>>>>>> deferred questions
>>>>>>>> <http://lists.isocpp.org/lib-ext/2021/03/18189.php> regarding
>>>>>>>> character encoding concerns, transcoding requirements (or the lack
>>>>>>>> thereof) and the ensuing consequences (or lack thereof).
>>>>>>>>
>>>>>>>> 1. How errors in transcoding should be handled. E.g., when
>>>>>>>> transcoding from UTF-8 to a UTF-16 based console interface and the UTF-8
>>>>>>>> input is not well-formed.
>>>>>>>> 2. The choice to base behavior on the compile-time choice of
>>>>>>>> literal encoding. An implication of the current proposal is that a program
>>>>>>>> that contains only ASCII characters in string literals will change behavior
>>>>>>>> depending on whether the literal encoding is UTF-8 vs ASCII (or some other
>>>>>>>> ASCII derived encoding).
>>>>>>>> 3. Whether transcoding to the console interface encoding should
>>>>>>>> be performed when the literal encoding is not UTF-8.
>>>>>>>> 4. What the implications are for future support of std::print("{}
>>>>>>>> {} {} {}", L"Wide text", u8"UTF-8 text", u"UTF-16 text",
>>>>>>>> U"UTF-32 text").
>>>>>>>>
>>>>>>>> I think these concerns will be easier to resolve if we first reach
>>>>>>>> consensus regarding scenarios in which localized text may be provided in an
>>>>>>>> unexpected encoding. The following is a slightly modified example of code
>>>>>>>> Hubert previously provided. The example has been modified to explicitly
>>>>>>>> opt into localized chrono formatting.
>>>>>>>>
>>>>>>>> std::print("{:L%p}\n",
>>>>>>>> std::chrono::system_clock::now().time_since_epoch());
>>>>>>>>
>>>>>>>> At issue is the encoding used by locale sensitive chrono
>>>>>>>> formatters. The example above contains the %p specifier and is
>>>>>>>> locale sensitive because AM/PM designations may be localized. In a Chinese
>>>>>>>> locale the desired translation of "PM" is "下午", but the locale will provide
>>>>>>>> the translation in the locale encoding. As specified in P2093R5, if the
>>>>>>>> literal encoding is UTF-8, then std::print() will expect the
>>>>>>>> translation to be provided in UTF-8, but if the locale is not UTF-8-based
>>>>>>>> (e.g., Big5; perhaps Shift-JIS for the Japanese 午後 translation), then the
>>>>>>>> result is mojibake.
>>>>>>>>
>>>>>>>> I had previously suggested the following possible directions we can
>>>>>>>> investigate to resolve the encoding concerns.
>>>>>>>>
>>>>>>>> - Specialize std::locale facets
>>>>>>>> <https://en.cppreference.com/w/cpp/locale/locale> and related
>>>>>>>> I/O manipulators like std::put_time()
>>>>>>>> <https://en.cppreference.com/w/cpp/io/manip/put_time> for
>>>>>>>> char8_t. This would allow std::print() to, when the literal
>>>>>>>> encoding is UTF-8, opt-in to use of the UTF-8/char8_t facets
>>>>>>>> and I/O manipulators.
>>>>>>>> - When the literal encoding is UTF-8, stipulate that running
>>>>>>>> the program in a non-UTF-8 based locale is non-conforming. This would
>>>>>>>> effectively require MSVC programmers, when building code with the
>>>>>>>> /utf-8 option, to also force selection of a UTF-8 code page via
>>>>>>>> a manifest
>>>>>>>> <https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page>
>>>>>>>> and require use of Windows 10 build 1903 or later.
>>>>>>>> - When the literal encoding is UTF-8, specify that non-UTF-8
>>>>>>>> based locale dependent translations be implicitly transcoded (such
>>>>>>>> transcoding should never result in errors except perhaps for memory
>>>>>>>> allocation failures).
>>>>>>>> - Drop the special case handling for the literal encoding being
>>>>>>>> UTF-8 and specify that, when bypassing a stream to write directly to the
>>>>>>>> console, the output be implicitly transcoded from the current locale
>>>>>>>> dependent encoding (whatever it is) to the console encoding (UTF-8).
>>>>>>>>
>>>>>>>> If we get through all of that, we'll review Corentin's updates in
>>>>>>>> P2295R2 <https://wg21.link/p2295r3> to address prior SG16
>>>>>>>> feedback. Thank you to everyone that already provided additional feedback
>>>>>>>> on the mailing list!
>>>>>>>>
>>>>>>>> Tom.
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> SG16 mailing list
>>>>>>>> SG16_at_[hidden]
>>>>>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>>>>>
>>>>>>>
>>>>>>
>>>
>>>
>>
>
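
For reference on question 1 in the quoted agenda (malformed UTF-8 when
transcoding to a UTF-16 based console interface), below is a rough sketch of
replacement-based transcoding. The function name `utf8_to_utf16_lossy` and the
policy of emitting one U+FFFD per rejected byte are assumptions made for
illustration; P2093 does not prescribe this code, and other substitution
policies (for example one replacement per maximal ill-formed subsequence) are
possible.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Decodes `in` as UTF-8 and returns UTF-16. Any byte that does not begin a
// well-formed sequence is replaced by U+FFFD and decoding resumes at the
// next byte, so malformed input never causes a failure.
inline std::u16string utf8_to_utf16_lossy(std::string_view in) {
    // Smallest code point allowed for each sequence length (rejects overlongs).
    static constexpr char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;  // 0 means "not a valid lead byte"
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }

        // The sequence must fit in the input and use only continuation bytes.
        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            ok = (b & 0xC0) == 0x80;
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and values above U+10FFFF.
        if (ok) {
            ok = cp >= min_cp[len] && cp <= 0x10FFFF &&
                 !(cp >= 0xD800 && cp <= 0xDFFF);
        }

        if (!ok) {
            out.push_back(u'\uFFFD');  // substitute and skip one byte
            i += 1;
            continue;
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Encode as a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

Well-formed UTF-8 such as the bytes for "下午" round-trips unchanged, while
text delivered in a non-UTF-8 locale encoding comes out with replacement
characters, which is the visible symptom discussed in the quoted scenarios.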

Received on 2021-05-26 07:55:24