sg16: Re: [SG16] Agenda for the 2021-05-12 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 19 May 2021 18:33:46 -0400

On 5/19/21 2:17 PM, Victor Zverovich via SG16 wrote:
> > I'm concerned that deployment experience might be limited to
> specific environments.
>
> The great thing about the current implementation (and proposal) is
> that it is consistent with printf's behavior on common C standard
> library implementations on non-Windows platforms, so we have all the
> deployment experience in the world. This is also why I am reluctant to
> innovate in this area. There has been a lot of usage experience on
> Windows as well and there is much less variation there.

That is a benefit, but I don't think that is strongly relevant to
Hubert's question.

The intent of the proposal is to grant additional permissions for
implementations to alter behavior based on where the output is
directed. Implementation experience for that exists only on Windows,
but the wording is (intentionally) written to be agnostic to
implementation. Thus, we don't have implementation experience for (all
of) this feature outside of Windows at the moment. I understand and
appreciate that the proposal is strongly intended to work around a well
known Windows deficiency, but it does have applicability elsewhere.
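
To make the shape of that permission concrete, here is a minimal sketch of
the decision as I read the proposal. The names OutputPath and
choose_output_path are hypothetical and do not appear in P2093; how an
implementation determines that a stream refers to a Unicode-capable
terminal is deliberately left as an input, since that is the part for
which experience outside of Windows is missing.

```cpp
// Hypothetical sketch; none of these names come from P2093.
enum class OutputPath {
    native_unicode_api,  // write to the terminal via a native Unicode API
    byte_stream          // write the bytes unchanged to the C stream
};

// Both inputs are assumed to be supplied by the implementation: the literal
// encoding is a compile-time property, while whether the stream refers to a
// Unicode-capable terminal is a run-time query that is platform specific.
constexpr OutputPath choose_output_path(bool literal_encoding_is_utf8,
                                        bool stream_is_unicode_terminal) {
    return (literal_encoding_is_utf8 && stream_is_unicode_terminal)
               ? OutputPath::native_unicode_api
               : OutputPath::byte_stream;
}
```

Redirecting output to a file corresponds to the second argument being
false, in which case the bytes pass through untouched.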

>
> > I think the non-Unicode function is awkwardly named.
>
> Naming suggestions are welcome!

  * vprint_unicode()
      o => vprint_utf8()
        "unicode" is ambiguous, but the specification is clear that
        UTF-8 is intended.
      o => u8vprint()
        => vu8print()
        I don't recommend these as they may imply char8_t association.
  * vprint_nonunicode()
      o => vprint_mojibake()
        If we want to be honest.
      o => vprint_polyglot()
        This feels pretty accurate to me.
      o => vprint_narrow()
        This doesn't feel right to me since "narrow" within the standard
        includes UTF-8.

Tom.

>
> Cheers,
> Victor
>
> On Wed, May 12, 2021 at 12:51 PM Hubert Tong
> <hubert.reinterpretcast_at_[hidden]
> <mailto:hubert.reinterpretcast_at_[hidden]>> wrote:
>
> On Wed, May 12, 2021 at 3:14 PM Victor Zverovich
> <victor.zverovich_at_[hidden] <mailto:victor.zverovich_at_[hidden]>>
> wrote:
>
> Hi Hubert,
>
> Thanks for the suggestions, I'll try incorporating them in the
> next iteration of the paper.
>
> > I think it would help if the point was stated more
> explicitly ...
>
> Good idea, will clarify this.
>
> > The paper can at least acknowledge that "polyglot" string
> literals exist ...
>
> Sure.
>
> > we'll end up with cases where the literal encoding is UTF-8
> but the user won't want the UTF-8 std::print behaviour to
> potentially kick in.
>
> I am a bit skeptical because I haven't seen any reports about
> cases like this from the extensive usage experience of this
> feature. We can't fix clearly broken things and be bug-to-bug
> compatible with legacy APIs at the same time.
>
> > At least two cases come to mind.
>
> I don't think we can do much if users decide to lie about the
> encoding. We should make the common case work rather than try
> making everyone happy and support theoretical use cases not
> backed by actual implementation and usage experience.
>
>
> I'm concerned that deployment experience might be limited to
> specific environments. I expect the conditions for the second
> scenario are met very easily on *nix and are also very difficult to
> test for (requires some sort of special test environment/harness).
>
> That said, they can always use the nonunicode function or continue
> using their legacy APIs in those cases.
>
>
> I think the non-Unicode function is awkwardly named.
>
>
> Cheers,
> Victor
>
>
>
>
> On Tue, May 11, 2021 at 8:23 PM Hubert Tong
> <hubert.reinterpretcast_at_[hidden]
> <mailto:hubert.reinterpretcast_at_[hidden]>> wrote:
>
> On Tue, May 11, 2021 at 8:41 PM Victor Zverovich via SG16
> <sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
>
> Dear Unicoders,
>
> Here is a link to a new revision of P2093:
> https://isocpp.org/files/papers/D2093R6.html. It's
> essentially the same as R5 but addresses the latest
> LEWG feedback and adds a few clarifications. The only
> change to the wording is replacing <io> with <print>.
>
>
> Thanks Victor.
>
> With respect to the choice of transcoding, it took me a
> while to catch on to the statement being made. I think it
> would help if the point was stated more explicitly that
> the choice to perform replacement during transcoding is
> because that is consistent with the treatment of malformed
> UTF-8 for UTF-8-native terminals and the choice not to
> transcode in the case where the terminal is UTF-8 native
> is because we expect the terminal to behave predictably
> as if we did do the "transcoding".
>
> I'm still not entirely convinced about the argument
> surrounding the choice of using the literal encoding
> though. The paper can at least acknowledge that "polyglot"
> string literals exist and partially obviate the
> insistence that the literal encoding being UTF-8 according
> to the build system/build mode means that the involvement
> of non-UTF-8 strings in the vicinity of std::print
> constitutes "mixing encodings".
>
> I really think that, just for predictability surrounding
> the display of substitution text, we'll end up with cases
> where the literal encoding is UTF-8 but the user won't
> want the UTF-8 std::print behaviour to potentially kick in.
>
> At least two cases come to mind:
> (1) Printing using both legacy interfaces and std::print
> where the legacy interfaces are not using UTF-8 may appear
> fine on some terminals but would result, on redirect, in
> output with mixed encoding.
>
> (2) std::print where the literal encoding is UTF-8 but the
> literals are all "polyglot" and substitution strings that
> are not UTF-8 can appear to be okay when redirecting or
> printing to non-Unicode terminals; however, once deployed
> to a Unicode terminal, replacement characters show up
> (even if the output is properly encoded for the underlying
> C output interface).
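
For readers following the "polyglot" argument, a minimal sketch of one
possible reading of the term: a literal is treated as polyglot if it
contains only bytes in the ASCII range, so that it is encoded identically
in UTF-8 and in encodings that agree with ASCII for that range. That
reading, and the name is_polyglot, are assumptions made for illustration
and are not taken from the paper.

```cpp
#include <string_view>

// Returns true if every byte is in the range 0x00-0x7F. This is the assumed
// meaning of "polyglot" here: the text has the same bytes in UTF-8 and in
// any encoding that matches ASCII for those values.
constexpr bool is_polyglot(std::string_view text) {
    for (unsigned char byte : text) {
        if (byte >= 0x80) return false;
    }
    return true;
}
```

A format string such as "Hello, {}!" passes this check even when the
literal encoding is UTF-8, which is why the literal encoding alone says
little about the encoding of the substituted arguments.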
>
>
> Cheers,
> Victor
>
> On Tue, May 11, 2021 at 11:02 AM Tom Honermann via
> SG16 <sg16_at_[hidden]
> <mailto:sg16_at_[hidden]>> wrote:
>
> Reminder that this meeting is taking place tomorrow.
>
> Per suggestion by Peter, the agenda order is being
> changed to review the updates in P2295R2 before
> D2372R1 and P2093R5 in the hopes that we can
> forward P2295R2 to EWG. We'll try to limit that
> discussion to 30 minutes. The updated agenda is
> below. Again, we are unlikely to get to P2348R0
> at all.
>
> * P2295R2: Support for UTF-8 as a portable
> source file encoding <https://wg21.link/p2295r3>
> o Review updates intended to address prior
> SG16 feedback.
> * D2372R1: Fixing locale handling in chrono
> formatters
> <https://isocpp.org/files/papers/D2372R1.html>
> o Affirm or rebut LEWG's position.
> * P2093R5: Formatted output
> <https://wg21.link/p2093r5>
> o Discuss locale dependent character
> encoding concerns.
> * P2348R0: Whitespaces Wording Revamp
> <https://wg21.link/p2348r0>
>
> Tom.
>
> On 5/4/21 12:06 AM, Tom Honermann via SG16 wrote:
>>
>> SG16 will hold a telecon on Wednesday, May 12th
>> at 19:30 UTC (timezone conversion
>> <https://www.timeanddate.com/worldclock/converter.html?iso=20210512T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>).
>>
>> The agenda is:
>>
>> * D2372R1: Fixing locale handling in chrono
>> formatters
>> <https://isocpp.org/files/papers/D2372R1.html>
>> o Affirm or rebut LEWG's position.
>> * P2093R5: Formatted output
>> <https://wg21.link/p2093r5>
>> o Discuss locale dependent character
>> encoding concerns.
>> * P2295R2: Support for UTF-8 as a portable
>> source file encoding <https://wg21.link/p2295r3>
>> o Review updates intended to address prior
>> SG16 feedback.
>> * P2348R0: Whitespaces Wording Revamp
>> <https://wg21.link/p2348r0>
>>
>> Our last telecon was consumed by discussion
>> <https://github.com/sg16-unicode/sg16-meetings/blob/master/README.md#april-28th-2021>
>> of LWG3547
>> <https://cplusplus.github.io/LWG/issue3547> and
>> possible remedies. Though we did not reach
>> consensus on a direction forward during that
>> telecon, Victor and Corentin, at the LEWG chair's
>> request, drafted D2372R0, presented it at the
>> LEWG telecon held 2021-05-03
>> <https://wiki.edg.com/bin/view/Wg21telecons2021/P2372#2021-05-03>,
>> and LEWG reached strong consensus for it. The
>> D2372R0 revision will be submitted for the May
>> mailing as P2372R0; and a D2372R1
>> <https://isocpp.org/files/papers/D2372R1.html>
>> revision addressing LEWG feedback will be
>> submitted as P2372R1. Both revisions
>> substantially match the proposed resolution that
>> SG16 discussed. Since SG16 did not reach
>> consensus on that direction, the LEWG chair has
>> asked that we revisit it to either affirm or
>> rebut the LEWG consensus. We will therefore
>> (briefly) discuss and then poll that direction.
>> Note that the poll taken in SG16 differs from the
>> poll taken in LEWG. In SG16, we polled applying
>> the proposed resolution to C++23 while LEWG
>> polled applying the proposed resolution (with
>> amendments to not change behavior for iostream
>> manipulators) to C++23 *and* retroactively to C++20.
>>
>> Once we've dispatched D2372R1, we'll return to
>> the original agenda for our last telecon;
>> discussion of P2093R5 <https://wg21.link/p2093r5>
>> (Formatted output) and P2295R2
>> <https://wg21.link/p2295r3> (Support for UTF-8 as
>> a portable source file encoding). I've retained
>> P2348R0 <https://wg21.link/p2348r0> on the
>> agenda, though I don't expect that we'll get to it.
>>
>> With regard to P2093R5
>> <https://wg21.link/p2093r5>, the current status
>> is that LEWG has referred the paper back to SG16
>> for further discussion; please see the LEWG
>> meeting minutes here
>> <https://wiki.edg.com/bin/view/Wg21telecons2021/P2093#Library-Evolution-2021-04-06>.
>> Specifically, LEWG would benefit from additional
>> analysis of previously deferred questions
>> <http://lists.isocpp.org/lib-ext/2021/03/18189.php>
>> regarding character encoding concerns,
>> transcoding requirements (or the lack thereof)
>> and the ensuing consequences (or lack thereof).
>>
>> 1. How errors in transcoding should be handled.
>> E.g., when transcoding from UTF-8 to a UTF-16
>> based console interface and the UTF-8 input
>> is not well-formed.
>> 2. The choice to base behavior on the
>> compile-time choice of literal encoding. An
>> implication of the current proposal is that a
>> program that contains only ASCII characters
>> in string literals will change behavior
>> depending on whether the literal encoding is
>> UTF-8 vs ASCII (or some other ASCII derived
>> encoding).
>> 3. Whether transcoding to the console interface
>> encoding should be performed when the literal
>> encoding is not UTF-8.
>> 4. What the implications are for future support
>> of std::print("{} {} {}{}", L"Wide text",
>> u8"UTF-8 text", u"UTF-16 text", U"UTF-32 text").
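
As an illustration of question 1, the following is a minimal sketch of
UTF-8 to UTF-16 transcoding that substitutes U+FFFD for input that is not
well-formed. The function names are hypothetical, and the number of
replacement characters emitted per malformed sequence is a policy choice;
this sketch makes one simple choice and does not claim to match what
P2093 or any console API specifies.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Decodes one code point starting at text[i] and advances i past the bytes
// consumed. Returns U+FFFD for any sequence that is not well-formed.
inline char32_t decode_one(std::string_view text, std::size_t& i) {
    constexpr char32_t replacement = 0xFFFD;
    const auto byte = [&](std::size_t k) {
        return static_cast<unsigned char>(text[k]);
    };
    const unsigned char lead = byte(i++);
    if (lead < 0x80) return lead;  // ASCII

    int trailing = 0;   // number of continuation bytes expected
    char32_t cp = 0;    // code point accumulated so far
    char32_t min = 0;   // smallest value allowed for this length
    if (lead >= 0xC2 && lead <= 0xDF) { trailing = 1; cp = lead & 0x1F; min = 0x80; }
    else if (lead >= 0xE0 && lead <= 0xEF) { trailing = 2; cp = lead & 0x0F; min = 0x800; }
    else if (lead >= 0xF0 && lead <= 0xF4) { trailing = 3; cp = lead & 0x07; min = 0x10000; }
    else return replacement;  // stray continuation byte or invalid lead byte

    for (int n = 0; n < trailing; ++n) {
        // A missing or non-continuation byte ends the sequence; that byte is
        // left in place so it is decoded on the next call.
        if (i >= text.size() || (byte(i) & 0xC0) != 0x80) return replacement;
        cp = (cp << 6) | (byte(i++) & 0x3F);
    }
    // Reject overlong forms, surrogates, and values beyond U+10FFFF.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return replacement;
    return cp;
}

// Transcodes UTF-8 to UTF-16, replacing malformed input with U+FFFD.
inline std::u16string utf8_to_utf16_lossy(std::string_view text) {
    std::u16string out;
    std::size_t i = 0;
    while (i < text.size()) {
        char32_t cp = decode_one(text, i);
        if (cp >= 0x10000) {
            // Encode as a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

With this sketch, the input bytes 'a', 0xFF, 'b' come out as 'a', U+FFFD,
'b'. An alternative policy would be to report an error instead of
substituting, which is exactly the choice question 1 asks us to make.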
>>
>> I think these concerns will be easier to resolve
>> if we first reach consensus regarding scenarios
>> in which localized text may be provided in an
>> unexpected encoding. The following is a slightly
>> modified example of code Hubert previously
>> provided. The example has been modified to
>> explicitly opt into localized chrono formatting.
>>
>> std::print("{:L%p}\n",
>> std::chrono::system_clock::now().time_since_epoch());
>>
>> At issue is the encoding used by locale sensitive
>> chrono formatters. The example above contains the
>> %p specifier and is locale sensitive because
>> AM/PM designations may be localized. In a
>> Chinese locale the desired translation of "PM" is
>> "下午", but the locale will provide the translation
>> in the locale encoding. As specified in P2093R5,
>> if the literal encoding is UTF-8, then
>> std::print() will expect the translation to be
>> provided in UTF-8, but if the locale is not
>> UTF-8-based (e.g., Big5; perhaps Shift-JIS for
>> the Japanese 午後 translation), then the result is
>> mojibake.
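
The mismatch described above can be sketched at the byte level. The check
below is a deliberately weak structural test (lead and continuation byte
patterns only; it ignores overlong forms and surrogates), and the Big5
byte values are an assumption on my part. The point does not depend on
their exact identity: bytes produced by a non-UTF-8 locale generally do
not form valid UTF-8 sequences. All names are hypothetical.

```cpp
#include <cstddef>
#include <string_view>

// Minimal structural check of lead/continuation byte patterns. This is
// weaker than a full UTF-8 validator and is for illustration only.
constexpr bool looks_like_utf8(std::string_view bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto lead = static_cast<unsigned char>(bytes[i++]);
        std::size_t trailing = 0;
        if (lead < 0x80) trailing = 0;
        else if ((lead & 0xE0) == 0xC0) trailing = 1;
        else if ((lead & 0xF0) == 0xE0) trailing = 2;
        else if ((lead & 0xF8) == 0xF0) trailing = 3;
        else return false;  // continuation or invalid byte in lead position
        for (; trailing > 0; --trailing) {
            if (i >= bytes.size()) return false;  // truncated sequence
            if ((static_cast<unsigned char>(bytes[i++]) & 0xC0) != 0x80) return false;
        }
    }
    return true;
}

// U+4E0B U+5348 encoded as UTF-8.
constexpr std::string_view pm_utf8 = "\xE4\xB8\x8B\xE5\x8D\x88";
// Assumed Big5 bytes for the same two characters (not verified here).
constexpr std::string_view pm_big5 = "\xA4\x55\xA4\xC8";
```

Here looks_like_utf8(pm_utf8) holds and looks_like_utf8(pm_big5) does
not, so a std::print() that assumes UTF-8 would have to either substitute
replacement characters or pass the bytes through to a console that
misinterprets them.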
>>
>> I had previously suggested the following possible
>> directions we can investigate to resolve the
>> encoding concerns.
>>
>> * Specialize std::locale facets
>> <https://en.cppreference.com/w/cpp/locale/locale>
>> and related I/O manipulators like
>> std::put_time()
>> <https://en.cppreference.com/w/cpp/io/manip/put_time>
>> for char8_t. This would allow std::print()
>> to, when the literal encoding is UTF-8,
>> opt-in to use of the UTF-8/char8_t facets and
>> I/O manipulators.
>> * When the literal encoding is UTF-8, stipulate
>> that running the program in a non-UTF-8 based
>> locale is non-conforming. This would
>> effectively require MSVC programmers, when
>> building code with the /utf-8 option, to also
>> force selection of a UTF-8 code page via a
>> manifest
>> <https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page>
>> and require use of Windows 10 build 1903 or
>> later.
>> * When the literal encoding is UTF-8, specify
>> that non-UTF-8 based locale dependent
>> translations be implicitly transcoded (such
>> transcoding should never result in errors
>> except perhaps for memory allocation failures).
>> * Drop the special case handling for the
>> literal encoding being UTF-8 and specify
>> that, when bypassing a stream to write
>> directly to the console, the output be
>> implicitly transcoded from the current locale
>> dependent encoding (whatever it is) to the
>> console encoding (UTF-8).
>>
>> If we get through all of that, we'll review
>> Corentin's updates in P2295R2
>> <https://wg21.link/p2295r3> to address prior SG16
>> feedback. Thank you to everyone that already
>> provided additional feedback on the mailing list!
>>
>> Tom.
>>
>>
>
> --
> SG16 mailing list
> SG16_at_[hidden] <mailto:SG16_at_[hidden]>
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
>
>

Received on 2021-05-19 17:33:52