Hi Hubert,

Thanks for the suggestions. I'll try to incorporate them in the next iteration of the paper.

> I think it would help if the point was stated more explicitly ...

Good idea, will clarify this.

>  The paper can at least acknowledge that "polyglot" string literals exist ...

Sure.

> we'll end up with cases where the literal encoding is UTF-8 but the user won't want the UTF-8 std::print behaviour to potentially kick in.

I am a bit skeptical because I haven't seen any reports about cases like this from the extensive usage experience of this feature. We can't fix clearly broken things and be bug-to-bug compatible with legacy APIs at the same time.

> At least two cases come to mind.

I don't think we can do much if users decide to lie about the encoding. We should make the common case work rather than try to make everyone happy and support theoretical use cases not backed by actual implementation and usage experience. That said, they can always use the nonunicode function or continue using their legacy APIs in those cases.

Cheers,
Victor




On Tue, May 11, 2021 at 8:23 PM Hubert Tong <hubert.reinterpretcast@gmail.com> wrote:
On Tue, May 11, 2021 at 8:41 PM Victor Zverovich via SG16 <sg16@lists.isocpp.org> wrote:
Dear Unicoders,

Here is a link to a new revision of P2093: https://isocpp.org/files/papers/D2093R6.html. It's essentially the same as R5 but addresses the latest LEWG feedback and adds a few clarifications. The only change to the wording is replacing <io> with <print>.

Thanks Victor.

With respect to the choice of transcoding, it took me a while to catch on to the statement being made. I think it would help if the point was stated more explicitly: the choice to perform replacement during transcoding is made because that is consistent with the treatment of malformed UTF-8 on UTF-8-native terminals, and the choice not to transcode when the terminal is UTF-8 native is made because we expect the terminal to behave predictably, as if we did do the "transcoding".
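To make "replacement during transcoding" concrete, here is a minimal sketch of a UTF-8 to UTF-16 conversion that substitutes U+FFFD for malformed input. The function name utf8_to_utf16_lossy is hypothetical and is not taken from P2093. The sketch assumes one replacement character per offending byte, which is simpler than the "maximal subpart" practice that some decoders follow, so the number of replacement characters may differ from what a given terminal displays.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical helper (not part of P2093): decode UTF-8 and emit UTF-16.
// Each byte that cannot start or complete a well-formed sequence is
// replaced by U+FFFD, and decoding resumes at the next byte.
std::u16string utf8_to_utf16_lossy(std::string_view in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;   // code point being assembled
        std::uint32_t min = 0;  // smallest code point allowed for this length
        std::size_t len = 0;    // 0 means "invalid lead byte"
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }

        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;  // not a continuation byte
            else cp = (cp << 6) | (b & 0x3Fu);
        }
        // Reject overlong forms, surrogates, and values above U+10FFFF.
        if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;

        if (!ok) {
            out.push_back(static_cast<char16_t>(0xFFFD));
            ++i;
            continue;
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}

// Small self-check of the behaviour described above.
void check_lossy_transcoding() {
    assert(utf8_to_utf16_lossy("abc") == u"abc");
    assert(utf8_to_utf16_lossy("a\xFFz") == u"a\uFFFDz");  // stray byte replaced
}
```

The design choice here is that malformed input never aborts the output: well-formed text on either side of a bad byte still reaches the console.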

I'm still not entirely convinced about the argument surrounding the choice of using the literal encoding though. The paper can at least acknowledge that "polyglot" string literals exist and partially obviate the insistence that the literal encoding being UTF-8 according to the build system/build mode means that the involvement of non-UTF-8 strings in the vicinity of std::print constitutes "mixing encodings".

I really think that, just for predictability surrounding the display of substitution text, we'll end up with cases where the literal encoding is UTF-8 but the user won't want the UTF-8 std::print behaviour to potentially kick in.

At least two cases come to mind:
(1) Printing using both legacy interfaces and std::print where the legacy interfaces are not using UTF-8 may appear fine on some terminals but would result, on redirect, in output with mixed encoding.

(2) std::print where the literal encoding is UTF-8 but the literals are all "polyglot" and substitution strings that are not UTF-8 can appear to be okay when redirecting or printing to non-Unicode terminals; however, once deployed to a Unicode terminal, replacement characters show up (even if the output is properly encoded for the underlying C output interface).
 

Cheers,
Victor

On Tue, May 11, 2021 at 11:02 AM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:
Reminder that this meeting is taking place tomorrow.

Per suggestion by Peter, the agenda order is being changed to review the updates in P2295R2 before D2372R1 and P2093R5 in the hopes that we can forward P2295R2 to EWG.  We'll try to limit that discussion to 30 minutes.  The updated agenda is below.  Again, we are unlikely to get to P2348R0 at all.
Tom.

On 5/4/21 12:06 AM, Tom Honermann via SG16 wrote:

SG16 will hold a telecon on Wednesday, May 12th at 19:30 UTC (timezone conversion).

The agenda is:

Our last telecon was consumed by discussion of LWG3547 and possible remedies.  Though we did not reach consensus on a direction forward during that telecon, Victor and Corentin, at the LEWG chair's request, drafted D2372R0, presented it at the LEWG telecon held 2021-05-03, and LEWG reached strong consensus for it.  The D2372R0 revision will be submitted for the May mailing as P2372R0; and a D2372R1 revision addressing LEWG feedback will be submitted as P2372R1.  Both revisions substantially match the proposed resolution that SG16 discussed.  Since SG16 did not reach consensus on that direction, the LEWG chair has asked that we revisit it to either affirm or rebut the LEWG consensus.  We will therefore (briefly) discuss and then poll that direction.  Note that the poll taken in SG16 differs from the poll taken in LEWG.  In SG16, we polled applying the proposed resolution to C++23 while LEWG polled applying the proposed resolution (with amendments to not change behavior for iostream manipulators) to C++23 *and* retroactively to C++20.

Once we've dispatched D2372R1, we'll return to the original agenda for our last telecon; discussion of P2093R5 (Formatted output) and P2295R2 (Support for UTF-8 as a portable source file encoding).  I've retained P2348R0 on the agenda, though I don't expect that we'll get to it.

With regard to P2093R5, the current status is that LEWG has referred the paper back to SG16 for further discussion; please see the LEWG meeting minutes here.  Specifically, LEWG would benefit from additional analysis of previously deferred questions regarding character encoding concerns, transcoding requirements (or the lack thereof) and the ensuing consequences (or lack thereof).

  1. How errors in transcoding should be handled.  E.g., when transcoding from UTF-8 to a UTF-16 based console interface and the UTF-8 input is not well-formed.
  2. The choice to base behavior on the compile-time choice of literal encoding.  An implication of the current proposal is that a program that contains only ASCII characters in string literals will change behavior depending on whether the literal encoding is UTF-8 vs ASCII (or some other ASCII derived encoding).
  3. Whether transcoding to the console interface encoding should be performed when the literal encoding is not UTF-8.
  4. What the implications are for future support of std::print("{} {} {} {}", L"Wide text", u8"UTF-8 text", u"UTF-16 text", U"UTF-32 text").
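Concern 2 can be illustrated with a short sketch. One known way to detect a UTF-8 literal encoding at compile time is to probe the bytes of a non-ASCII literal. The names literal_encoding_is_utf8, output_path, select_output_path and is_ascii_only below are hypothetical and are not taken from P2093; the sketch only assumes that the selection is made from the build-time encoding, as the concern describes.

```cpp
#include <cassert>
#include <string_view>

// Compile-time probe: U+00B5 MICRO SIGN is the two bytes C2 B5 in UTF-8,
// so the array (including the terminating NUL) has three elements.
constexpr bool literal_encoding_is_utf8() {
    constexpr unsigned char probe[] = "\u00B5";
    return sizeof(probe) == 3 && probe[0] == 0xC2 && probe[1] == 0xB5;
}

// Hypothetical stand-in for the two behaviours under discussion.
enum class output_path { native_unicode, legacy_bytes };

// The path depends only on the build-time literal encoding,
// not on what the program's literals actually contain.
constexpr output_path select_output_path() {
    return literal_encoding_is_utf8() ? output_path::native_unicode
                                      : output_path::legacy_bytes;
}

// True when every byte is in the ASCII range. Such a literal has the same
// bytes under UTF-8 and under any ASCII-derived literal encoding.
constexpr bool is_ascii_only(std::string_view s) {
    for (char c : s) {
        if (static_cast<unsigned char>(c) >= 0x80) return false;
    }
    return true;
}

// Small self-check: an ASCII-only format string does not influence the path.
void check_literal_probe() {
    assert(is_ascii_only("{} {}\n"));
    assert((select_output_path() == output_path::native_unicode) ==
           literal_encoding_is_utf8());
}
```

A format string such as "{} {}\n" passes is_ascii_only, yet select_output_path still changes with the compiler's literal encoding option, which is the behaviour difference that concern 2 points at.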

I think these concerns will be easier to resolve if we first reach consensus regarding scenarios in which localized text may be provided in an unexpected encoding.  The following is a slightly modified example of code Hubert previously provided.  The example has been modified to explicitly opt into localized chrono formatting.

std::print("{:L%p}\n", std::chrono::system_clock::now().time_since_epoch());

At issue is the encoding used by locale sensitive chrono formatters.  The example above contains the %p specifier and is locale sensitive because AM/PM designations may be localized.  In a Chinese locale the desired translation of "PM" is "下午", but the locale will provide the translation in the locale encoding.  As specified in P2093R5, if the literal encoding is UTF-8, then std::print() will expect the translation to be provided in UTF-8, but if the locale is not UTF-8-based (e.g., Big5; perhaps Shift-JIS for the Japanese 午後 translation), then the result is mojibake.
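The mismatch can be checked mechanically. Below is a minimal sketch of a UTF-8 well-formedness test applied to the translated strings. The function name is_well_formed_utf8 is hypothetical. The Big5 and Shift-JIS byte values are assumptions written from memory of those code tables; the conclusion only relies on their first byte falling in the range 0x80 to 0xBF, which can never begin a UTF-8 sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper: returns true when `s` is entirely well-formed UTF-8.
bool is_well_formed_utf8(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::uint32_t cp = 0;   // code point being assembled
        std::uint32_t min = 0;  // smallest code point allowed for this length
        std::size_t len = 0;
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else return false;  // 0x80..0xBF or 0xF8..0xFF cannot start a sequence

        if (i + len > s.size()) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3Fu);
        }
        // Reject overlong forms, surrogates, and values above U+10FFFF.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += len;
    }
    return true;
}

// Small self-check using the translations discussed above.
void check_translations() {
    // "下午" (U+4E0B U+5348) encoded as UTF-8.
    assert(is_well_formed_utf8("\xE4\xB8\x8B\xE5\x8D\x88"));
    // Assumed Big5 bytes for "下午": not well-formed UTF-8.
    assert(!is_well_formed_utf8("\xA4\x55\xA4\xC8"));
    // Assumed Shift-JIS bytes for "午後": not well-formed UTF-8.
    assert(!is_well_formed_utf8("\x8C\xDF\x8C\xE3"));
}
```

If a formatter hands bytes like the second or third string to an output path that expects UTF-8, every such byte sequence is malformed from that path's point of view, so the user sees replacement characters or mojibake rather than the intended text.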

I had previously suggested the following possible directions we can investigate to resolve the encoding concerns.

  • Specialize std::locale facets and related I/O manipulators like std::put_time() for char8_t.  This would allow std::print() to, when the literal encoding is UTF-8, opt-in to use of the UTF-8/char8_t facets and I/O manipulators.
  • When the literal encoding is UTF-8, stipulate that running the program in a non-UTF-8 based locale is non-conforming.  This would effectively require MSVC programmers, when building code with the /utf-8 option, to also force selection of a UTF-8 code page via a manifest and to require use of Windows 10 build 1903 or later.
  • When the literal encoding is UTF-8, specify that non-UTF-8 based locale dependent translations be implicitly transcoded (such transcoding should never result in errors except perhaps for memory allocation failures).
  • Drop the special case handling for the literal encoding being UTF-8 and specify that, when bypassing a stream to write directly to the console, the output be implicitly transcoded from the current locale dependent encoding (whatever it is) to the console encoding (UTF-8).

If we get through all of that, we'll review Corentin's updates in P2295R2 to address prior SG16 feedback.  Thank you to everyone that already provided additional feedback on the mailing list!

Tom.



--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16