On Tue, May 25, 2021 at 3:08 PM Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

Reminder that this meeting is taking place tomorrow. The agenda remains the same.

Tom.

On 5/16/21 5:23 PM, Tom Honermann via SG16 wrote:

SG16 will hold a telecon on Wednesday, May 26th at 19:30 UTC (timezone conversion).

The agenda is:

P2295R3: Support for UTF-8 as a portable source file encoding

Review updates intended to address prior SG16 feedback.

P2093R6: Formatted output

Discuss locale dependent character encoding concerns.

Since we did not get to discuss P2295R3 at our last telecon, it will again retain the top spot on the agenda followed by P2093R6. Thus, the agenda looks much the same as for the last telecon (I dropped P2348R0 for now; we won't realistically get to it).

I will try to be there, no promise though.

Btw I would love feedback on P2348. There is little but wording in this paper, so mail might be as good an avenue for such feedback, or a better one :)

With regard to P2093R6, the current status is unchanged; LEWG has referred the paper back to SG16 for further discussion; please see the LEWG meeting minutes here. Specifically, LEWG would benefit from additional analysis of previously deferred questions regarding character encoding concerns, transcoding requirements (or the lack thereof), and the ensuing consequences (or lack thereof).

How errors in transcoding should be handled. E.g., when transcoding from UTF-8 to a UTF-16 based console interface and the UTF-8 input is not well-formed.

The choice to base behavior on the compile-time choice of literal encoding. An implication of the current proposal is that a program that contains only ASCII characters in string literals will change behavior depending on whether the literal encoding is UTF-8 vs ASCII (or some other ASCII derived encoding).

Whether transcoding to the console interface encoding should be performed when the literal encoding is not UTF-8.

What the implications are for future support of std::print("{} {} {}{}", L"Wide text", u8"UTF-8 text", u"UTF-16 text", U"UTF-32 text").
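To make the first question concrete, here is a minimal sketch of one possible error policy: transcode UTF-8 to UTF-16 and substitute U+FFFD for every byte that cannot begin a well-formed sequence. The function name utf8_to_utf16_lossy is hypothetical and nothing here is taken from P2093R6; throwing, truncating, or replacing a whole maximal subpart with a single U+FFFD are equally plausible policies.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not part of P2093R6): transcode UTF-8 to UTF-16,
// emitting U+FFFD for each byte that cannot start a well-formed sequence.
std::u16string utf8_to_utf16_lossy(std::string_view in) {
    constexpr char16_t replacement = 0xFFFD;
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const auto b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;      // code point being decoded
        char32_t min_cp = 0;  // smallest code point allowed for this length
        std::size_t len = 0;  // 0 means "invalid lead byte"
        if (b0 < 0x80)                     { cp = b0;        len = 1; min_cp = 0x0; }
        else if (b0 >= 0xC2 && b0 <= 0xDF) { cp = b0 & 0x1F; len = 2; min_cp = 0x80; }
        else if (b0 >= 0xE0 && b0 <= 0xEF) { cp = b0 & 0x0F; len = 3; min_cp = 0x800; }
        else if (b0 >= 0xF0 && b0 <= 0xF4) { cp = b0 & 0x07; len = 4; min_cp = 0x10000; }

        // The sequence must fit in the input and use only continuation bytes.
        bool ok = len != 0 && in.size() - i >= len;
        for (std::size_t k = 1; ok && k < len; ++k) {
            const auto b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and values above U+10FFFF.
        if (ok && (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;

        if (!ok) {  // skip one byte and resynchronize
            out.push_back(replacement);
            ++i;
            continue;
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {  // encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

With this policy the call never fails on ill-formed input, at the cost of silently altering what is written to the console.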

At our last telecon, we focused on how to handle ill-formed inputs, but did not much discuss how such inputs arise. Now that LWG3547 has been effectively (though not officially) resolved by P2372R1, we have a concrete example of how the std::print() facility itself can produce ill-formed input (assuming that std::print() transcodes all inputs using the same encoding). I would like to start with this example as I think it is fundamental to how we choose to answer the above questions.

std::print("{:L%p}\n", std::chrono::system_clock::now().time_since_epoch());

At issue is the encoding used by chrono formatters specified with the L option to request a locale-specific form. The example above contains the %p specifier with the L option. In a Chinese locale the desired translation of "PM" is "下午", but the locale will provide the translation in the locale encoding. As specified in P2093R6, if the literal encoding is UTF-8, then std::print() will expect the translation to be provided in UTF-8, but if the locale is not UTF-8-based (e.g., Big5; perhaps Shift-JIS for the Japanese 午後 translation), then the result is mojibake.
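The mismatch can be illustrated with a small well-formedness check, sketched here under assumptions. The function is_well_formed_utf8 is a hypothetical helper, not something std::print() is specified to call, and the byte pair 0xA4 0x55 for 下 in Big5 is quoted as an assumption that should be verified against a Big5 table. The check follows the well-formed byte ranges given in the Unicode Standard.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns true if `s` is well-formed UTF-8
// (no overlong forms, no surrogates, nothing above U+10FFFF).
bool is_well_formed_utf8(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const auto b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the second byte
        if (b0 <= 0x7F) {
            len = 1;
        } else if (b0 >= 0xC2 && b0 <= 0xDF) {
            len = 2;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            len = 3;
            if (b0 == 0xE0) lo = 0xA0;  // excludes overlong forms
            if (b0 == 0xED) hi = 0x9F;  // excludes surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            len = 4;
            if (b0 == 0xF0) lo = 0x90;  // excludes overlong forms
            if (b0 == 0xF4) hi = 0x8F;  // excludes values above U+10FFFF
        } else {
            return false;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence
        }
        if (s.size() - i < len) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(s[i + k]);
            const unsigned char min = (k == 1) ? lo : 0x80;
            const unsigned char max = (k == 1) ? hi : 0xBF;
            if (b < min || b > max) return false;
        }
        i += len;
    }
    return true;
}
```

Under these assumptions, is_well_formed_utf8("\xE4\xB8\x8B\xE5\x8D\x88") (下午 in UTF-8) returns true, while is_well_formed_utf8("\xA4\x55") (the assumed Big5 bytes for 下) returns false because 0xA4 cannot begin a UTF-8 sequence.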

These are possible directions we can investigate to resolve the encoding concerns.

Specialize std::locale facets and related I/O manipulators like std::put_time() for char8_t. This would allow std::print() to, when the literal encoding is UTF-8, opt in to use of the UTF-8/char8_t facets and I/O manipulators.

When the literal encoding is UTF-8, stipulate that running the program in a non-UTF-8 based locale is non-conforming. This would effectively require MSVC programmers who build code with the /utf-8 option to also force selection of a UTF-8 code page via a manifest and to require use of Windows 10 version 1903 or later.

When the literal encoding is UTF-8, specify that non-UTF-8 based locale dependent translations be implicitly transcoded (such transcoding should never result in errors except perhaps for memory allocation failures).

Drop the special case handling for the literal encoding being UTF-8 and specify that, when bypassing a stream to write directly to the console, the output be implicitly transcoded from the current locale dependent encoding (whatever it is) to the console encoding (UTF-8).
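The first three directions are conditioned on the literal encoding being UTF-8, and the fourth removes that condition. For reference, here is a sketch of one commonly used way for a program to probe that condition at compile time: encode a non-ASCII character in an ordinary string literal and inspect the resulting bytes. The function names are hypothetical, and this is not necessarily the mechanism an implementation of P2093R6 would use.

```cpp
#include <cstddef>

// Hypothetical helper: true if the given bytes (size includes the
// terminating NUL) are the UTF-8 encoding of U+00B5 MICRO SIGN.
constexpr bool encodes_micro_sign_as_utf8(const unsigned char* bytes,
                                          std::size_t size_with_nul) {
    return size_with_nul == 3 && bytes[0] == 0xC2 && bytes[1] == 0xB5;
}

// Probes the ordinary literal encoding chosen at compile time.
// Assumption: the literal encoding can represent U+00B5 at all.
constexpr bool literal_encoding_is_utf8() {
    constexpr unsigned char probe[] = "\u00B5";
    return encodes_micro_sign_as_utf8(probe, sizeof(probe));
}
```

Because the result is fixed at compile time, a program whose literals contain only ASCII characters still takes a different path depending on this probe, which is the behavior change raised in the questions above.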

Tom.

--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16