ISOCPP sg16 List: Agenda for the 2022-07-27 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 21 Jul 2022 18:46:02 -0400

SG16 will hold a telecon on Wednesday, July 27th, at 19:30 UTC (timezone
conversion
<https://www.timeanddate.com/worldclock/converter.html?iso=20220727T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>).

Please note that this message is being sent to the WG14 mailing list.

Interested WG14 members are encouraged to attend this meeting. A
calendar event (.ics) file containing the meeting details is attached.
Alternatively, meeting details can be found here
<https://documents.isocpp.org/index.php/apps/calendar/p/R7imgS2LJD9xfeWN/dayGridMonth/now/view/sidebar/L3JlbW90ZS5waHAvZGF2L3B1YmxpYy1jYWxlbmRhcnMvUjdpbWdTMkxKRDl4ZmVXTi81QkE1NTVFRC0xOTFCLTRERUQtQUFFMi01Q0Q1OTQwMDM4NjYuaWNz/1658950200>.

The agenda is:

  * WG14 N3016: Unicode Length Modifiers v3
    <https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3016.pdf>

The linked paper proposes additional length modifiers (U8, U16, and U32)
for the printf() and scanf() family of functions that enable them to
write and read UTF-8, UTF-16, and UTF-32 encoded text in char8_t,
char16_t, and char32_t based storage via conversion from/to the (locale
sensitive) execution encoding (consistent with conversions that are
performed for text in wchar_t based storage). For example:

    printf("From a UTF-8 string: %U8s\n", u8"text");
    printf("From a UTF-16 character: %U16c\n", u'X');
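For comparison, a rough sketch of how a program can perform this kind of conversion by hand today, using c16rtomb() from <uchar.h> to convert UTF-16 text to the locale sensitive execution encoding. The helper name utf16_to_execution() and its buffer-handling conventions are hypothetical, chosen for illustration only; this is not the mechanism specified by N3016.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper (not part of N3016 or any standard library):
 * converts a null-terminated UTF-16 string to the multibyte encoding of
 * the current LC_CTYPE locale. Returns the number of bytes written, not
 * counting the terminating null, or (size_t)-1 if a character cannot be
 * converted or dst is too small. */
size_t utf16_to_execution(char *dst, size_t dst_size, const char16_t *src)
{
    mbstate_t state;
    char tmp[MB_LEN_MAX];
    size_t used = 0;
    size_t n;

    if (dst_size == 0)
        return (size_t)-1;
    memset(&state, 0, sizeof state);

    for (; *src != 0; ++src) {
        /* Returns 0 when a leading surrogate is only recorded in state. */
        n = c16rtomb(tmp, *src, &state);
        if (n == (size_t)-1)
            return (size_t)-1;          /* not representable in this locale */
        if (n > dst_size - 1 - used)
            return (size_t)-1;          /* no room left for the terminator */
        memcpy(dst + used, tmp, n);
        used += n;
    }

    /* Converting the null character emits any shift sequence needed to
     * return to the initial state, followed by the terminating null. */
    n = c16rtomb(tmp, 0, &state);
    if (n == (size_t)-1 || n > dst_size - used)
        return (size_t)-1;
    memcpy(dst + used, tmp, n);
    return used + n - 1;
}
```

A caller would pass the resulting buffer to printf() with %s. Without a prior call to setlocale(), the program runs in the "C" locale, where typically only basic characters such as ASCII can be converted.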

WG14 discussed the paper during their committee meeting this week but
declined to adopt it for C23 due to general concerns about encoding
issues, a desire to consider the larger design space, and dependencies
on text conversion facilities not currently required by the C standard.
The encoding concerns match those we've discussed before and underscore
the reasons that none of std::format(), std::print(), or C++ iostreams
support output from UTF encoded text in char8_t, char16_t, and char32_t
based storage.

Consider the following code, which relies on the text conversion support
that is already required for wide strings (the contents of ws will be
converted to the locale sensitive execution encoding).

    wchar_t ws[] = L"...";
    printf("<text>: %ls\n", ws);

Programmers using an implementation that encodes string literals as
UTF-8 will most likely expect the example to produce UTF-8 output
regardless of the execution encoding associated with the run-time
locale. However, if run in an environment that uses a locale with a
different encoding (e.g., Windows-1252, as is the common case for Windows
machines located in the United States), the output will contain a mix of
UTF-8 and non-UTF-8 encoded text: the bytes of the format string are
written unchanged, while the contents of ws are converted to the
encoding of the locale.

The same problem occurs for C++ with:

    std::cout << "<text>: " << ws << "\n";
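The locale dependence of the %ls conversion can be observed with a small C sketch along the following lines. The helper name format_wide_in_locale() is hypothetical, and the locale names accepted by setlocale() other than "C" and "" are implementation-defined, so treat this as an illustration under those assumptions rather than a portable test.

```c
#include <locale.h>
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: selects the LC_CTYPE locale named by locale_name,
 * then formats ws with %ls into dst. The literal part of the format
 * string is copied byte for byte; only the wide string is converted,
 * using the encoding of the selected locale.
 * Returns the value of snprintf(), or -1 if the locale is unavailable.
 * Note: this changes the global locale and does not restore it. */
int format_wide_in_locale(const char *locale_name, char *dst,
                          size_t dst_size, const wchar_t *ws)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    /* A negative result indicates that a character in ws could not be
     * converted to the encoding of the selected locale. */
    return snprintf(dst, dst_size, "<text>: %ls\n", ws);
}
```

Calling the helper with the same wide string but different locale names yields different byte sequences for the converted portion (or a failure), while the bytes that come from the format string stay the same.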

WG21 has so far avoided these concerns with regard to char8_t, char16_t,
and char32_t; no support is currently provided for formatting text in
storage of these types with any of std::format(), std::print(), or
iostreams. This limits the usability of these types and portable support
for UTF encoded text in general.

In this meeting, we'll discuss these concerns and the design space for
improving the situation. Some items to consider:

1. When designing std::format() and std::print(), WG21 has chosen to
    ignore the locale dependent execution encoding in several situations
    when the encoding of string literals is known to be UTF-8.
2. Many implementations offer text conversion facilities as part of
    their I/O environment:
     1. Microsoft's fopen() implementation allows a file to be opened as
        text with a specified encoding. From Microsoft's documentation
        <https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/fopen-wfopen?view=msvc-170>:

            FILE *fp = fopen("newfile.txt", "rt+, ccs=UTF-8");

     2. GNU libc's fopen() implementation similarly allows an encoding
        to be associated with a stream via a ",ccs=" suffix in the mode
        string. See Linux documentation
        <https://man7.org/linux/man-pages/man3/fopen.3.html>.
     3. IBM's z/OS allows an encoding to be associated with a file as a
        filesystem attribute. See IBM's chtag documentation
        <https://www.ibm.com/docs/en/zos/2.3.0?topic=descriptions-chtag-change-file-tag-information>.
        z/OS also supports associating an encoding and enabling
        conversions for file streams, but I wasn't able to find
        documentation just now.

Tom.

Received on 2022-07-21 22:46:04