ISOCPP sg16 List: Re: [SC22WG14.22383] Agenda for the 2022-07-27 SG16 telecon

From: Marcus Johnson <marcusljohnson1991_at_[hidden]>
Date: Mon, 25 Jul 2022 18:05:14 -0400

Hey Tom, here's the latest version with the feedback from WG14: https://drive.google.com/file/d/1_cMp-Td_1t0GfwngKN9a8s8hWGJSeASt/view?usp=sharing

> On Jul 21, 2022, at 6:46 PM, Tom Honermann <tom_at_[hidden]> wrote:
>
> SG16 will hold a telecon on Wednesday, July 27th, at 19:30 UTC (timezone conversion <https://www.timeanddate.com/worldclock/converter.html?iso=20220727T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>).
>
> Please note that this message is being sent to the WG14 mailing list.
>
> Interested WG14 members are encouraged to attend this meeting. A calendar event (.ics) file containing the meeting details is attached. Alternatively, meeting details can be found here <https://documents.isocpp.org/index.php/apps/calendar/p/R7imgS2LJD9xfeWN/dayGridMonth/now/view/sidebar/L3JlbW90ZS5waHAvZGF2L3B1YmxpYy1jYWxlbmRhcnMvUjdpbWdTMkxKRDl4ZmVXTi81QkE1NTVFRC0xOTFCLTRERUQtQUFFMi01Q0Q1OTQwMDM4NjYuaWNz/1658950200>.
>
> The agenda is:
>
> WG14 N3016: Unicode Length Modifiers v3 <https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3016.pdf>
> The linked paper proposes additional length modifiers (U8, U16, and U32) for the printf() and scanf() families of functions. The modifiers enable these functions to write and read UTF-8, UTF-16, and UTF-32 encoded text in char8_t, char16_t, and char32_t based storage by converting from/to the (locale-sensitive) execution encoding, consistent with the conversions already performed for text in wchar_t based storage. For example:
>
> printf("From a UTF-8 string: %U8s\n", u8"text");
> printf("From a UTF-16 character: %U16c\n", u'X');
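For comparison, here is a minimal sketch of the conversion that code has to do by hand today, using C11's c16rtomb() from <uchar.h> to go from UTF-16 to the locale-sensitive execution encoding. The helper name utf16_to_execution is hypothetical, and this only approximates what a %U16s conversion might do; N3016 itself defines the proposed behavior.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper (not part of N3016 or any library).
 * Converts a NUL-terminated UTF-16 string to the multibyte encoding of
 * the current LC_CTYPE locale and stores the NUL-terminated result in out.
 * Returns the number of bytes written (excluding the NUL), or (size_t)-1
 * if a code unit cannot be converted or the buffer is too small.
 * An unpaired high surrogate at the end of the input is silently dropped. */
size_t utf16_to_execution(char *out, size_t out_size, const char16_t *in)
{
    mbstate_t state;
    char tmp[MB_LEN_MAX];
    size_t used = 0;

    if (out_size == 0)
        return (size_t)-1;
    memset(&state, 0, sizeof state);

    for (; *in != 0; ++in) {
        /* For a high surrogate, c16rtomb() returns 0 and remembers it
         * in state; the bytes come out with the low surrogate. */
        size_t n = c16rtomb(tmp, *in, &state);
        if (n == (size_t)-1)
            return (size_t)-1;      /* not representable in this locale */
        if (used + n >= out_size)
            return (size_t)-1;      /* no room for the bytes plus the NUL */
        memcpy(out + used, tmp, n);
        used += n;
    }
    out[used] = '\0';
    return used;
}
```

The result can then be passed to printf("%s") as usual. Whether non-ASCII text survives depends entirely on the run-time locale, which is the same dependency the proposed modifiers would have.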
>
> WG14 discussed the paper during their committee meeting this week but declined to adopt it for C23 due to general concerns about encoding issues, a desire to consider the larger design space, and dependencies on text conversion facilities not currently required by the C standard. The encoding concerns match those we've discussed before and underscore the reasons that none of std::format(), std::print(), or C++ iostreams support output from UTF encoded text in char8_t, char16_t, and char32_t based storage.
>
> Consider the following code and the existing text conversion support currently required for wide strings (the contents of ws will be converted to the locale sensitive execution encoding).
>
> wchar_t ws[] = L"...";
> printf("<text>: %ls\n", ws);
>
> Programmers using an implementation that encodes string literals as UTF-8 will most likely expect the example to produce UTF-8 output regardless of the execution encoding associated with the run-time locale. However, if run in an environment that uses a locale with a different encoding (e.g., Windows-1252 as is the common case for Windows machines located in the United States), then the output will contain a mix of UTF-8 and non-UTF-8 encoded text.
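A small sketch of where the two encodings come from, assuming a hosted C implementation. The helper name format_labelled is hypothetical. The bytes of the narrow label are copied to the output unchanged (so they stay in whatever encoding the string literal had), while the wide string goes through the %ls conversion to the encoding of the current LC_CTYPE locale.

```c
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper for illustration only.
 * Writes "<label>: <ws>\n" into out.
 *   - label: narrow bytes, passed through untouched by %s
 *   - ws:    wide characters, converted by %ls using the current locale
 * Returns what snprintf() returns: the length that was (or would have
 * been) written, or a negative value if the wide-to-multibyte
 * conversion fails in the current locale. */
int format_labelled(char *out, size_t out_size,
                    const char *label, const wchar_t *ws)
{
    return snprintf(out, out_size, "%s: %ls\n", label, ws);
}
```

With ASCII-only arguments the two paths agree. The mismatch only shows up once both arguments contain non-ASCII characters and the program has selected a locale (for example via setlocale(LC_ALL, "")) whose encoding differs from the literal encoding; the exact output in that case is platform-dependent.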
>
> The same problem occurs for C++ with:
>
> std::cout << "<text>: " << ws << "\n";
>
> WG21 has so far avoided these concerns with regard to char8_t, char16_t, and char32_t; no support is currently provided for formatting text in storage of these types with any of std::format(), std::print(), or iostreams. This limits the usability of these types and portable support for UTF encoded text in general.
>
> In this meeting, we'll discuss these concerns and the design space for improving the situation. Some items to consider:
>
> - When designing std::format() and std::print(), WG21 has chosen to ignore the locale-dependent execution encoding in several situations when the encoding of string literals is known to be UTF-8.
> - Many implementations offer text conversion facilities as part of their I/O environment:
>   - Microsoft's fopen() implementation allows a file to be opened as text with a specified encoding. From Microsoft's documentation <https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/fopen-wfopen?view=msvc-170>:
>       FILE *fp = fopen("newfile.txt", "rt+, ccs=UTF-8");
>   - GNU libc's fopen() implementation similarly allows an associated encoding via a "ccs" mode string. See Linux documentation <https://man7.org/linux/man-pages/man3/fopen.3.html>.
>   - IBM's z/OS allows an encoding to be associated with a file as a filesystem attribute. See IBM's chtag documentation <https://www.ibm.com/docs/en/zos/2.3.0?topic=descriptions-chtag-change-file-tag-information>. z/OS also supports associating an encoding and enabling conversions for file streams, but I wasn't able to find documentation just now.
>
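A minimal sketch of the "ccs" mode string in use, assuming a C library that supports the extension (GNU libc and Microsoft's runtime document it; other libraries may ignore or reject it). The helper name write_as_utf8 is hypothetical. A stream opened this way is wide-oriented, so only wide-character functions such as fputws() should be used on it.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper for illustration only.
 * Creates (or truncates) the file at path and writes ws through a stream
 * opened with the non-standard ",ccs=UTF-8" mode suffix, so that the
 * library, where it supports the suffix, converts the wide characters
 * to UTF-8 on output.
 * Returns 0 on success, -1 if the file cannot be opened or written. */
int write_as_utf8(const char *path, const wchar_t *ws)
{
    FILE *fp = fopen(path, "w,ccs=UTF-8");
    int ok;

    if (fp == NULL)
        return -1;

    /* fputws() returns a non-negative value on success. */
    ok = fputws(ws, fp) >= 0;

    /* fclose() flushes the stream; a late conversion or write error
     * is reported here. */
    if (fclose(fp) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

The appeal of this design is that the encoding becomes a property of the stream rather than of the global locale, which is one way to sidestep the mixed-encoding output described earlier.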
> Tom.
>
> <sg16-2022-07-27.ics>

Received on 2022-07-25 22:05:17