ISOCPP sg16 List: Re: [SC22WG14.22413] Agenda for the 2022-07-27 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 26 Jul 2022 11:57:52 -0400

On 7/25/22 7:30 PM, Marcus Johnson via SG16 wrote:
> Quick update on what's changed: I implemented most of the feedback
> from WG14 and decoupled my paper from JeanHeyd's. We're simply writing
> Unicode as it is to the associated stream, or reading from that stream
> into the variable indicated by the user.
>
> No character set conversions or other anti-WYSIWYG behavior, just
> reading and writing Unicode.

I don't understand what that would mean. Do you mean writing the bytes
of each code point in native endian order? That would not be useful.

Encoding conversions are absolutely required for this feature to make sense.
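
To sketch the difference (this is not from the paper; the function names
are hypothetical, and the code assumes a C11 library that provides
<uchar.h>): the first function below copies code units in native byte
order, which is what "writing Unicode as it is" seems to imply, while the
second converts to the locale's execution encoding via c16rtomb().

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: copy the object representation of each UTF-16
   code unit. The byte order of the result follows the host, so a
   consumer of the stream cannot interpret it reliably. */
size_t dump_raw_units(const char16_t *s, size_t n, unsigned char *out)
{
    memcpy(out, s, n * sizeof *s);
    return n * sizeof *s;
}

/* Hypothetical helper: convert n UTF-16 code units to the multibyte
   encoding of the current locale. Returns the number of bytes written,
   or (size_t)-1 if a code unit cannot be converted or out is too small. */
size_t convert_to_execution_encoding(const char16_t *s, size_t n,
                                     char *out, size_t cap)
{
    mbstate_t st;
    size_t used = 0;

    memset(&st, 0, sizeof st);          /* initial conversion state */
    for (size_t i = 0; i < n; ++i) {
        char buf[MB_LEN_MAX];
        /* A high surrogate yields 0 bytes; the state carries it over. */
        size_t r = c16rtomb(buf, s[i], &st);

        if (r == (size_t)-1)
            return (size_t)-1;          /* not representable in locale */
        if (r > cap - used)
            return (size_t)-1;          /* output buffer too small */
        memcpy(out + used, buf, r);
        used += r;
    }
    return used;
}
```

For u"text" on a typical implementation with a 16-bit char16_t, the first
function yields 8 bytes with embedded zero bytes whose order depends on
the host; the second yields the 4 bytes "text" in the default "C" locale.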

Tom.

>
>> On Jul 25, 2022, at 6:05 PM, Marcus Johnson
>> <marcusljohnson1991_at_[hidden]> wrote:
>>
>> Hey Tom, here's the latest version with the feedback from WG14:
>> https://drive.google.com/file/d/1_cMp-Td_1t0GfwngKN9a8s8hWGJSeASt/view?usp=sharing
>>
>>
>>> On Jul 21, 2022, at 6:46 PM, Tom Honermann <tom_at_[hidden]> wrote:
>>>
>>> SG16 will hold a telecon on Wednesday, July 27th, at 19:30 UTC
>>> (timezone conversion
>>> <https://www.timeanddate.com/worldclock/converter.html?iso=20220727T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>).
>>>
>>> Please note that this message is being sent to the WG14 mailing list.
>>>
>>> Interested WG14 members are encouraged to attend this meeting. A
>>> calendar event (.ics) file containing the meeting details is
>>> attached. Alternatively, meeting details can be found here
>>> <https://documents.isocpp.org/index.php/apps/calendar/p/R7imgS2LJD9xfeWN/dayGridMonth/now/view/sidebar/L3JlbW90ZS5waHAvZGF2L3B1YmxpYy1jYWxlbmRhcnMvUjdpbWdTMkxKRDl4ZmVXTi81QkE1NTVFRC0xOTFCLTRERUQtQUFFMi01Q0Q1OTQwMDM4NjYuaWNz/1658950200>.
>>>
>>> The agenda is:
>>>
>>> * WG14 N3016: Unicode Length Modifiers v3
>>> <https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3016.pdf>
>>>
>>> The linked paper proposes additional length modifiers (U8, U16, and
>>> U32) for the printf() and scanf() family of functions that enable
>>> them to write and read UTF-8, UTF-16, and UTF-32 encoded text in
>>> char8_t, char16_t, and char32_t based storage via conversion from/to
>>> the (locale sensitive) execution encoding (consistent with
>>> conversions that are performed for text in wchar_t based storage).
>>> For example:
>>>
>>> printf("From a UTF-8 string: %U8s\n", u8"text");
>>> printf("From a UTF-16 character: %U16c\n", u'X');
>>>
>>> WG14 discussed the paper during their committee meeting this week
>>> but declined to adopt it for C23 due to general concerns about
>>> encoding issues, a desire to consider the larger design space, and
>>> dependencies on text conversion facilities not currently required by
>>> the C standard. The encoding concerns match those we've discussed
>>> before and underscore the reasons that none of std::format(),
>>> std::print(), or C++ iostreams support output from UTF encoded text
>>> in char8_t, char16_t, and char32_t based storage.
>>>
>>> Consider the following code and the existing text conversion support
>>> currently required for wide strings (the contents of ws will be
>>> converted to the locale sensitive execution encoding).
>>>
>>> wchar_t ws[] = L"...";
>>> printf("<text>: %ls\n", ws);
>>>
>>> Programmers using an implementation that encodes string literals as
>>> UTF-8 will most likely expect the example to produce UTF-8 output
>>> regardless of the execution encoding associated with the run-time
>>> locale. However, if run in an environment that uses a locale with a
>>> different encoding (e.g., Windows-1252 as is the common case for
>>> Windows machines located in the United States), then the output will
>>> contain a mix of UTF-8 and non-UTF-8 encoded text: the format string
>>> is written as-is in UTF-8, while the contents of ws are converted to
>>> the locale's encoding.
>>>
>>> The same problem occurs for C++ with:
>>>
>>> std::cout << "<text>: " << ws << "\n";
>>>
>>> WG21 has so far avoided these concerns with regard to char8_t,
>>> char16_t, and char32_t; no support is currently provided for
>>> formatting text in storage of these types with any of std::format(),
>>> std::print(), or iostreams. This limits the usability of these types
>>> and portable support for UTF encoded text in general.
>>>
>>> In this meeting, we'll discuss these concerns and the design space
>>> for improving the situation. Some items to consider:
>>>
>>> 1. When designing std::format() and std::print(), WG21 has chosen
>>> to ignore the locale dependent execution encoding in several
>>> situations when the encoding of string literals is known to be
>>> UTF-8.
>>> 2. Many implementations offer text conversion facilities as part of
>>> their I/O environment:
>>> 1. Microsoft's fopen() implementation allows a file to be
>>> opened as text with a specified encoding. From Microsoft's
>>> documentation
>>> <https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/fopen-wfopen?view=msvc-170>:
>>> FILE *fp = fopen("newfile.txt", "rt+, ccs=UTF-8");
>>> 2. GNU libc's fopen() implementation similarly allows an
>>> associated encoding via a "ccs" mode string. See Linux
>>> documentation
>>> <https://man7.org/linux/man-pages/man3/fopen.3.html>.
>>> 3. IBM's z/OS allows an encoding to be associated with a file
>>> as a filesystem attribute. See IBM's chtag documentation
>>> <https://www.ibm.com/docs/en/zos/2.3.0?topic=descriptions-chtag-change-file-tag-information>.
>>> z/OS also supports associating an encoding and enabling
>>> conversions for file streams, but I wasn't able to find
>>> documentation just now.
>>>
>>> Tom.
>>>
>>> <sg16-2022-07-27.ics>
>>
>
>

Received on 2022-07-26 15:57:54