On 7/25/22 7:30 PM, Marcus Johnson via SG16 wrote:
Quick update on what's changed: I implemented most of the feedback from WG14, and I've decoupled my paper from JeanHeyd's. We're simply writing Unicode as it is to the associated stream, or reading from that stream into the variable indicated by the user.

No character set conversions or other anti-WYSIWYG behavior, just reading and writing Unicode.

I don't understand what that would mean. Do you mean writing the bytes of each code point in native endian order? That would not be useful.

Encoding conversions are absolutely required for this feature to make sense.
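
To make the distinction concrete, here is a minimal sketch (not taken from either paper) contrasting the two readings of "writing Unicode as it is". The helper names raw_bytes and utf8_encode are hypothetical. The first copies the in-memory bytes of a char32_t, so the result depends on the host byte order; the second performs an actual encoding conversion to UTF-8, which yields the same bytes everywhere.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* size_t */
#include <string.h>   /* memcpy */
#include <uchar.h>    /* char32_t */

static_assert(sizeof(char32_t) == 4, "sketch assumes a 4-byte char32_t");

/* Hypothetical helper: copies the object representation of cp.
   The byte order of the result depends on the host (little vs. big endian). */
size_t raw_bytes(char32_t cp, unsigned char out[4])
{
    memcpy(out, &cp, sizeof cp);
    return sizeof cp;
}

/* Hypothetical helper: encodes cp as UTF-8.
   Returns the number of bytes written (1 to 4), or 0 if cp is a
   surrogate or lies above U+10FFFF. */
size_t utf8_encode(char32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                      /* surrogates are not scalar values */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For U+20AC, utf8_encode produces the three bytes E2 82 AC on every platform, whereas raw_bytes always produces four bytes whose order differs between little-endian and big-endian hosts.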

Tom.


On Jul 25, 2022, at 6:05 PM, Marcus Johnson <marcusljohnson1991@gmail.com> wrote:

Hey Tom, here's the latest version with the feedback from WG14: https://drive.google.com/file/d/1_cMp-Td_1t0GfwngKN9a8s8hWGJSeASt/view?usp=sharing

On Jul 21, 2022, at 6:46 PM, Tom Honermann <tom@honermann.net> wrote:

SG16 will hold a telecon on Wednesday, July 27th, at 19:30 UTC (timezone conversion).

Please note that this message is being sent to the WG14 mailing list.

Interested WG14 members are encouraged to attend this meeting. A calendar event (.ics) file containing the meeting details is attached. Alternatively, meeting details can be found here.

The agenda is:

The linked paper proposes additional length modifiers (U8, U16, and U32) for the printf() and scanf() family of functions that enable them to write and read UTF-8, UTF-16, and UTF-32 encoded text in char8_t, char16_t, and char32_t based storage via conversion from/to the (locale sensitive) execution encoding (consistent with conversions that are performed for text in wchar_t based storage). For example:

printf("From a UTF-8 string: %U8s\n", u8"text");
printf("From a UTF-16 character: %U16c\n", u'X');

WG14 discussed the paper during their committee meeting this week but declined to adopt it for C23 due to general concerns about encoding issues, a desire to consider the larger design space, and dependencies on text conversion facilities not currently required by the C standard. The encoding concerns match those we've discussed before and underscore the reasons that none of std::format(), std::print(), or C++ iostreams support output from UTF encoded text in char8_t, char16_t, and char32_t based storage.

Consider the following code and the text conversion support currently required for wide strings (the contents of ws will be converted to the locale-sensitive execution encoding).

wchar_t ws[] = L"...";
printf("<text>: %ls\n", ws);

Programmers using an implementation that encodes string literals as UTF-8 will most likely expect the example to produce UTF-8 output regardless of the execution encoding associated with the run-time locale. However, if run in an environment that uses a locale with a different encoding (e.g., Windows-1252 as is the common case for Windows machines located in the United States), then the output will contain a mix of UTF-8 and non-UTF-8 encoded text.
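
The mixing can be sketched with a small wrapper around snprintf(). The name format_labeled is hypothetical. The bytes of label are copied through unchanged by %s, while ws is converted to the current locale's encoding by %ls, so the two parts of the output can end up in different encodings.

```c
#include <assert.h>   /* assert */
#include <stdio.h>    /* snprintf */
#include <wchar.h>    /* wchar_t */

/* Hypothetical helper: formats "<label>: <ws>\n" into dst.
   label is copied byte for byte; ws is converted to the encoding of the
   current locale, exactly as printf("%ls") would do.
   Returns snprintf's result (negative if ws cannot be converted). */
int format_labeled(char *dst, size_t n, const char *label, const wchar_t *ws)
{
    assert(dst != NULL && label != NULL && ws != NULL);
    return snprintf(dst, n, "%s: %ls\n", label, ws);
}
```

If label holds UTF-8 bytes such as "caf\xC3\xA9" and the program has called setlocale() with a Windows-1252 locale, the label stays UTF-8 while any non-ASCII content of ws is written as Windows-1252.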

The same problem occurs for C++ with:

std::cout << "<text>: " << ws << "\n";

WG21 has so far avoided these concerns with regard to char8_t, char16_t, and char32_t; no support is currently provided for formatting text in storage of these types with any of std::format(), std::print(), or iostreams. This limits the usability of these types and portable support for UTF encoded text in general.

In this meeting, we'll discuss these concerns and the design space for improving the situation. Some items to consider:

  1. When designing std::format() and std::print(), WG21 has chosen to ignore the locale dependent execution encoding in several situations when the encoding of string literals is known to be UTF-8.
  2. Many implementations offer text conversion facilities as part of their I/O environment:
    1. Microsoft's fopen() implementation allows a file to be opened as text with a specified encoding. From Microsoft's documentation:
      FILE *fp = fopen("newfile.txt", "rt+, ccs=UTF-8");
    2. GNU libc's fopen() implementation similarly allows an associated encoding via a "ccs" mode string. See Linux documentation.
    3. IBM's z/OS allows an encoding to be associated with a file as a filesystem attribute. See IBM's chtag documentation. z/OS also supports associating an encoding and enabling conversions for file streams, but I wasn't able to find documentation just now.
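
As a sketch of item 2.2, the following relies on the glibc-specific ",ccs=" mode extension, which marks the stream as wide-oriented and converts wide output to the named character set. The helper name write_wide_text is hypothetical, and other C libraries may ignore or reject the extension, so this should be read as an assumption-laden illustration rather than portable code.

```c
#include <assert.h>   /* assert */
#include <stdio.h>    /* fopen, fclose */
#include <wchar.h>    /* fputws */

/* Hypothetical helper: writes text to path through a stream that glibc
   converts to UTF-8, independent of the run-time locale's encoding.
   Returns 0 on success and -1 on failure. */
int write_wide_text(const char *path, const wchar_t *text)
{
    assert(path != NULL && text != NULL);

    /* ",ccs=UTF-8" is a glibc extension; it is not part of ISO C. */
    FILE *fp = fopen(path, "w,ccs=UTF-8");
    if (fp == NULL)
        return -1;

    int ok = fputws(text, fp) >= 0;     /* wide output, converted on write */
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

The Microsoft form shown in item 2.1 uses a slightly different mode string ("rt+, ccs=UTF-8"), so a portable program would need to select the spelling per platform.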

Tom.

<sg16-2022-07-27.ics>