ISOCPP sg16 List: Re: [SC22WG14.22423] Agenda for the 2022-07-27 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 26 Jul 2022 17:50:47 -0400

On 7/26/22 1:37 PM, Marcus Johnson via SG16 wrote:
> There are two real options: we can specify little endian for UTF-16
> and UTF-32, or we can specify that a BOM is prefixed to the output.
>
> The latter would probably surprise users about the length of the
> output string, and to say that big endian is a niche option is an
> understatement, so I’m leaning towards specifying little endian.

Neither of these options provides a useful result. Consider:

printf("Hi, %U16s\n", u"Marcus");

The output would start with "Hi, " encoded in the string literal
encoding and then be followed by the UTF-16 encoding of "Marcus" (in
some endian byte order). Mixing encodings like this produces mojibake;
this isn't useful behavior.
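
To make that concrete, here is a rough sketch of the byte stream the
little-endian option would produce for the call above. The helper names
(put_utf16le, mixed_output) are made up for illustration, and the sketch
assumes an ASCII-compatible string literal encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: append the UTF-16LE bytes of a NUL-terminated
   char16_t string to buf and return the number of bytes written.
   The caller must provide room for 2 bytes per code unit. */
static size_t put_utf16le(unsigned char *buf, const char16_t *s)
{
    size_t n = 0;
    assert(buf != NULL && s != NULL);
    for (; *s != 0; ++s) {
        buf[n++] = (unsigned char)(*s & 0xFF);        /* low byte first */
        buf[n++] = (unsigned char)((*s >> 8) & 0xFF); /* then high byte */
    }
    return n;
}

/* Hypothetical helper: build the bytes that "Hi, %U16s\n" would emit
   if the argument were written without any encoding conversion. */
static size_t mixed_output(unsigned char *buf, const char16_t *arg)
{
    size_t n = 0;
    memcpy(buf, "Hi, ", 4);          /* prefix in the literal encoding */
    n += 4;
    n += put_utf16le(buf + n, arg);  /* raw UTF-16LE code units */
    buf[n++] = '\n';                 /* back to the literal encoding */
    return n;
}
```

A consumer that expects ASCII or UTF-8 sees a NUL byte after every
letter of "Marcus", and a consumer that expects UTF-16 reads "Hi, " as
two unrelated code units.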

What we ideally want is for the formatted I/O to be consistently
encoded. Unfortunately, we don't have a way to know the programmer's
intentions and there are at least two reasonable behaviors:

1. Assume that the string literal encoding (used for the format string)
    encodes characters in a compatible subset of the run-time locale
    encoding and convert UTF arguments to the run-time locale encoding.
    This is the existing practice with the 'l' wide string conversion
    specifier.
2. Convert the UTF arguments to the same encoding as the (compile-time)
    string literal encoding. This would require a novel mechanism to
    inform formatted I/O functions at run-time which encoding was used
    for string literals in the TU that is invoking the formatted I/O
    function. Note that different TUs may be compiled with different
    string literal encodings (e.g., via different -fexec-charset gcc
    options), and some implementations also support pragma directives
    that can change the string literal encoding in the middle of a TU.
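
As a minimal sketch of the first behavior, the conversion of a UTF-16
argument to the run-time locale encoding can be expressed with the C11
c16rtomb() function from <uchar.h>. The function name utf16_to_locale
is hypothetical, and the sketch assumes that char16_t strings hold
UTF-16 (i.e., that __STDC_UTF_16__ is defined).

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: convert a NUL-terminated UTF-16 string to the
   multibyte encoding of the current locale. Returns the number of bytes
   stored (excluding the terminating NUL), or (size_t)-1 if a character
   cannot be converted or the result does not fit in cap bytes. */
static size_t utf16_to_locale(char *dst, size_t cap, const char16_t *src)
{
    mbstate_t st;
    char tmp[MB_LEN_MAX];   /* large enough for any one conversion */
    size_t n = 0;

    assert(dst != NULL && src != NULL && cap > 0);
    memset(&st, 0, sizeof st);          /* initial conversion state */

    for (; *src != 0; ++src) {
        /* Returns 0 for a high surrogate (state is kept in st) and the
           byte count once a complete character is available. */
        size_t r = c16rtomb(tmp, *src, &st);
        if (r == (size_t)-1 || n + r >= cap)
            return (size_t)-1;
        memcpy(dst + n, tmp, r);
        n += r;
    }
    dst[n] = '\0';
    return n;
}
```

A caller would normally run setlocale(LC_ALL, "") first so that the
conversion targets the user's locale. Characters that the locale
encoding cannot represent make c16rtomb() fail, which the sketch simply
reports as an error.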

If we want to do better than the above, then I personally think we need
a way for the programmer to associate an encoding with an I/O stream
such that formatted I/O functions can then convert all of their
inputs/outputs (including the format string) to/from the stream encoding.
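
Purely as an illustration of that idea, and not a proposed interface,
the sketch below tags an in-memory output sink with an encoding and
routes both the format string text and a UTF-16 argument through the
same encoder. All names (enc_stream, put_literal, put_utf16, and so on)
are hypothetical, the format string text is assumed to be ASCII, and
unpaired surrogates are not diagnosed.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

enum stream_encoding { ENC_UTF8, ENC_UTF16LE };   /* hypothetical */

/* Hypothetical output sink tagged with the encoding of its contents. */
struct enc_stream {
    unsigned char *buf;
    size_t len, cap;
    enum stream_encoding enc;
};

static void put_byte(struct enc_stream *s, unsigned b)
{
    assert(s->len < s->cap);    /* sketch: the caller sizes the buffer */
    s->buf[s->len++] = (unsigned char)b;
}

static void put_unit16le(struct enc_stream *s, unsigned u)
{
    put_byte(s, u & 0xFF);
    put_byte(s, (u >> 8) & 0xFF);
}

/* Encode one code point in the encoding associated with the stream. */
static void put_cp(struct enc_stream *s, char32_t cp)
{
    if (s->enc == ENC_UTF16LE) {
        if (cp < 0x10000) {
            put_unit16le(s, (unsigned)cp);
        } else {                /* split into a surrogate pair */
            cp -= 0x10000;
            put_unit16le(s, (unsigned)(0xD800 | (cp >> 10)));
            put_unit16le(s, (unsigned)(0xDC00 | (cp & 0x3FF)));
        }
    } else if (cp < 0x80) {
        put_byte(s, (unsigned)cp);
    } else if (cp < 0x800) {
        put_byte(s, (unsigned)(0xC0 | (cp >> 6)));
        put_byte(s, (unsigned)(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        put_byte(s, (unsigned)(0xE0 | (cp >> 12)));
        put_byte(s, (unsigned)(0x80 | ((cp >> 6) & 0x3F)));
        put_byte(s, (unsigned)(0x80 | (cp & 0x3F)));
    } else {
        put_byte(s, (unsigned)(0xF0 | (cp >> 18)));
        put_byte(s, (unsigned)(0x80 | ((cp >> 12) & 0x3F)));
        put_byte(s, (unsigned)(0x80 | ((cp >> 6) & 0x3F)));
        put_byte(s, (unsigned)(0x80 | (cp & 0x3F)));
    }
}

/* Format string text: assumed ASCII, so each char is one code point. */
static void put_literal(struct enc_stream *s, const char *text)
{
    for (; *text != '\0'; ++text)
        put_cp(s, (unsigned char)*text);
}

/* UTF-16 argument: decode surrogate pairs, then re-encode. */
static void put_utf16(struct enc_stream *s, const char16_t *arg)
{
    while (*arg != 0) {
        char32_t cp = *arg++;
        if (cp >= 0xD800 && cp <= 0xDBFF && *arg >= 0xDC00 && *arg <= 0xDFFF)
            cp = 0x10000 + ((cp - 0xD800) << 10) + (char32_t)(*arg++ - 0xDC00);
        put_cp(s, cp);
    }
}
```

With this shape, put_literal(&s, "Hi, ") followed by
put_utf16(&s, u"Marcus") yields consistently encoded output whichever
encoding the stream was tagged with.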

Tom.

>
>> On Jul 26, 2022, at 12:47 PM, Marcus Johnson
>> <MarcusLJohnson1991_at_[hidden]> wrote:
>>
>> Good point, I wasn't accounting for byte order. Not sure how to add
>> the wording for that; something to think about.
>>
>>> On Jul 26, 2022, at 11:57 AM, Tom Honermann <tom_at_[hidden]> wrote:
>>>
>>> On 7/25/22 7:30 PM, Marcus Johnson via SG16 wrote:
>>>> Quick update on what's changed, I implemented most feedback from
>>>> WG14, and I've decoupled my paper from JeanHeyd's, we're simply
>>>> writing Unicode as it is to the associated stream or reading from
>>>> that stream into the variable as indicated by the user.
>>>>
>>>> No character set conversions or otherwise anti-WYSIWYG behavior,
>>>> just reading and writing Unicode.
>>>
>>> I don't understand what that would mean. Do you mean writing the
>>> bytes of each code point in native endian order? That would not be
>>> useful.
>>>
>>> Encoding conversions are absolutely required for this feature to
>>> make sense.
>>>
>>> Tom.
>>>
>>>>
>>>>> On Jul 25, 2022, at 6:05 PM, Marcus Johnson
>>>>> <marcusljohnson1991_at_[hidden]> wrote:
>>>>>
>>>>> Hey Tom, here's the latest version with the feedback from WG14:
>>>>> https://drive.google.com/file/d/1_cMp-Td_1t0GfwngKN9a8s8hWGJSeASt/view?usp=sharing
>>>>>
>>>>>
>>>>>> On Jul 21, 2022, at 6:46 PM, Tom Honermann <tom_at_[hidden]> wrote:
>>>>>>
>>>>>> SG16 will hold a telecon on Wednesday, July 27th, at 19:30 UTC
>>>>>> (timezone conversion
>>>>>> <https://www.timeanddate.com/worldclock/converter.html?iso=20220727T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>).
>>>>>>
>>>>>> Please note that this message is being sent to the WG14 mailing list.
>>>>>>
>>>>>> Interested WG14 members are encouraged to attend this meeting. A
>>>>>> calendar event (.ics) file containing the meeting details is
>>>>>> attached. Alternatively, meeting details can be found here
>>>>>> <https://documents.isocpp.org/index.php/apps/calendar/p/R7imgS2LJD9xfeWN/dayGridMonth/now/view/sidebar/L3JlbW90ZS5waHAvZGF2L3B1YmxpYy1jYWxlbmRhcnMvUjdpbWdTMkxKRDl4ZmVXTi81QkE1NTVFRC0xOTFCLTRERUQtQUFFMi01Q0Q1OTQwMDM4NjYuaWNz/1658950200>.
>>>>>>
>>>>>> The agenda is:
>>>>>>
>>>>>> * WG14 N3016: Unicode Length Modifiers v3
>>>>>> <https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3016.pdf>
>>>>>>
>>>>>> The linked paper proposes additional length modifiers (U8, U16,
>>>>>> and U32) for the printf() and scanf() family of functions that
>>>>>> enable them to write and read UTF-8, UTF-16, and UTF-32 encoded
>>>>>> text in char8_t, char16_t, and char32_t based storage via
>>>>>> conversion from/to the (locale sensitive) execution encoding
>>>>>> (consistent with conversions that are performed for text in
>>>>>> wchar_t based storage). For example:
>>>>>>
>>>>>> printf("From a UTF-8 string: %U8s\n", u8"text");
>>>>>> printf("From a UTF-16 character: %U16c\n", u'X');
>>>>>>
>>>>>> WG14 discussed the paper during their committee meeting this week
>>>>>> but declined to adopt it for C23 due to general concerns about
>>>>>> encoding issues, a desire to consider the larger design space,
>>>>>> and dependencies on text conversion facilities not currently
>>>>>> required by the C standard. The encoding concerns match those
>>>>>> we've discussed before and underscore the reasons that none of
>>>>>> std::format(), std::print(), or C++ iostreams support output from
>>>>>> UTF encoded text in char8_t, char16_t, and char32_t based storage.
>>>>>>
>>>>>> Consider the following code and the existing text conversion
>>>>>> support currently required for wide strings (the contents of ws
>>>>>> will be converted to the locale sensitive execution encoding).
>>>>>>
>>>>>> wchar_t ws[] = L"...";
>>>>>> printf("<text>: %ls\n", ws);
>>>>>>
>>>>>> Programmers using an implementation that encodes string literals
>>>>>> as UTF-8 will most likely expect the example to produce UTF-8
>>>>>> output regardless of the execution encoding associated with the
>>>>>> run-time locale. However, if run in an environment that uses a
>>>>>> locale with a different encoding (e.g., Windows-1252 as is the
>>>>>> common case for Windows machines located in the United States),
>>>>>> then the output will contain a mix of UTF-8 and non-UTF-8 encoded
>>>>>> text.
>>>>>>
>>>>>> The same problem occurs for C++ with:
>>>>>>
>>>>>> std::cout << "<text>: " << ws << "\n";
>>>>>>
>>>>>> WG21 has so far avoided these concerns with regard to char8_t,
>>>>>> char16_t, and char32_t; no support is currently provided for
>>>>>> formatting text in storage of these types with any of
>>>>>> std::format(), std::print(), or iostreams. This limits the
>>>>>> usability of these types and portable support for UTF encoded
>>>>>> text in general.
>>>>>>
>>>>>> In this meeting, we'll discuss these concerns and the design
>>>>>> space for improving the situation. Some items to consider:
>>>>>>
>>>>>> 1. When designing std::format() and std::print(), WG21 has
>>>>>> chosen to ignore the locale dependent execution encoding in
>>>>>> several situations when the encoding of string literals is
>>>>>> known to be UTF-8.
>>>>>> 2. Many implementations offer text conversion facilities as part
>>>>>> of their I/O environment:
>>>>>> 1. Microsoft's fopen() implementation allows a file to be
>>>>>> opened as text with a specified encoding. From
>>>>>> Microsoft's documentation
>>>>>> <https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/fopen-wfopen?view=msvc-170>:
>>>>>> FILE *fp = fopen("newfile.txt", "rt+, ccs=UTF-8");
>>>>>> 2. GNU libc's fopen() implementation similarly allows an
>>>>>> associated encoding via a "ccs" mode string. See Linux
>>>>>> documentation
>>>>>> <https://man7.org/linux/man-pages/man3/fopen.3.html>.
>>>>>> 3. IBM's z/OS allows an encoding to be associated with a
>>>>>> file as a filesystem attribute. See IBM's chtag
>>>>>> documentation
>>>>>> <https://www.ibm.com/docs/en/zos/2.3.0?topic=descriptions-chtag-change-file-tag-information>.
>>>>>> z/OS also supports associating an encoding and enabling
>>>>>> conversions for file streams, but I wasn't able to find
>>>>>> documentation just now.
>>>>>>
>>>>>> Tom.
>>>>>>
>>>>>> <sg16-2022-07-27.ics>
>>>>>
>>>>
>>>>
>>
>

Received on 2022-07-26 21:50:50