Date: Tue, 26 Jul 2022 18:28:03 -0400
That is a REALLY good point about mixing locale-sensitive character sets with Unicode. I never even thought of that possibility; I always did "%U8/16/32s" and just wrote the already processed Unicode string.
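
For reference, the first behavior Tom describes in the quoted message below (converting UTF arguments to the run-time locale encoding, as is already done for wide strings) might look roughly like the following. This is only a minimal sketch built on the C11 c16rtomb() function from <uchar.h>; the helper name utf16_to_locale is hypothetical and is not taken from N3016 or from any implementation, and a trailing unpaired surrogate is not diagnosed.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper (not part of N3016): convert a NUL-terminated
   UTF-16 string to the multibyte encoding of the current locale using
   c16rtomb(). Returns the number of bytes stored (not counting the
   terminating NUL), or (size_t)-1 if a character cannot be represented
   in the locale encoding or dst is too small. */
static size_t utf16_to_locale(char *dst, size_t dst_size, const char16_t *src)
{
    mbstate_t state;
    char buf[MB_LEN_MAX];
    size_t used = 0;

    assert(dst != NULL && src != NULL);
    if (dst_size == 0)
        return (size_t)-1;

    memset(&state, 0, sizeof state);          /* initial conversion state */
    for (; *src != 0; ++src) {
        size_t n = c16rtomb(buf, *src, &state);
        if (n == (size_t)-1)                  /* not representable in the locale */
            return (size_t)-1;
        if (n >= dst_size - used)             /* keep one byte for the NUL */
            return (size_t)-1;
        memcpy(dst + used, buf, n);
        used += n;
    }
    dst[used] = '\0';
    return used;
}

/* Possible use, after setlocale(LC_ALL, "") has selected the locale:
 *
 *     char out[64];
 *     if (utf16_to_locale(out, sizeof out, u"Marcus") != (size_t)-1)
 *         printf("Hi, %s\n", out);
 *
 * Both the format string and the converted argument are then written
 * in a single encoding, assuming the string literal encoding is a
 * compatible subset of the locale encoding. */
```

The point of the sketch is only that the conversion happens before the bytes reach the stream, so the output is not a mix of encodings.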
> On Jul 26, 2022, at 5:50 PM, Tom Honermann <tom_at_[hidden]> wrote:
>
> On 7/26/22 1:37 PM, Marcus Johnson via SG16 wrote:
>> There are two real options: we can specify little endian for UTF-16 and UTF-32, or we can specify that a BOM is prefixed to the output.
>>
>> The latter would probably surprise users about the length of the output string, and to say that big endian is a niche option is an understatement, so I’m leaning towards specifying little endian.
> Neither of these options provides a useful result. Consider:
>
> printf("Hi, %U16s\n", u"Marcus");
>
> The output would start with "Hi, " encoded in the string literal encoding and then be followed by the UTF-16 encoding of "Marcus" (in some endian byte order). Mixing encodings like this produces mojibake; this isn't useful behavior.
>
> What we ideally want is for the formatted I/O to be consistently encoded. Unfortunately, we don't have a way to know the programmer's intentions and there are at least two reasonable behaviors:
>
> 1. Assume that the string literal encoding (used for the format string) encodes characters in a compatible subset of the run-time locale encoding and convert UTF arguments to the run-time locale encoding. This is the existing practice with the 'l' wide string conversion specifier.
> 2. Convert the UTF arguments to the same encoding as the (compile-time) string literal encoding. This would require a novel mechanism to inform formatted I/O functions at run-time which encoding was used for string literals in the TU that is invoking the formatted I/O function (note that different TUs may be compiled with different string literal encodings (e.g., via different -fexec-charset gcc options). Some implementations also support pragma directives that can change the string literal encoding in the middle of a TU).
>
> If we want to do better than the above, then I personally think we need a way for the programmer to associate an encoding with an I/O stream such that formatted I/O functions can then convert all of their inputs/outputs (including the format string) to/from the stream encoding.
>
> Tom.
>
>>
>>> On Jul 26, 2022, at 12:47 PM, Marcus Johnson <MarcusLJohnson1991_at_[hidden]> wrote:
>>>
>>> Good point, I wasn't accounting for byte order. Not sure how to add the wording for that; something to think about.
>>>
>>>> On Jul 26, 2022, at 11:57 AM, Tom Honermann <tom_at_[hidden]> wrote:
>>>>
>>>> On 7/25/22 7:30 PM, Marcus Johnson via SG16 wrote:
>>>>> Quick update on what's changed: I implemented most feedback from WG14, and I've decoupled my paper from JeanHeyd's. We're simply writing Unicode as it is to the associated stream, or reading from that stream into the variable as indicated by the user.
>>>>>
>>>>> No character set conversions or otherwise anti-WYSIWYG behavior, just reading and writing Unicode.
>>>> I don't understand what that would mean. Do you mean writing the bytes of each code point in native endian order? That would not be useful.
>>>>
>>>> Encoding conversions are absolutely required for this feature to make sense.
>>>>
>>>> Tom.
>>>>
>>>>>
>>>>>> On Jul 25, 2022, at 6:05 PM, Marcus Johnson <marcusljohnson1991_at_[hidden]> wrote:
>>>>>>
>>>>>> Hey Tom, here's the latest version with the feedback from WG14: https://drive.google.com/file/d/1_cMp-Td_1t0GfwngKN9a8s8hWGJSeASt/view?usp=sharing
>>>>>>
>>>>>>> On Jul 21, 2022, at 6:46 PM, Tom Honermann <tom_at_[hidden]> wrote:
>>>>>>>
>>>>>>> SG16 will hold a telecon on Wednesday, July 27th, at 19:30 UTC (timezone conversion <https://www.timeanddate.com/worldclock/converter.html?iso=20220727T193000&p1=1440&p2=tz_pdt&p3=tz_mdt&p4=tz_cdt&p5=tz_edt&p6=tz_cest>).
>>>>>>>
>>>>>>> Please note that this message is being sent to the WG14 mailing list.
>>>>>>>
>>>>>>> Interested WG14 members are encouraged to attend this meeting. A calendar event (.ics) file containing the meeting details is attached. Alternatively, meeting details can be found here <https://documents.isocpp.org/index.php/apps/calendar/p/R7imgS2LJD9xfeWN/dayGridMonth/now/view/sidebar/L3JlbW90ZS5waHAvZGF2L3B1YmxpYy1jYWxlbmRhcnMvUjdpbWdTMkxKRDl4ZmVXTi81QkE1NTVFRC0xOTFCLTRERUQtQUFFMi01Q0Q1OTQwMDM4NjYuaWNz/1658950200>.
>>>>>>>
>>>>>>> The agenda is:
>>>>>>>
>>>>>>> WG14 N3016: Unicode Length Modifiers v3 <https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3016.pdf>
>>>>>>> The linked paper proposes additional length modifiers (U8, U16, and U32) for the printf() and scanf() family of functions that enable them to write and read UTF-8, UTF-16, and UTF-32 encoded text in char8_t, char16_t, and char32_t based storage via conversion from/to the (locale sensitive) execution encoding (consistent with conversions that are performed for text in wchar_t based storage). For example:
>>>>>>>
>>>>>>> printf("From a UTF-8 string: %U8s\n", u8"text");
>>>>>>> printf("From a UTF-16 character: %U16c\n", u'X');
>>>>>>>
>>>>>>> WG14 discussed the paper during their committee meeting this week but declined to adopt it for C23 due to general concerns about encoding issues, a desire to consider the larger design space, and dependencies on text conversion facilities not currently required by the C standard. The encoding concerns match those we've discussed before and underscore the reasons that none of std::format(), std::print(), or C++ iostreams support output from UTF encoded text in char8_t, char16_t, and char32_t based storage.
>>>>>>>
>>>>>>> Consider the following code and the existing text conversion support currently required for wide strings (the contents of ws will be converted to the locale sensitive execution encoding).
>>>>>>>
>>>>>>> wchar_t ws[] = L"...";
>>>>>>> printf("<text>: %ls\n", ws);
>>>>>>>
>>>>>>> Programmers using an implementation that encodes string literals as UTF-8 will most likely expect the example to produce UTF-8 output regardless of the execution encoding associated with the run-time locale. However, if run in an environment that uses a locale with a different encoding (e.g., Windows-1252 as is the common case for Windows machines located in the United States), then the output will contain a mix of UTF-8 and non-UTF-8 encoded text.
>>>>>>>
>>>>>>> The same problem occurs for C++ with:
>>>>>>>
>>>>>>> std::cout << "<text>: " << ws << "\n";
>>>>>>>
>>>>>>> WG21 has so far avoided these concerns with regard to char8_t, char16_t, and char32_t; no support is currently provided for formatting text in storage of these types with any of std::format(), std::print(), or iostreams. This limits the usability of these types and portable support for UTF encoded text in general.
>>>>>>>
>>>>>>> In this meeting, we'll discuss these concerns and the design space for improving the situation. Some items to consider:
>>>>>>>
>>>>>>> When designing std::format() and std::print(), WG21 has chosen to ignore the locale dependent execution encoding in several situations when the encoding of string literals is known to be UTF-8.
>>>>>>> Many implementations offer text conversion facilities as part of their I/O environment:
>>>>>>> Microsoft's fopen() implementation allows a file to be opened as text with a specified encoding. From Microsoft's documentation <https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/fopen-wfopen?view=msvc-170>:
>>>>>>> FILE *fp = fopen("newfile.txt", "rt+, ccs=UTF-8");
>>>>>>> GNU libc's fopen() implementation similarly allows an associated encoding via a "ccs" mode string. See Linux documentation <https://man7.org/linux/man-pages/man3/fopen.3.html>.
>>>>>>> IBM's z/OS allows an encoding to be associated with a file as a filesystem attribute. See IBM's chtag documentation <https://www.ibm.com/docs/en/zos/2.3.0?topic=descriptions-chtag-change-file-tag-information>. z/OS also supports associating an encoding and enabling conversions for file streams, but I wasn't able to find documentation just now.
>>>>>>> Tom.
>>>>>>>
>>>>>>> <sg16-2022-07-27.ics>
>>>>>>
>>>>>
>>>>>
>>>
>>
Received on 2022-07-26 22:28:07