ISOCPP sg16 List: Re: [SC22WG14.22429] Agenda for the 2022-07-27 SG16 telecon

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 28 Jul 2022 17:04:35 -0400

On 7/28/22 8:05 AM, Marcus Johnson via SG16 wrote:
> Hey Tom, been thinking about that fread/fwrite encoding issue, wrote
> something up real quick, though it operates on file handles not
> descriptors, what do you think though?

I think we should aim for something that more closely follows existing
practice and that doesn't require a hard-coded list of encodings. For
reference, see:

  * Microsoft's fopen() documentation
    <https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/fopen-wfopen?view=msvc-170>
    and the "ccs=encoding" flag in the "mode" argument.
  * GNU libc's fopen() documentation
    <https://www.gnu.org/software/libc/manual/html_node/Opening-Streams.html>
    and the "ccs=STRING" flag in the "opentype" argument.
  * z/OS' fcntl() documentation
    <https://www.ibm.com/docs/en/zos/2.5.0?topic=SSLTBW_2.5.0/com.ibm.zos.v2r5.bpxbd00/rtfcndesc.htm>
    and the F_SETTAG and F_CONTROL_CVT commands and the f_cnvrt type.

These implementations suggest we can make progress by:

1. Extending fopen() to support the "ccs=encoding" flag with an
    implementation-defined set of encodings that minimally includes
    UTF-8 (and perhaps others). Presumably, the Microsoft and GNU libc
    implementations would already be conforming. z/OS would have to be
    modified to support the flag (presumably by modifying fopen() to
    as-if call fcntl() in its implementation).
2. Providing an interface to query (and perhaps set) the stream
    encoding; presumably something very similar to what z/OS offers with
    its fcntl() interface.
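
To make item 1 concrete, here is a minimal sketch of how an
implementation might pull the encoding name out of the "mode" argument.
The helper name mode_get_ccs() is hypothetical, and the sketch assumes
the flag is spelled "ccs=NAME" and that the name ends at a comma, a
space, or the end of the string; real implementations may parse the
mode string differently.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of any C library): copy the encoding
 * name that follows "ccs=" in an fopen() mode string into `out`.
 * Returns 1 if a non-empty name was found and fits in `out`
 * (including the terminating null), 0 otherwise. */
static int mode_get_ccs(const char *mode, char *out, size_t out_size)
{
    static const char key[] = "ccs=";
    const char *p = strstr(mode, key);
    if (p == NULL || out_size == 0)
        return 0;
    p += sizeof key - 1;              /* skip past "ccs=" */
    size_t len = strcspn(p, ", ");    /* name ends at ',', ' ', or end of string */
    if (len == 0 || len >= out_size)
        return 0;
    memcpy(out, p, len);
    out[len] = '\0';
    return 1;
}
```

With this, mode_get_ccs("w,ccs=utf-8", buf, sizeof buf) stores "utf-8"
in buf. Checking the name against the implementation-defined set of
encodings would be a separate step.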

Coupling the above with your N3016
<https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3016.pdf> (in which
UTF encoded inputs are converted to the locale encoding) would suffice
to enable portable code to produce well-formed output in arbitrary
encodings. For example, the following would reliably produce well-formed
UTF-8 via a two-step conversion process in which fprintf() produces MBCS
(MultiByte Character Set; locale encoding) encoded characters that are
then converted to UTF-8. Logically, the conversions can be described as
noted in the comments.

    #include <stdio.h>
    int main(void) {
       FILE *fp = fopen("file.txt", "w,ccs=utf-8");
       if (!fp) return 1; // The encoding may not be supported.
       fprintf(fp,
               // MBCS -> UTF-8 (for non-field specifier characters).
               "text, %s, %ls, %U8s, %U16s, %U32s",
               "text", // MBCS -> UTF-8.
               L"text", // Wide -> MBCS -> UTF-8.
               u8"text", // UTF-8 -> MBCS -> UTF-8.
               u"text", // UTF-16 -> MBCS -> UTF-8.
               U"text"); // UTF-32 -> MBCS -> UTF-8.
       fclose(fp);
    }

As an optimization, implementations could skip the (potentially lossy)
conversion to the MBCS encoding by converting the format string and
inputs directly to the encoding associated with the file stream and then
bypassing the lower level conversions that would otherwise be performed.
I don't know if any implementations attempt such optimizations today.
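
As an illustration of the direct path, the sketch below encodes UTF-32
input straight to UTF-8 without passing through the MBCS encoding. The
function names utf8_encode() and utf32_to_utf8() are made up for this
example, it only covers a stream whose encoding is UTF-8, and it is not
taken from any existing implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode one Unicode scalar value as UTF-8 into out[0..3].
 * Returns the number of bytes written, or 0 if `cp` is a surrogate
 * or lies above U+10FFFF. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                 /* surrogates are not scalar values */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Convert a zero-terminated UTF-32 string to a null-terminated UTF-8
 * string in `dst`. Returns the number of bytes written (excluding the
 * terminator), or (size_t)-1 on invalid input or insufficient space. */
static size_t utf32_to_utf8(const uint32_t *src, char *dst, size_t dst_size)
{
    size_t used = 0;
    for (; *src != 0; ++src) {
        unsigned char tmp[4];
        size_t n = utf8_encode(*src, tmp);
        /* Need room for n bytes plus the terminating null. */
        if (n == 0 || dst_size - used <= n)
            return (size_t)-1;
        memcpy(dst + used, tmp, n);
        used += n;
    }
    if (used >= dst_size)
        return (size_t)-1;
    dst[used] = '\0';
    return used;
}
```

Because no intermediate MBCS step is involved, characters that the
locale encoding cannot represent are not lost along the way.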

This approach has the downside that it only enables conversion to an
arbitrary encoding for the formatted I/O functions that read/write
through FILE handles. sprintf(), snprintf(), and extensions like
dprintf() would still be limited to producing MBCS encoded text. But
that might be ok.

Please start a new email thread with an appropriate subject for any
further discussion (feel free to copy anything relevant from this thread
when doing so).

Tom.

>
> Basically just fwide, but for encodings
>
> typedef enum fCharacterSetModes {
> fCharacterSetMode_ReadValue = 0,
> fCharacterSetMode_PrependBOM = 1,
> fCharacterSetMode_Ascii = 2,
> fCharacterSetMode_UTF8 = 3,
> fCharacterSetMode_UTF8_BOM = 4,
> fCharacterSetMode_UTF16_LE = 5,
> fCharacterSetMode_UTF16_LE_BOM = 6,
> fCharacterSetMode_UTF16_BE = 7,
> fCharacterSetMode_UTF16_BE_BOM = 8,
> fCharacterSetMode_UTF32_LE = 9,
> fCharacterSetMode_UTF32_LE_BOM = 10,
> fCharacterSetMode_UTF32_BE = 11,
> fCharacterSetMode_UTF32_BE_BOM = 12,
> // Any higher value is reserved for implementations to specify
> // their own code page values.
> } fCharacterSetModes;
>
> fCharacterSetModes fUnicode(FILE *Stream, fCharacterSetModes Mode);
>

Received on 2022-07-28 21:04:37