On 7/28/22 8:05 AM, Marcus Johnson via SG16 wrote:
Hey Tom, I've been thinking about that fread/fwrite encoding issue and wrote something up real quick. It operates on file handles, not descriptors. What do you think?

I think we should aim for something that more closely follows existing practice and that doesn't require a hard-coded list of encodings. For reference, see:

These implementations suggest we can make progress by:

  1. Extending fopen() to support the "ccs=encoding" flag with an implementation-defined set of encodings that minimally includes UTF-8 (and perhaps others). Presumably, the Microsoft and GNU libc implementations would already be conforming. z/OS would have to be modified to support the flag (presumably by modifying fopen() to as-if call fcntl() in its implementation).
  2. Providing an interface to query (and perhaps set) the stream encoding; presumably something very similar to what z/OS offers with its fcntl() interface.
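
As a rough illustration of item 1, the sketch below shows one way an fopen() implementation might pick the encoding name out of a mode string such as "w,ccs=utf-8". The helper name parse_ccs() and its return convention are hypothetical and are not taken from any existing libc; real implementations may parse the flag differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of any libc): copies the encoding name that
   follows "ccs=" in an fopen() mode string into out, which holds n bytes.
   Returns 1 if a ccs flag was found and its name fit, 0 otherwise. */
int parse_ccs(const char *mode, char *out, size_t n)
{
    assert(mode != NULL && out != NULL && n > 0);
    out[0] = '\0';

    const char *p = strstr(mode, "ccs=");
    if (p == NULL)
        return 0;                      /* no encoding requested */
    p += 4;                            /* skip past "ccs=" */

    size_t len = strcspn(p, ", ");     /* name ends at ',', ' ', or end of string */
    if (len == 0 || len >= n)
        return 0;                      /* empty name, or too long for out */

    memcpy(out, p, len);
    out[len] = '\0';
    return 1;
}
```

Whether the resulting name is accepted would remain implementation-defined, with UTF-8 as the minimum.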

Coupling the above with your N3016 (in which UTF-encoded inputs are converted to the locale encoding) would suffice to enable portable code to produce well-formed output in arbitrary encodings. For example, the following would reliably produce well-formed UTF-8 via a two-step conversion process: printf() produces characters in the MBCS (MultiByte Character Set; the locale encoding), and those characters are then converted to UTF-8. Logically, the conversions can be described as noted in the comments.

#include <stdio.h>
int main(void) {
  FILE *fp = fopen("file.txt", "w,ccs=utf-8");
  if (!fp) {
    return 1;  // fopen() failed, or the ccs flag is not supported.
  }
  fprintf(fp, "text, %s, %ls, %U8s, %U16s, %U32s", // MBCS -> UTF-8 (for non-field specifier characters).
          "text",    // MBCS           -> UTF-8.
          L"text",   // Wide   -> MBCS -> UTF-8.
          u8"text",  // UTF-8  -> MBCS -> UTF-8.
          u"text",   // UTF-16 -> MBCS -> UTF-8.
          U"text");  // UTF-32 -> MBCS -> UTF-8.
  fclose(fp);
  return 0;
}

As an optimization, implementations could skip the (potentially lossy) conversion to the MBCS encoding by converting the format string and inputs directly to the encoding associated with the file stream and then bypassing the lower-level conversions that would otherwise be performed. I don't know if any implementations attempt such optimizations today.
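
To make the direct path concrete for one case (a UTF-32 input written to a UTF-8 stream), here is a minimal sketch of encoding a code point straight to UTF-8 with no intermediate MBCS step. The function name utf32_to_utf8() is hypothetical, and the sketch says nothing about how an actual libc would wire this into fprintf().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes one Unicode scalar value as UTF-8 into out,
   which must hold at least 4 bytes. Returns the number of bytes written,
   or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf32_to_utf8(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                               /* 1 byte: ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                              /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                            /* 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                              /* surrogates are not scalar values */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                          /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                      /* out of range */
}
```

Because every Unicode scalar value has a UTF-8 encoding, this path cannot lose characters the way a round trip through a narrower locale encoding can.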

This approach has the downside that it only enables conversion to an arbitrary encoding for the formatted I/O functions that read/write through FILE handles. sprintf(), snprintf(), and extensions like dprintf() would still be limited to producing MBCS encoded text. But that might be ok.

Please start a new email thread with an appropriate subject for any further discussion (feel free to copy anything relevant from this thread when doing so).

Tom.


Basically just fwide, but for encodings

typedef enum fCharacterSetModes {
    fCharacterSetMode_ReadValue = 0,
    fCharacterSetMode_PrependBOM = 1,
    fCharacterSetMode_Ascii = 2,
    fCharacterSetMode_UTF8 = 3,
    fCharacterSetMode_UTF8_BOM = 4,
    fCharacterSetMode_UTF16_LE = 5,
    fCharacterSetMode_UTF16_LE_BOM = 6,
    fCharacterSetMode_UTF16_BE = 7,
    fCharacterSetMode_UTF16_BE_BOM = 8,
    fCharacterSetMode_UTF32_LE = 9,
    fCharacterSetMode_UTF32_LE_BOM = 10,
    fCharacterSetMode_UTF32_BE = 11,
    fCharacterSetMode_UTF32_BE_BOM = 12,
    // Values above 12 are reserved for implementations to expose their own code page values.
} fCharacterSetModes;

fCharacterSetModes fUnicode(FILE *Stream, fCharacterSetModes Mode);
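
For the *_BOM enumerators, the byte sequences an implementation would presumably prepend are the standard Unicode byte order marks. The sketch below maps the numeric values from the enum above to those bytes. The SketchBom type and sketch_bom_bytes() function are hypothetical names used only for illustration (renamed so they do not clash with the proposal), and the proposal itself does not say how the BOM would be written.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins mirroring the values of the *_BOM enumerators. */
typedef enum SketchBom {
    SketchBom_UTF8     = 4,
    SketchBom_UTF16_LE = 6,
    SketchBom_UTF16_BE = 8,
    SketchBom_UTF32_LE = 10,
    SketchBom_UTF32_BE = 12
} SketchBom;

/* Stores the byte order mark for mode in out (at least 4 bytes) and returns
   its length in bytes, or 0 if mode does not call for a BOM. */
size_t sketch_bom_bytes(int mode, unsigned char out[4])
{
    static const unsigned char utf8[]    = { 0xEF, 0xBB, 0xBF };
    static const unsigned char utf16le[] = { 0xFF, 0xFE };
    static const unsigned char utf16be[] = { 0xFE, 0xFF };
    static const unsigned char utf32le[] = { 0xFF, 0xFE, 0x00, 0x00 };
    static const unsigned char utf32be[] = { 0x00, 0x00, 0xFE, 0xFF };

    assert(out != NULL);
    switch (mode) {
    case SketchBom_UTF8:     memcpy(out, utf8, sizeof utf8);       return sizeof utf8;
    case SketchBom_UTF16_LE: memcpy(out, utf16le, sizeof utf16le); return sizeof utf16le;
    case SketchBom_UTF16_BE: memcpy(out, utf16be, sizeof utf16be); return sizeof utf16be;
    case SketchBom_UTF32_LE: memcpy(out, utf32le, sizeof utf32le); return sizeof utf32le;
    case SketchBom_UTF32_BE: memcpy(out, utf32be, sizeof utf32be); return sizeof utf32be;
    default:                 return 0;  /* non-BOM or implementation-defined mode */
    }
}
```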