Hey Tom, I've been thinking about that fread/fwrite encoding issue and wrote something up real quick. It operates on file handles, not descriptors. What do you think?
I think we should aim for something that more closely follows existing practice and that doesn't require a hard-coded list of encodings. For reference, see:
These implementations suggest we can make progress by:
Coupling the above with your N3016
(in which UTF encoded inputs are converted to the locale encoding)
would suffice to enable portable code to produce well-formed
output in arbitrary encodings. For example, the following would
reliably produce well-formed UTF-8 via a two-step conversion
process in which printf() produces
MBCS (MultiByte Character Set; locale encoding) encoded characters
that are then converted to UTF-8. Logically, the conversions can
be described as noted in the comments.
#include <stdio.h>

int main(void) {
    FILE *fp = fopen("file.txt", "w,ccs=utf-8");
    if (!fp)
        return 1; // The stream could not be opened with the requested encoding.
    fprintf(fp, "text, %s, %ls, %U8s, %U16s, %U32s", // MBCS -> UTF-8 (for non-field specifier characters).
            "text",    // MBCS -> UTF-8.
            L"text",   // Wide -> MBCS -> UTF-8.
            u8"text",  // UTF-8 -> MBCS -> UTF-8.
            u"text",   // UTF-16 -> MBCS -> UTF-8.
            U"text");  // UTF-32 -> MBCS -> UTF-8.
    fclose(fp);
}
As an optimization, implementations could skip the (potentially
lossy) conversion to the MBCS encoding by converting the format
string and inputs directly to the encoding associated with the
file stream and then bypassing the lower level conversions that
would otherwise be performed. I don't know if any implementations
attempt such optimizations today.
This approach has the downside that it only enables conversion to
an arbitrary encoding for the formatted I/O functions that
read/write through FILE handles. sprintf(), snprintf(), and
extensions like dprintf() would still be limited to producing
MBCS encoded text. But that might be OK.
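To make that limitation concrete, here is a rough sketch of what a program might do by hand today to get UTF-8 out of an snprintf() result. The helper names utf8_encode() and mbcs_to_utf8() are made up for illustration, and the sketch assumes wchar_t values are Unicode code points (implementations that define __STDC_ISO_10646__), which does not hold everywhere.

```c
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: encode one Unicode scalar value as UTF-8.
   Returns the number of bytes written, or 0 for an invalid value. */
static size_t utf8_encode(unsigned long cp, char out[4]) {
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0; /* Surrogates are not scalar values. */
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: convert a locale-encoded (MBCS) string, such as
   one produced by snprintf(), to UTF-8. Assumes wchar_t holds Unicode
   code points. Returns the UTF-8 length, or -1 on a conversion error
   or if dst is too small. */
static int mbcs_to_utf8(const char *src, char *dst, size_t dst_size) {
    mbstate_t state;
    size_t remaining = strlen(src);
    size_t used = 0;

    memset(&state, 0, sizeof state);
    while (remaining > 0) {
        wchar_t wc;
        char buf[4];
        size_t len;
        size_t n = mbrtowc(&wc, src, remaining, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return -1; /* Invalid or truncated multibyte sequence. */
        if (n == 0)
            break;     /* Null character reached. */
        len = utf8_encode((unsigned long)wc, buf);
        if (len == 0 || used + len >= dst_size)
            return -1; /* Unencodable value, or no room left for the terminator. */
        memcpy(dst + used, buf, len);
        used += len;
        src += n;
        remaining -= n;
    }
    if (used >= dst_size)
        return -1;
    dst[used] = '\0';
    return (int)used;
}
```

A caller would format into a temporary buffer with snprintf() and then pass that buffer to mbcs_to_utf8(). Anything the locale encoding cannot represent is already lost by that point, which is the same lossy MBCS step mentioned above.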
Please start a new email thread with an appropriate subject for
any further discussion (feel free to copy anything relevant from
this thread when doing so).
Tom.
Basically just fwide, but for encodings
typedef enum fCharacterSetModes {
    fCharacterSetMode_ReadValue    = 0,
    fCharacterSetMode_PrependBOM   = 1,
    fCharacterSetMode_Ascii        = 2,
    fCharacterSetMode_UTF8         = 3,
    fCharacterSetMode_UTF8_BOM     = 4,
    fCharacterSetMode_UTF16_LE     = 5,
    fCharacterSetMode_UTF16_LE_BOM = 6,
    fCharacterSetMode_UTF16_BE     = 7,
    fCharacterSetMode_UTF16_BE_BOM = 8,
    fCharacterSetMode_UTF32_LE     = 9,
    fCharacterSetMode_UTF32_LE_BOM = 10,
    fCharacterSetMode_UTF32_BE     = 11,
    fCharacterSetMode_UTF32_BE_BOM = 12,
    // Any higher value is reserved for implementations to define their own code page values.
} fCharacterSetModes;
fCharacterSetModes fUnicode(FILE *Stream, fCharacterSetModes Mode);
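
For comparison, this is how the existing fwide() behaves: a mode of 0 only queries the stream, and once an orientation is set, later calls cannot change it. Presumably fCharacterSetMode_ReadValue is meant to play the same role as mode 0, though that is an assumption, since the proposal does not spell out its semantics. The wrapper names query_orientation() and try_set_wide() are invented for this sketch; only fwide() itself is standard.

```c
#include <stdio.h>
#include <wchar.h>

/* Report a stream's orientation without changing it.
   fwide() with mode 0 is a pure query: the result is > 0 for
   wide-oriented, < 0 for byte-oriented, 0 for no orientation yet. */
static int query_orientation(FILE *stream) {
    return fwide(stream, 0);
}

/* Request wide orientation. Returns nonzero if the stream is
   wide-oriented afterwards. The request has no effect on a stream
   that already has an orientation. */
static int try_set_wide(FILE *stream) {
    return fwide(stream, 1) > 0;
}
```

On a freshly opened stream, query_orientation() returns 0, try_set_wide() succeeds, and a later fwide(stream, -1) still reports wide orientation.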