On Aug 31, 2025, at 2:21 PM, Tiago Freire <tmiguelf@hotmail.com> wrote:
> In Linux command line arguments, cout, and file names are UTF-8. If I get handed a file name on the command line, I can just use it to open a file.
No, they are not, although this is a quite common and understandable confusion. Displaying something and representing something are two related but different things. You can create files in Linux whose names contain invalid UTF-8 and invalid code points; the system just doesn't care at all. Whatever displays the name on screen, which is a different program, will try to interpret the sequence of bytes and display it as if it were Unicode, but other than convention there's nothing to say that it must be Unicode, and if the name fails to display, the system doesn't care. I could even create a different desktop environment that reads those names and displays something different. Displaying a thing is different from "the thing". In NTFS it's exactly the same thing, but with (slightly) more rules.
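To make the file-name point concrete, here is a minimal sketch, assuming a POSIX system where std::filesystem::path stores plain char. The names raw_name and create_with_raw_name are hypothetical, chosen only for this illustration. The helper tries to create a file whose name contains bytes that can never be well-formed UTF-8.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

// Hypothetical name for illustration: the bytes 0xFF and 0xFE never occur in
// well-formed UTF-8, so this name is not valid UTF-8 under any reading.
inline std::string raw_name() {
    return std::string("raw_") + "\xFF\xFE" + ".bin";
}

// Tries to create a file with that name inside `dir` and reports whether the
// file then exists. Assumes a POSIX system, where std::filesystem::path
// stores char and performs no conversion on the bytes it is given.
inline bool create_with_raw_name(const std::filesystem::path& dir) {
    const std::filesystem::path target = dir / raw_name();
    assert(target.filename().native() == raw_name());  // bytes kept as given
    {
        std::ofstream out(target, std::ios::binary);
        if (!out) return false;  // some filesystems do enforce naming rules
        out << "x";
    }
    std::error_code ec;
    const bool present = std::filesystem::exists(target, ec);
    std::filesystem::remove(target, ec);  // clean up the scratch file
    return present;
}
```

Calling create_with_raw_name(std::filesystem::temp_directory_path()) is expected to succeed on common Linux filesystems, although a filesystem configured to enforce UTF-8 names may refuse, which is why the helper returns false instead of assuming success.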
As for cout: cout has no underlying encoding, other than that data comes in octets. The terminal, which is itself a different application, decides how to interpret the data written to cout. It may be that some terminals decide to interpret the stream as UTF-8, but on the same system you can have different terminals assuming different encodings, and it's not unusual to have terminals that can even switch the interpretation of the stream between encodings on the fly. The only "encoding" that exists in cout is there by convention. Even on EBCDIC systems, nothing stops me from writing a terminal application that assumes UTF-8 streams, writing applications that send UTF-8 to cout, and having it displayed correctly in my UTF-8 terminal.
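A small sketch of the same point for narrow streams. The helpers emit_bytes and captured are hypothetical names, and std::ostringstream stands in for std::cout only so that the bytes the stream received can be inspected afterwards.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Writes the given bytes to a narrow stream exactly as they are. Nothing here
// names an encoding; the stream only sees a sequence of char values.
inline void emit_bytes(std::ostream& os, const std::string& bytes) {
    os.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
}

// Captures what the stream actually received, using std::ostringstream as a
// stand-in for std::cout (an assumption made so the result can be checked).
inline std::string captured(const std::string& bytes) {
    std::ostringstream sink;
    emit_bytes(sink, bytes);
    return sink.str();
}
```

Passing "\xC3\xA9" (the UTF-8 form of é) or "\xE9" (its Latin-1 form) gives back the same bytes in both cases. With emit_bytes(std::cout, ...) the stream treats both identically; what appears on screen is up to whichever terminal reads the output.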
In the C++ standard, file paths are neither UTF-8 nor UTF-16; they are an opaque type, mostly either char or wchar_t underneath, which is an implementation detail.
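For reference, the character type behind a path can be queried at compile time. This is only a sketch: path_unit, path_uses_char and path_uses_wchar are hypothetical names, and the static_assert encodes the assumption that the implementation picked one of the two usual types.

```cpp
#include <filesystem>
#include <type_traits>

// The native character type of a path is chosen by the implementation:
// typically char on POSIX systems and wchar_t on Windows.
using path_unit = std::filesystem::path::value_type;

// Assumption for this sketch: the implementation uses one of the two usual
// choices. The standard only says the type is implementation-defined.
static_assert(std::is_same_v<path_unit, char> ||
              std::is_same_v<path_unit, wchar_t>,
              "unexpected native path character type");

constexpr bool path_uses_char()  { return std::is_same_v<path_unit, char>; }
constexpr bool path_uses_wchar() { return std::is_same_v<path_unit, wchar_t>; }
```

Neither answer says anything about UTF-8 or UTF-16; the type only fixes the size of the units the path stores, not what they mean.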
std::cout and std::print are similar: no encoding is specified.
This is obviously where our opinions differ. (Which is fine.)
There are "std::vprint_unicode" and "std::vprint_nonunicode" in the standard, but it is my opinion that those should never have been added, and my argument is that they should be removed, because they rest on the exact same misunderstanding as above.