Re: [std-proposals] char8_t aliasing and Unicode

From: Simon Schröder <dr.simon.schroeder_at_[hidden]>
Date: Sun, 31 Aug 2025 15:56:04 +0200
> On Aug 31, 2025, at 2:21 PM, Tiago Freire <tmiguelf_at_[hidden]> wrote:
>
> 
> > In Linux command line arguments, cout, and file names are UTF-8. If I get handed a file name on the command line, I can just use it to open a file.
>
> No, they are not.
> Although this is a quite common and understandable confusion.
> Displaying something and representing something are 2 related but different things.
> You can create files in Linux with invalid utf-8 and invalid code points, the system just doesn't care at all.
> Displaying something on screen is done by a different program, which will try to interpret the sequence of bytes and display it as if it were Unicode; but other than convention there's nothing to say that it must be Unicode, and if it fails to display, it doesn't care.
> And I can even create a different desktop environment that would be able to read those and display something different.
> Displaying a thing is different from "the thing".
> In NTFS it's exactly the same thing but with more rules (slightly).

I would claim that, for all intents and purposes, filenames are Unicode. At least this is true for the files the user interacts with: with normal interfaces, using your keyboard, nobody will create non-Unicode filenames. We already have a de facto convention to use Unicode filenames, because we use file browsers and the list command on the terminal to show the contents of a folder. All this software assumes Unicode, and so in practice filenames are Unicode (or can be converted to Unicode).
>
> As for cout. cout has no underlying encoding other than data comes in octets.
> The terminal, which is itself a different application decides how to interpret the data put on the cout.
> It maybe that some terminals decide to interpret the stream as utf-8, but in the same system you can have different terminals assuming different encodings, and it's not unusual to have terminals that can even change the interpretation of the stream between different encodings on the fly.
> The only "encoding" that exists in the cout is only by convention.
> Even in EBCDIC systems there's nothing that is stopping me from writing a terminal application that assumes utf-8 streams and write applications that cout utf-8 and have it be displayed correctly in my utf-8 terminal.

So far I have used cout as a generalization; I would not propose to actually change the meaning of std::cout based on char. To be more precise: currently we have cout and wcout. We could add an overload for at least one Unicode encoding (or even one unified interface that allows all Unicode encodings). One of C++'s tenets is to never break existing code, so cout and wcout should always keep working the same. We might also want to keep them around for performance reasons (you don't pay for what you don't use). It would be easy to require cout to use some form of Unicode output under the hood, but that would mean that other encodings would always have to be converted. This would not fit C++, and we shouldn't do it.
>
>
> In the C++ standard filepaths are neither utf-8 nor utf-16; there is an opaque type, mostly either char or wchar_t, as an implementation detail.

cppreference.com states that filepaths internally use char on POSIX and wchar_t on Windows. Again, for all intents and purposes this works with all valid Unicode filenames (and, as you say, even with more filenames than that). The reason I mentioned that std::filesystem works with Unicode is that, e.g., std::filesystem::path’s constructor takes all the character types, which includes the ones specific to Unicode encodings. This means that if you want to work with “Unicode-only” filenames, C++ currently provides good support for that. At least for non-embedded systems, I don’t see why we would want anything other than that.

> std::cout or std::print, similar, no encoding specified.

You are right that cout does not specify the encoding. However, according to cppreference.com, std::print provides specific (but not exclusive) support for UTF-8. Effectively, on all major OSes there is a way to write Unicode to the terminal, and std::print provides a unified way to use it. One role the standard library plays is to give users unified access to platform-specific features, so that they can write portable code and don’t have to reinvent the wheel over and over again. Unicode solves a lot of problems in theory; in practice, the discussion about Unicode encodings prevents Unicode from actually solving these problems. (Yes, Unicode does not solve ALL problems, but it is currently the best solution we have.)
>
> There's a "std::vprint_unicode" and non_unicode in the standard, but it is my opinion that those should have never been added to the standard, the argument is to remove them, because they make the exact same misunderstanding as above.
>
This is obviously where our opinions differ. (Which is fine.)
>

Received on 2025-08-31 13:56:18