Date: Sun, 31 Aug 2025 12:21:30 +0000
> In Linux, command line arguments, cout, and file names are UTF-8. If I get handed a file name on the command line, I can just use it to open a file.
No, they are not.
Although this is quite a common and understandable confusion.
Displaying something and representing something are two related but different things.
You can create files in Linux with invalid utf-8 and invalid code points; the system just doesn't care at all.
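For illustration, a minimal sketch (assuming a Linux system, using plain POSIX open; the byte values are just an example):

    #include <fcntl.h>
    #include <unistd.h>

    int main()
    {
        // To the kernel a file name is just a sequence of bytes (anything
        // except '/' and NUL), so this name, which is not valid utf-8,
        // is accepted without complaint.
        const char name[] = "not_utf8_\xFF\xFE";
        int fd = ::open(name, O_CREAT | O_WRONLY, 0644);
        if (fd != -1)
            ::close(fd);
        return 0;
    }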
The program that displays something on screen, which is a different program, will try to interpret the sequence of bytes as if it were unicode, but other than convention there's nothing that says it must be unicode, and if it fails to display something it doesn't care.
And I can even create a different desktop environment that would be able to read those and display something different.
Displaying a thing is different from "the thing".
In NTFS it's much the same thing, just with slightly more rules.
As for cout: cout has no underlying encoding other than that data comes in octets.
The terminal, which is itself a different application, decides how to interpret the data put on cout.
It may be that some terminals decide to interpret the stream as utf-8, but on the same system you can have different terminals assuming different encodings, and it's not unusual to have terminals that can even change the interpretation of the stream between different encodings on the fly.
The only "encoding" that exists on cout is by convention.
Even in EBCDIC systems there's nothing stopping me from writing a terminal application that assumes utf-8 streams, writing applications that cout utf-8, and having it be displayed correctly in my utf-8 terminal.
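A small sketch of what I mean (the octet values are just an example):

    #include <iostream>

    int main()
    {
        // The two octets 0xC3 0xA9 are just data as far as cout is concerned.
        // A terminal interpreting the stream as utf-8 renders "é"; one set to
        // latin-1 renders "Ã©"; cout itself neither knows nor cares.
        std::cout << "\xC3\xA9" << '\n';
    }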
In the C++ standard, file paths are neither utf-8 nor utf-16; std::filesystem::path is an opaque type whose value type is mostly either char or wchar_t, an implementation detail.
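To make that concrete, a minimal sketch (what it prints is implementation-specific, which is exactly the point):

    #include <filesystem>
    #include <iostream>

    int main()
    {
        namespace fs = std::filesystem;
        // path::value_type is an implementation detail: char on POSIX-style
        // implementations, wchar_t on Windows. The path is an opaque sequence
        // of those units; no particular Unicode encoding is mandated.
        fs::path p = fs::path("dir") / "file.txt";
        std::cout << sizeof(fs::path::value_type) << '\n'; // typically 1 on Linux, 2 on Windows
        std::cout << p.string() << '\n';
    }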
std::cout and std::print are similar: no encoding is specified.
There's a "std::vprint_unicode" and non_unicode in the standard, but it is my opinion that those should have never been added to the standard, the argument is to remove them, because they make the exact same misunderstanding as above.
________________________________
From: Simon Schröder <wissensfrosch_at_[hidden]> on behalf of Simon Schröder <dr.simon.schroeder_at_[hidden]>
Sent: Sunday, August 31, 2025 8:59:11 AM
To: Tiago Freire <tmiguelf_at_[hidden]>
Cc: std-proposals_at_[hidden] <std-proposals_at_[hidden]>
Subject: Re: [std-proposals] char8_t aliasing and Unicode
> On Aug 30, 2025, at 9:42 PM, Tiago Freire <tmiguelf_at_[hidden]> wrote:
>
> I hope you don't mind, I'm going to migrate the topic to a new thread since this is diverging from floating point type aliasing, which is a different thing from Unicode support.
>
> I think there is an important detail that is overlooked here:
>
>> This is why I advocate for char8_t over char16_t for functions.
>
> What char8_t or char16_t functions?
>
> As far as I know, there aren't many APIs that are even text, much less Unicode.
> Sure, you have file system paths and std::cout, but those are not Unicode; there are no "char8_t or char16_t" in this domain, even if we like to pretend that it is.
> You have some text conversion facilities, the functions to convert between encodings, those are fine, the standard can deal with those without a problem.
>
> If it's not one of those 3 categories or similar (e.g. program arguments, environment variables, debug symbols; which don't exist in unicode), frankly speaking I don't want text in my APIs.
>
> I don't know of the problem you speak of for which the standard should express a preference of one over the other.
> Can you be more concrete here?
>
When I started programming (late 90s, early 00s) I was perfectly fine with my local code page on Windows. Now that I write commercial software, this is not enough. And having used Linux: everything is just UTF-8 and I don’t have to think much about character encodings anymore.
In Linux, command line arguments, cout, and file names are UTF-8. If I get handed a file name on the command line, I can just use it to open a file. And I can also print it on the command line. This works for every file name that might be used on the system. We already had random crashes because somebody (in Germany, with the proper code page) used umlauts in their file path on Windows.
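Concretely, this is the kind of thing I mean, as a minimal sketch assuming a Linux system:

    #include <fstream>
    #include <iostream>

    int main(int argc, char* argv[])
    {
        if (argc < 2)
            return 1;
        // Whatever bytes arrive in argv[1] can be passed straight through to
        // the filesystem and to stdout; no conversion step is needed.
        std::ifstream file(argv[1]);
        std::cout << argv[1] << '\n';
        return file ? 0 : 1;
    }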
Up to this point, what I have written only means that any Unicode encoding would be useful. But I would say we should pick only one to be the regular one supported everywhere. No novice programmer should be forced to choose between different encodings, but there should be guidance on the preferred one. Currently, std functions support char and wchar_t, neither of which makes guarantees about Unicode (which is why we have charN_t). It would be helpful to extend those std functions to support at least one Unicode encoding. We could add three Unicode encodings, but that would be a lot more work. Does C++ already have conversion functions between UTF-8/16/32? Conversion should be possible within the C++ standard library.
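The conversions themselves are mechanical. As a rough sketch of one direction (a hypothetical encode_utf8 helper, not an existing standard facility, and skipping validation of surrogates and the upper code point bound):

    #include <string>

    std::u8string encode_utf8(char32_t cp)
    {
        std::u8string out;
        if (cp <= 0x7F) {                              // 1-byte sequence
            out += static_cast<char8_t>(cp);
        } else if (cp <= 0x7FF) {                      // 2-byte sequence
            out += static_cast<char8_t>(0xC0 | (cp >> 6));
            out += static_cast<char8_t>(0x80 | (cp & 0x3F));
        } else if (cp <= 0xFFFF) {                     // 3-byte sequence
            out += static_cast<char8_t>(0xE0 | (cp >> 12));
            out += static_cast<char8_t>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char8_t>(0x80 | (cp & 0x3F));
        } else {                                       // 4-byte sequence
            out += static_cast<char8_t>(0xF0 | (cp >> 18));
            out += static_cast<char8_t>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char8_t>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char8_t>(0x80 | (cp & 0x3F));
        }
        return out;
    }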
When I write text files I want UTF-8. This saves a lot of headaches, and most of the world has moved to UTF-8 for text files already. Also, this implies that I want to be able to read text files as UTF-8. Now that my strings (read from a text file) are already UTF-8, it makes a lot of sense to choose UTF-8 as the preferred Unicode encoding. If everything is UTF-8 I have far fewer problems. Windows can continue to use wchar_t (when we are using non-portable Windows APIs or legacy C++ standard functions) and IBM can continue to use char (???) for EBCDIC. Certainly, the C++ standard library would have to translate to OS-specific encodings for file system APIs and terminal APIs. Those who care about performance can still use the OS APIs or regular char and wchar_t.
std::filesystem already solves a lot of the problems for file handling. But writing to the console is still a problem. If you are using different libraries, strings are still a problem when using GUIs. E.g., libraries might hand down error strings (through std::expected or exceptions) that we might want to display to the user. Currently, this is a mess (I haven’t run into this problem personally, but I see many potential problems). A single (preferred) Unicode encoding would make things easier.
I prefer not to constantly convert between encodings or even (in the normal case) have to think about encodings at all. I want to have my command line parameters as Unicode and be able to use them for filenames and write them to cout (or maybe something named differently). Combined with the considerations about reading/writing text files, the best solution IMHO is UTF-8.
Libraries should then be written to work seamlessly for the user. The user is not supposed to adapt to conventions of libraries that made a bad choice (UCS-2) years ago. This doesn’t mean that libraries need to change their internal representation, but they should seamlessly convert to UTF-8 (if that is the Unicode encoding we pick).
Received on 2025-08-31 12:21:34