
std-proposals


Re: [std-proposals] char8_t aliasing and Unicode

From: Simon Schröder <dr.simon.schroeder_at_[hidden]>
Date: Sun, 31 Aug 2025 08:58:56 +0200
> On Aug 30, 2025, at 9:42 PM, Tiago Freire <tmiguelf_at_[hidden]> wrote:
>
> I hope you don't mind, I'm going to migrate the topic to a new thread since this is diverging from floating point type aliasing, which is a different thing from Unicode support.
>
> I think there is an important detail that is overlooked here:
>
>> This is why I advocate for char8_t over char16_t for functions.
>
> What char8_t or char16_t functions?
>
> As far as I know, there aren't many APIs that are even text, much less Unicode.
> Sure, you have file system paths and std::cout, but those are not Unicode; there is no "char8_t or char16_t" in this domain, even if we like to pretend there is.
> You have some text conversion facilities, the functions to convert between encodings, those are fine, the standard can deal with those without a problem.
>
> If it's not one of those 3 categories or similar (ex. program arguments, environment variables, debug symbols; which don't exist in Unicode), frankly speaking I don't want text in my APIs.
>
> I don't know of the problem of which you speak in which the standard should provide preference of one over the other.
> Can you be more concrete here?
>
When I started programming (late '90s/early '00s) I was perfectly fine with my local code page on Windows. Now that I write commercial software, this is not enough. And having used Linux: everything is just UTF-8, and I don't have to think much about character encodings anymore.

On Linux, command-line arguments, cout, and file names are UTF-8. If I get handed a file name on the command line, I can just use it to open a file. And I can also print it on the command line. This works for every file name that might be used on the system. We already had random crashes because somebody (in Germany, with the proper code page) used umlauts in their file path on Windows.

Up to this point, everything I have written only means that any Unicode encoding would be useful. But I would say we should pick just one to be the regular encoding, supported everywhere. No novice programmer should be forced to choose between different encodings; there should be guidance on the preferred one. Currently, std functions support char and wchar_t, neither of which makes guarantees about Unicode (which is why we have charN_t). It would be helpful to extend those std functions to support at least one Unicode encoding. We could add all three Unicode encodings, but that would be a lot more work. Does C++ already have conversion functions between UTF-8/16/32? Conversion should be possible within the C++ standard library.

When I write to text files I want UTF-8. This saves a lot of headaches, and most of the world has moved to UTF-8 for text files already. This also implies that I want to be able to read text files as UTF-8. Once my strings (read from a text file) are already UTF-8, it makes a lot of sense to choose UTF-8 as the preferred Unicode encoding. If everything is UTF-8 I have far fewer problems. Windows can continue to use wchar_t (when we are using non-portable Windows APIs or legacy C++ standard functions) and IBM can continue to use char (???) for EBCDIC. Certainly, the C++ standard library would have to translate to OS-specific encodings for file system APIs and terminal APIs. Those who care about performance can still use the OS APIs or regular char and wchar_t.

std::filesystem already solves a lot of the problems for file handling. But writing to the console is still a problem. If you are using different libraries, strings are still a problem, e.g. when using a GUI: libraries might hand down error strings (through std::expected or exceptions) that we might want to display to the user. Currently, this is a mess (I haven't run into this problem personally, but I see many potential problems). A single (preferred) Unicode encoding would make things easier.

I prefer not to have to constantly convert between encodings, or even (in the normal case) have to think about encodings at all. I want to receive my command-line parameters as Unicode and be able to use them for file names and write them to cout (or maybe something named differently). Given the requirement of reading/writing text files, the best solution IMHO is UTF-8.

Libraries should then be written to work seamlessly for the user. The user is not supposed to adapt to the conventions of libraries that made a bad choice (UCS-2) years ago. This doesn't mean that libraries need to change their internal representation, but they should convert seamlessly to UTF-8 (if that is the Unicode encoding we pick).

Received on 2025-08-31 06:59:10