ISOCPP sg16 List: Re: Agenda for the 2024-02-21 SG16 meeting

From: Tiago Freire <tmiguelf_at_[hidden]>
Date: Tue, 20 Feb 2024 09:28:55 +0000

> That LWG4044 reads like somebody wrote it thinking of Windows first and adding the rest as an afterthought. I'd agree with Jonathan on the resolution but would like to adjust the approach:
> - For platforms that have separate methods for outputting Unicode and non-Unicode text, it should determine if the output is to a Unicode terminal and use the appropriate API, flushing the other API if necessary.
> - For platforms that have a single Unicode-compatible output, just use the output.
> but in legalese. Splitting platforms on whether or not they are Windows (except in a non-Windows way) first, and only then adding complexity required for those platforms, seems like the best way to help implementers avoid the complexity if it's not necessary. As with Corentin's email (that just came in), help platforms other than Windows avoid all complexity, and give Windows the space to do its runtime debugging hooks and required conversions for Unicode so it will work properly.

Actually, the problem is much, much worse; this is a major defect.
The way it is written seems to imply that one of them works with Unicode and the other doesn’t, when actually both of them, or neither of them, can.

You can change the way code points are interpreted on your Windows terminal.

You “can” use the command chcp 65001 to change the console to interpret the stream as UTF-8.
I mean you “used to”, until a couple of months ago, when Windows rolled out an update. Previously the console code page was inherited by newly started applications,
and the update broke that by always reverting to the default code page when a new application is created, effectively breaking all apps that relied on this feature (why, Windows? Why?).
So, depending on the version you are using you may or may not be able to see this.
But if you have the update, you can still change the code page for your application using the function “SetConsoleOutputCP(65001);”.
Or you can enable the “Use Unicode UTF-8 for worldwide language support” setting, hidden deep in the regional language settings of your OS.
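
As a rough sketch of the per-application route, assuming a Windows build and a console that is actually attached: the class name Utf8ConsoleOutput is made up for illustration, and I am assuming GetConsoleOutputCP returns 0 when it fails. On other platforms the guard deliberately does nothing.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical RAII helper (not a standard or Windows API): switches the
// console output code page to UTF-8 and restores the previous one on exit.
class Utf8ConsoleOutput {
public:
    Utf8ConsoleOutput() {
#ifdef _WIN32
        // Assumption: a return value of 0 means no usable console.
        previous_ = GetConsoleOutputCP();
        active_ = previous_ != 0 && SetConsoleOutputCP(CP_UTF8) != 0;
#endif
    }

    ~Utf8ConsoleOutput() {
#ifdef _WIN32
        if (active_) SetConsoleOutputCP(previous_);
#endif
    }

    Utf8ConsoleOutput(const Utf8ConsoleOutput&) = delete;
    Utf8ConsoleOutput& operator=(const Utf8ConsoleOutput&) = delete;

    // True only if this object actually changed the code page.
    bool active() const { return active_; }

private:
    unsigned previous_ = 0;
    bool active_ = false;
};
```

Even when active() is true, this only changes how the console interprets the bytes; nothing stops the program from writing bytes that are not UTF-8.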

WriteConsoleW doesn’t write to a Unicode stream, nor does it write to the same stream while converting the input to Unicode; it writes to a “console screen buffer” if one is available (you can create one yourself: https://learn.microsoft.com/en-us/windows/console/createconsolescreenbuffer). One is not necessarily always available (if you are using that API, you had better be testing for that).
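
One common way to test for that, sketched under the assumption that a successful GetConsoleMode call on the standard output handle is a good enough signal that a console is attached. The function name stdout_is_console is hypothetical, and the isatty branch is only there so the sketch also builds on non-Windows platforms.

```cpp
#ifdef _WIN32
#include <windows.h>
#else
#include <unistd.h>
#endif

// Hypothetical helper: reports whether standard output looks like a console.
bool stdout_is_console() {
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    if (out == nullptr || out == INVALID_HANDLE_VALUE) return false;
    DWORD mode = 0;
    // GetConsoleMode fails for files and pipes, so success is taken to
    // mean the handle refers to a console.
    return GetConsoleMode(out, &mode) != 0;
#else
    return isatty(STDOUT_FILENO) != 0;
#endif
}
```

Note that this says nothing about how the console will interpret what it receives, only whether one is there at all.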

If I’m creating my own custom console application, I can decide not to provide you one, and if I decide to provide you with one, there’s no guarantee that the buffer is interpreted as UTF-16; it is totally legal for me to interpret it as a different 2-byte encoding. The API doesn’t care, and doesn’t validate that your “UTF-16” data stream is actually valid UTF-16 (the file system has the exact same problem).
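
To make the “doesn’t validate” part concrete, here is a minimal sketch of the check the API does not perform for you, assuming C++17 and that “valid” means only that surrogates are correctly paired. The name is_well_formed_utf16 is made up for illustration.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: true if every surrogate code unit in `s` belongs to a
// correctly ordered high/low pair, i.e. the sequence is well-formed UTF-16.
bool is_well_formed_utf16(std::u16string_view s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 >= s.size()) return false;
            const char16_t next = s[i + 1];
            if (next < 0xDC00 || next > 0xDFFF) return false;
            ++i;  // skip the low surrogate that completes the pair
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            // Low surrogate with no high surrogate before it.
            return false;
        }
    }
    return true;
}
```

A sequence containing a lone 0xD800 is rejected by this function, yet it can be handed to a wide-character API as a sequence of doublets all the same.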

The point is, encoding of the stream is only a thing at the very last moment when your console application decides to print your sequence of bytes/doublets to the screen, and never before.
And there’s no guarantee anywhere (as far as C++ can control) that either of them is Unicode, or is not Unicode.

This was a major pain point in an application I was working on. The current references to Unicode in the standard are just flat-out wrong! Things don’t actually work this way.

Received on 2024-02-20 09:28:58