On 5/8/24 2:23 PM, Peter Dimov wrote:

Tom Honermann wrote:

On 5/8/24 12:54 PM, Victor Zverovich wrote:


	> The ASCII and EBCDIC code page based locale programming model
used on POSIX and Windows systems is not broken.

	It is actually broken on Windows for reasons explained in
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2093r14.html.

I'm not sure what you are referring to as broken. If you are referring to
characters not being displayed correctly in the Windows console due to the
console using a code page that is not aligned with the locale/environment by
default (because of backward compatibility with old DOS applications), then
yes, that is broken, but it is broken due to the inconsistent encoding selection,
not due to the code page based programming model. The same behavior
would be exhibited on Linux if the terminal encoding was changed to CP437.

There's so much broken with the code page model, honestly.

It is important to distinguish between the code page model and the standard library.

I heartily agree that the standard library is broken in multiple ways, some of which you describe below.


First and least important, a code page is wchar_t[256], which doesn't really
match what important parts of the world use. This is burned into our
ctype::widen model, making it not very useful.
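
To illustrate the ctype::widen point, here is a minimal sketch. The helper name widen_each is hypothetical and not part of any library. It uses the classic "C" locale by default and shows only the shape of the interface: one char in, one wchar_t out.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: widen a narrow string one char at a time using the
// ctype<wchar_t> facet of the given locale.
// ctype::widen maps exactly one char to exactly one wchar_t, so it cannot
// treat a multi-byte sequence (e.g. two UTF-8 code units) as one character.
std::wstring widen_each(const std::string& narrow,
                        const std::locale& loc = std::locale::classic()) {
    const auto& facet = std::use_facet<std::ctype<wchar_t>>(loc);
    std::wstring wide;
    wide.reserve(narrow.size());
    for (char c : narrow) {
        wide.push_back(facet.widen(c));
    }
    return wide;
}
```

Whatever the locale, the result always has as many elements as the input has bytes, which is the limitation for multi-byte encodings.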

Second, every locale facet can have its own code page, there's no guarantee
that those match, or agree.

Agreed. POSIX states that mixing character sets across locale categories results in undefined behavior.

"If different character sets are used by the locale categories, the results achieved by an application utilizing these categories are undefined. Likewise, if different codesets are used for the data being processed by interfaces whose behavior is dependent on the current locale, or the codeset is different from the codeset assumed when the locale was created, the result is also undefined."


Third, there's additionally an "active ANSI code page" under Windows, which
can also not match any of the locale facets.

Fourth, there's the console code page, which almost certainly doesn't match
any of the above.

Fifth, user input (under Windows) and file names (on NTFS, HFS+) use
UTF-16, and as a result, may not even be representable in any of the above
code pages.

That is true on POSIX systems as well. File names do not have strongly associated encodings. Windows doesn't enforce valid UTF-16 file names.


Sixth, the literal encoding is burned into the executable at compile time,
but all of the above may arbitrarily vary at runtime.
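
A small sketch of what "burned in at compile time" looks like in practice. The variable names are illustrative only. The character U+00E9 is written as a universal-character-name so that the encoding of the source file itself does not matter. It assumes the ordinary literal encoding can represent U+00E9 at all; otherwise the first literal may be rejected or substituted by the compiler.

```cpp
#include <cstddef>

// Number of code units (excluding the terminating null) that U+00E9 occupies
// in an ordinary string literal. The value is fixed when the translation unit
// is compiled and depends on the compiler's ordinary literal encoding
// (for example 2 for UTF-8, 1 for a Latin-1 style code page).
constexpr std::size_t ordinary_units = sizeof("\u00e9") - 1;

// The same character in a u8 literal is always UTF-8: two code units.
constexpr std::size_t utf8_units = sizeof(u8"\u00e9") - 1;

static_assert(utf8_units == 2, "U+00E9 is two UTF-8 code units");
```

Nothing that happens at runtime (locale changes, code page changes) can alter either value.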

It's possible to make all this somehow work in a controlled environment,
but it's not possible to write code that is robust against environmental
changes.

I find the latter statement disproven by the large amount of code that is so written and deployed all around the world.


As I said before, we should take care of not breaking existing working
code page-based code, but we shouldn't invest any effort in trying to make
it possible for new code - which we know is new because it uses Unicode
character types - to be written against a code page-based model.

I disagree. I think new character types can be particularly useful in environments that cannot move to UTF-8.


Our target should be code that does

input encoding -> (program uses intermediate UTF-8 throughout) -> output encoding

I strongly agree that we should enable and promote that model.
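
The model above can be sketched as a pair of boundary conversions. This is only an illustration under stated assumptions: the external encoding is taken to be ISO-8859-1 (chosen because the mapping is trivial), the function names latin1_to_utf8 and utf8_to_latin1 are hypothetical, and unrepresentable or malformed input is replaced with '?' on output. A real program would use a conversion library covering the encodings it actually needs.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Input boundary: decode ISO-8859-1 bytes into UTF-8.
// Every Latin-1 byte is the code point of the same value (U+0000..U+00FF).
std::string latin1_to_utf8(std::string_view in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char b : in) {
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));
        } else {
            // Code points U+0080..U+00FF take two UTF-8 code units.
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}

// Output boundary: encode UTF-8 as ISO-8859-1.
// Anything outside U+0000..U+00FF (or malformed) becomes a single '?'.
std::string utf8_to_latin1(std::string_view in) {
    auto is_continuation = [](char c) {
        return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
    };
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));
            ++i;
        } else if ((b == 0xC2 || b == 0xC3) && i + 1 < in.size() &&
                   is_continuation(in[i + 1])) {
            // Two-unit sequences with lead byte C2/C3 cover U+0080..U+00FF.
            out.push_back(static_cast<char>(((b & 0x03) << 6) |
                                            (in[i + 1] & 0x3F)));
            i += 2;
        } else {
            // Not representable in Latin-1: substitute and skip the sequence.
            out.push_back('?');
            ++i;
            while (i < in.size() && is_continuation(in[i])) ++i;
        }
    }
    return out;
}
```

Everything between the two calls can then assume UTF-8, regardless of what the environment uses at either end.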


and, for compatibility's sake, we should specify the intermediate
encoding as the narrow literal encoding, with the expectation that people
who want to have reliably working programs will set their narrow literal
encoding to UTF-8.

The "narrow literal encoding" (informally, since it isn't a defined term) encompasses both "" and u8"" these days. See [lex.charset]p8 and [basic.fundamental]p7 and

This keeps neglecting the basic fact that there are implementations and ecosystems that cannot adopt what you are suggesting. Not now, not in the near term, probably never.


Under Windows, this enables code page-based code subject to the
constraints that the narrow literal encoding, the ANSI code page,
and the locale all agree.

Agreed.

Tom.