> Tom Honermann wrote:
>> On 5/8/24 12:54 PM, Victor Zverovich wrote:
>>>> The ASCII and EBCDIC code page based locale programming model used on POSIX and Windows systems is not broken.
>>> It is actually broken on Windows for reasons explained in https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2093r14.html.
>> I'm not sure what you are referring to as broken. If you are referring to characters not being displayed correctly in the Windows console due to the console using a code page that is not aligned with the locale/environment by default (because of backward compatibility with old DOS applications), then yes, that is broken, but it is broken due to the inconsistent encoding selection, not due to the code page based programming model. The same behavior would be exhibited on Linux if the terminal encoding was changed to CP437.
> There's so much broken with the code page model, honestly.

It is important to distinguish between the code page model and the standard library.
I heartily agree that the standard library is broken in multiple
ways, some of which you describe below.

> First and least important, a code page is wchar_t[256], which doesn't really match what important parts of the world use. This is burned into our ctype::widen model, making it not very useful.
>
> Second, every locale facet can have its own code page; there's no guarantee that those match or agree.

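To make the ctype::widen point above concrete, here is a minimal sketch. The helper name widen_bytewise is hypothetical; the only library facility used is std::ctype<wchar_t>::widen from the classic "C" locale. Because widen maps exactly one char to one wchar_t, a two-byte UTF-8 sequence comes out as two wide characters rather than one.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: widens each byte of `narrow` independently using the
// classic "C" locale. std::ctype<wchar_t>::widen is a one-char-to-one-wchar_t
// mapping, so it cannot combine a multibyte sequence into one wide character.
std::wstring widen_bytewise(const std::string& narrow) {
    const std::ctype<wchar_t>& facet =
        std::use_facet<std::ctype<wchar_t>>(std::locale::classic());
    std::wstring wide(narrow.size(), L'\0');
    if (!narrow.empty()) {
        // Range overload: writes exactly one wchar_t per input char.
        facet.widen(narrow.data(), narrow.data() + narrow.size(), &wide[0]);
    }
    return wide;
}
```

Whatever values the classic locale assigns to bytes above 0x7F, the result always has one wide character per input byte, which is the mismatch with multibyte encodings.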
Agreed. POSIX states that mixing character sets across locale categories is undefined behavior.
"If different character sets are used by the locale categories,
the results achieved by an application utilizing these categories
are undefined. Likewise, if different codesets are used for the
data being processed by interfaces whose behavior is dependent on
the current locale, or the codeset is different from the codeset
assumed when the locale was created, the result is also
undefined."
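A program can at least detect the situation the POSIX text describes. The following is a rough sketch under stated assumptions: the helper names codeset_of and locale_categories_share_codeset are hypothetical, and the parsing assumes the conventional language_territory.codeset@modifier locale name format, which is common on POSIX systems but not guaranteed. It also compares codeset names as plain strings, so spellings such as "utf8" and "UTF-8" would be treated as different.

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: extracts the codeset part of a locale name of the
// assumed form "language_territory.codeset@modifier". Returns an empty
// string when the name carries no codeset (for example "C").
std::string codeset_of(const std::string& locale_name) {
    const std::string::size_type dot = locale_name.find('.');
    if (dot == std::string::npos) return std::string();
    const std::string::size_type at = locale_name.find('@', dot);
    const std::string::size_type end =
        (at == std::string::npos) ? locale_name.size() : at;
    return locale_name.substr(dot + 1, end - dot - 1);
}

// Hypothetical helper: queries the current setting of each standard C locale
// category and reports whether all of them name the same codeset.
bool locale_categories_share_codeset() {
    const int categories[] = {LC_CTYPE, LC_COLLATE, LC_MONETARY,
                              LC_NUMERIC, LC_TIME};
    std::string expected;
    bool first = true;
    for (int category : categories) {
        // A null second argument queries the category without changing it.
        const char* name = std::setlocale(category, nullptr);
        const std::string codeset = codeset_of(name ? name : "");
        if (first) {
            expected = codeset;
            first = false;
        } else if (codeset != expected) {
            return false;
        }
    }
    return true;
}
```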

> Third, there's additionally an "active ANSI code page" under Windows, which may also fail to match any of the locale facets.
>
> Fourth, there's the console code page, which almost certainly doesn't match any of the above.
>
> Fifth, user input (under Windows) and file names (on NTFS, HFS+) use UTF-16, and as a result, may not even be representable in any of the above code pages.

That is true on POSIX systems as well. File names do not have strongly associated encodings. Windows doesn't enforce valid UTF-16 file names.

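Since file names are not guaranteed to be valid UTF-16, code that wants to transcode them has to check first. A minimal sketch of such a check follows; the function name is_well_formed_utf16 is hypothetical, and the only rule it applies is that every high surrogate (0xD800 to 0xDBFF) must be immediately followed by a low surrogate (0xDC00 to 0xDFFF), with no stray low surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true when `s` contains no unpaired surrogate
// code units, that is, when it is well-formed UTF-16.
bool is_well_formed_utf16(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t unit = s[i];
        if (unit >= 0xD800 && unit <= 0xDBFF) {
            // High surrogate: a low surrogate must follow immediately.
            if (i + 1 >= s.size()) return false;
            const char16_t next = s[i + 1];
            if (next < 0xDC00 || next > 0xDFFF) return false;
            ++i;  // Skip the low surrogate that completes the pair.
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
            // Low surrogate without a preceding high surrogate.
            return false;
        }
    }
    return true;
}
```

A name that fails this check has no UTF-8 representation either, so a program has to decide whether to reject it, substitute a replacement character, or carry it around as opaque code units.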
> Sixth, the literal encoding is burned into the executable at compile time, but all of the above may arbitrarily vary at runtime.
>
> It's possible to make all this somehow work in a controlled environment, but it's not possible to write code that is robust against environmental changes.

I find the latter statement disproven by the large amount of code that is so written and deployed all around the world.

> As I said before, we should take care of not breaking existing working code page-based code, but we shouldn't invest any effort in trying to make it possible for new code - which we know is new because it uses Unicode character types - to be written against a code page-based model.

I disagree. I think new character types can be particularly useful in environments that cannot move to UTF-8.

> Our target should be code that does input encoding -> (program uses intermediate UTF-8 throughout) -> output encoding

I strongly agree that we should enable and promote that model.

> and, for compatibility's sake, we should specify the intermediate encoding as the narrow literal encoding, with the expectation that people who want to have reliably working programs will set their narrow literal encoding to UTF-8.

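The input encoding -> UTF-8 -> output encoding flow described above can be sketched as follows. This is only an illustration: ISO-8859-1 is picked as a stand-in for "some code page" because its mapping to Unicode is trivial, the function names latin1_to_utf8 and utf8_to_latin1 are hypothetical, and the output side substitutes '?' for anything Latin-1 cannot represent, which is one possible policy among several.

```cpp
#include <cstddef>
#include <string>

// Input boundary (hypothetical helper): decode ISO-8859-1 bytes into UTF-8.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    for (char c : in) {
        const unsigned char b = static_cast<unsigned char>(c);
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));
        } else {
            // U+0080..U+00FF always encode as two UTF-8 bytes.
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}

// Output boundary (hypothetical helper): encode UTF-8 as ISO-8859-1,
// writing '?' for each sequence that Latin-1 cannot represent.
std::string utf8_to_latin1(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));
            ++i;
            continue;
        }
        // Lead bytes 0xC2 and 0xC3 cover exactly U+0080..U+00FF.
        if ((b == 0xC2 || b == 0xC3) && i + 1 < in.size() &&
            (static_cast<unsigned char>(in[i + 1]) & 0xC0) == 0x80) {
            const unsigned char t = static_cast<unsigned char>(in[i + 1]);
            out.push_back(static_cast<char>(((b & 0x03) << 6) | (t & 0x3F)));
            i += 2;
            continue;
        }
        // Not representable (or malformed): substitute and skip the
        // continuation bytes that belong to this sequence.
        out.push_back('?');
        ++i;
        while (i < in.size() &&
               (static_cast<unsigned char>(in[i]) & 0xC0) == 0x80) {
            ++i;
        }
    }
    return out;
}
```

Everything between the two boundary functions can then treat strings as UTF-8 regardless of what the environment uses.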
The "narrow literal encoding" (informally since it isn't a
defined term) encompasses both "" and u8"" these days. See [lex.charset]p8
and [basic.fundamental]p7
and
This keeps neglecting the basic fact that there are
implementations and ecosystems that cannot adopt what you are
suggesting. Not now, not in the near term, probably never.
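To make the distinction between "" and u8"" literals concrete, here is a small sketch. The function names are hypothetical. It relies only on the fact that a u8 literal is always UTF-8 encoded, while the bytes of an ordinary literal are fixed at compile time by whatever ordinary literal encoding the implementation was configured with.

```cpp
#include <cstddef>

// Length in bytes (excluding the terminating NUL) of U+00E9 written as a
// u8 literal. u8 literals are always UTF-8, so this is 2 on every
// conforming implementation.
constexpr std::size_t utf8_literal_length() {
    return sizeof(u8"\u00E9") - 1;
}

// Length in bytes of the same character written as an ordinary literal.
// This depends on the ordinary literal encoding chosen at compile time:
// 2 for UTF-8, 1 for a single-byte code page such as Windows-1252.
constexpr std::size_t ordinary_literal_length() {
    return sizeof("\u00E9") - 1;
}

// Hypothetical probe: reports whether the ordinary literal encoding produced
// the UTF-8 byte sequence 0xC3 0xA9 for U+00E9.
bool ordinary_literal_is_utf8() {
    const char ordinary[] = "\u00E9";
    return sizeof(ordinary) == 3 &&
           static_cast<unsigned char>(ordinary[0]) == 0xC3 &&
           static_cast<unsigned char>(ordinary[1]) == 0xA9;
}
```

On an implementation whose ordinary literal encoding is not UTF-8, the probe returns false, and nothing at runtime can change the bytes already baked into the executable.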

> Under Windows, this enables code page-based code subject to the constraints that the narrow literal encoding, the ANSI code page, and the locale all agree.

Agreed.
Tom.