ISOCPP sg16 List: Re: [isocpp-sg16] Follow up on SG16 review of P2996R2 (Reflection for C++26)

From: Peter Dimov <pdimov_at_[hidden]>
Date: Wed, 8 May 2024 21:23:44 +0300

Tom Honermann wrote:
> On 5/8/24 12:54 PM, Victor Zverovich wrote:
>
>
> > The ASCII and EBCDIC code page based locale programming model
> > used on POSIX and Windows systems is not broken.
>
> It is actually broken on Windows for reasons explained in
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2093r14.html.
>
> I'm not sure what you are referring to as broken. If you are referring to
> characters not being displayed correctly in the Windows console due to the
> console using a code page that is not aligned with the locale/environment by
> default (because of backward compatibility with old DOS applications), then
> yes, that is broken, but it is broken due to the inconsistent encoding selection,
> not due to the code page based programming model. The same behavior
> would be exhibited on Linux if the terminal encoding was changed to CP437.

There's so much broken with the code page model, honestly.

First and least important, a code page is wchar_t[256], a table with one
wide character per byte value, which doesn't really match the multibyte
encodings that important parts of the world use. This is burned into our
ctype::widen model, making it not very useful.

Second, every locale facet can have its own code page; there's no guarantee
that those match or agree.

Third, there's additionally an "active ANSI code page" under Windows, which
need not match any of the locale facets either.

Fourth, there's the console code page, which almost certainly doesn't match
any of the above.

Fifth, user input (under Windows) and file names (on NTFS, HFS+) use
UTF-16, and as a result, may not even be representable in any of the above
code pages.

Sixth, the literal encoding is burned into the executable at compile time,
but all of the above may arbitrarily vary at runtime.
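
To illustrate the first point, here is a minimal sketch of what the
ctype::widen model offers: exactly one wchar_t per input char. The helper
name widen_bytes is made up for this example, and the classic "C" locale is
used only to keep the sketch self-contained.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: widen every byte of `narrow` on its own, using the
// ctype<wchar_t> facet of `loc`. No multibyte sequence is ever decoded.
std::wstring widen_bytes(const std::string& narrow,
                         const std::locale& loc = std::locale::classic())
{
    const std::ctype<wchar_t>& facet =
        std::use_facet<std::ctype<wchar_t>>(loc);

    // One output element per input byte; that is all the interface allows.
    std::wstring wide(narrow.size(), L'\0');
    if (!narrow.empty()) {
        facet.widen(narrow.data(), narrow.data() + narrow.size(), &wide[0]);
    }
    return wide;
}
```

Given the UTF-8 bytes C3 A9 (a single U+00E9), widen_bytes returns two wide
characters, because each byte is widened in isolation.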

It's possible to make all this somehow work in a controlled environment,
but it's not possible to write code that is robust against environmental
changes.
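
As a rough illustration of the sixth point, the following sketch probes the
narrow literal encoding at compile time. The names is_utf8_e_acute and
narrow_literals_are_utf8 are hypothetical, and the probe looks at a single
character only, so it is a heuristic and not a real encoding query. It also
assumes that the literal encoding can represent U+00E9 at all.

```cpp
#include <cstddef>

// Hypothetical probe: true if `lit` holds exactly the UTF-8 encoding of
// U+00E9 (bytes C3 A9) followed by the terminating null.
template <std::size_t N>
constexpr bool is_utf8_e_acute(const char (&lit)[N])
{
    return N == 3
        && static_cast<unsigned char>(lit[0]) == 0xC3
        && static_cast<unsigned char>(lit[1]) == 0xA9;
}

// Evaluated by the compiler: the result is fixed in the executable and
// cannot react to anything that changes at runtime.
constexpr bool narrow_literals_are_utf8 = is_utf8_e_acute("\u00E9");
```

The value of narrow_literals_are_utf8 is decided by compiler settings (for
example /utf-8 with MSVC or -fexec-charset with GCC), whereas the locale, the
ANSI code page and the console code page are only known once the program runs.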

As I said before, we should take care of not breaking existing working
code page-based code, but we shouldn't invest any effort in trying to make
it possible for new code - which we know is new because it uses Unicode
character types - to be written against a code page-based model.

Our target should be code that does

input encoding -> (program uses intermediate UTF-8 throughout) -> output encoding

and, for compatibility's sake, we should specify the intermediate
encoding as the narrow literal encoding, with the expectation that people
who want to have reliably working programs will set their narrow literal
encoding to UTF-8.
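
A minimal sketch of that shape, assuming ISO-8859-1 as both the input and the
output encoding purely for illustration (the function names are made up, and
a real program would convert from whatever its environment actually uses):

```cpp
#include <cstddef>
#include <string>

// Input boundary: decode ISO-8859-1 (the assumed input encoding) to UTF-8.
std::string latin1_to_utf8(const std::string& in)
{
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);                  // ASCII: unchanged
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));    // lead byte
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation byte
        }
    }
    return out;
}

// Output boundary: encode UTF-8 as ISO-8859-1. Anything that is malformed
// or not representable in the output encoding becomes '?'.
std::string utf8_to_latin1(const std::string& in)
{
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        const unsigned char next =
            i + 1 < in.size() ? static_cast<unsigned char>(in[i + 1]) : 0;
        if (c < 0x80) {
            out += static_cast<char>(c);
            i += 1;
        } else if ((c == 0xC2 || c == 0xC3) && (next & 0xC0) == 0x80) {
            // Two-byte sequences for U+0080..U+00FF fit in ISO-8859-1.
            out += static_cast<char>(((c & 0x03) << 6) | (next & 0x3F));
            i += 2;
        } else {
            out += '?';
            i += 1;
            // Skip the continuation bytes of the sequence we gave up on.
            while (i < in.size()
                   && (static_cast<unsigned char>(in[i]) & 0xC0) == 0x80) {
                ++i;
            }
        }
    }
    return out;
}
```

Everything between the two calls works on UTF-8 only. The '?' substitution at
the output boundary is where the fifth point above shows up: text such as
U+20AC that the output encoding cannot represent is lost there.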

Under Windows, this still allows code page-based code to work, subject to
the constraint that the narrow literal encoding, the ANSI code page,
and the locale all agree.

Received on 2024-05-08 18:23:49