Date: Wed, 8 May 2024 14:50:44 -0400
On 5/8/24 2:23 PM, Peter Dimov wrote:
> Tom Honermann wrote:
>> On 5/8/24 12:54 PM, Victor Zverovich wrote:
>>
>>
>> > > The ASCII and EBCDIC code page based locale programming model
>> > > used on POSIX and Windows systems is not broken.
>> >
>> > It is actually broken on Windows for reasons explained in
>> > https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2093r14.html.
>>
>> I'm not sure what you are referring to as broken. If you are referring to
>> characters not being displayed correctly in the Windows console due to the
>> console using a code page that is not aligned with the locale/environment by
>> default (because of backward compatibility with old DOS applications), then
>> yes, that is broken, but it is broken due to the inconsistent encoding selection,
>> not due to the code page based programming model. The same behavior
>> would be exhibited on Linux if the terminal encoding was changed to CP437.
> There's so much broken with the code page model, honestly.
It is important to distinguish between the code page model and the
standard library.
I heartily agree that the standard library is broken in multiple ways,
some of which you describe below.
>
> First and least important, a code page is wchar_t[256], which doesn't really
> match what important parts of the world use. This is burned into our
> ctype::widen model, making it not very useful.
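For concreteness, a minimal sketch of that model (the locale name is
whatever the environment supplies):

    #include <locale>
    #include <iostream>

    int main() {
        std::locale loc("");  // the user's preferred locale
        const auto& ct = std::use_facet<std::ctype<wchar_t>>(loc);
        // widen maps exactly one char to exactly one wchar_t, so a
        // multibyte sequence such as UTF-8 U+00E9 (0xC3 0xA9) cannot
        // pass through this interface one byte at a time.
        wchar_t w = ct.widen('A');
        std::wcout << w << L'\n';
    }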
>
> Second, every locale facet can have its own code page, there's no guarantee
> that those match, or agree.
Agreed. POSIX states that mixing locale categories with different
character sets results in undefined behavior
<https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html>.
"If different character sets are used by the locale categories, the
results achieved by an application utilizing these categories are
undefined. Likewise, if different codesets are used for the data being
processed by interfaces whose behavior is dependent on the current
locale, or the codeset is different from the codeset assumed when the
locale was created, the result is also undefined."
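A minimal sketch of how a program wanders into that undefined territory
(the locale names are platform-specific and purely illustrative):

    #include <clocale>

    int main() {
        // Each category can be set independently; nothing stops the
        // codesets from disagreeing.
        std::setlocale(LC_CTYPE, "en_US.UTF-8");
        std::setlocale(LC_COLLATE, "ru_RU.KOI8-R");  // different codeset
        // Per the POSIX wording above, any interface that depends on
        // both categories now has undefined results.
    }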
>
> Third, there's additionally an "active ANSI code page" under Windows, which
> also need not match any of the locale facets.
>
> Fourth, there's the console code page, which almost certainly doesn't match
> any of the above.
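That mismatch is easy to observe (untested sketch, assuming a Windows
build):

    #include <windows.h>
    #include <cstdio>

    int main() {
        // On a typical US-English installation this prints 1252 and
        // 437: the ANSI and console code pages disagree out of the box.
        std::printf("ANSI code page:    %u\n", GetACP());
        std::printf("Console code page: %u\n", GetConsoleOutputCP());
    }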
>
> Fifth, user input (under Windows) and file names (on NTFS, HFS+) use
> UTF-16, and as a result, may not even be representable in any of the above
> code pages.
That is true on POSIX systems as well. File names do not have strongly
associated encodings. Windows doesn't enforce valid UTF-16 file names.
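For example (untested sketch, Windows), an unpaired surrogate is
accepted as a file name even though it is not valid UTF-16:

    #include <windows.h>

    int main() {
        // 0xD800 is an unpaired high surrogate; NTFS stores the name
        // anyway, so file names are sequences of 16-bit units, not
        // necessarily well-formed UTF-16.
        const wchar_t name[] = { L'f', wchar_t(0xD800), L'\0' };
        HANDLE h = CreateFileW(name, GENERIC_WRITE, 0, nullptr,
                               CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL,
                               nullptr);
        if (h != INVALID_HANDLE_VALUE)
            CloseHandle(h);
    }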
>
> Sixth, the literal encoding is burned into the executable at compile time,
> but all of the above may arbitrarily vary at runtime.
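A small sketch of that mismatch (the exact literal bytes depend on the
compiler's execution charset flags):

    #include <cstdio>

    int main() {
        // The bytes of this literal are fixed when the translation
        // unit is compiled (0xC3 0xA9 under /utf-8, 0xE9 under
        // CP1252), but the code page used to interpret them at
        // output is chosen at runtime and may be neither.
        std::puts("é");
    }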
>
> It's possible to make all this somehow work in a controlled environment,
> but it's not possible to write code that is robust against environmental
> changes.
I find the latter statement disproven by the large amount of code that
is so written and deployed all around the world.
>
> As I said before, we should take care of not breaking existing working
> code page-based code, but we shouldn't invest any effort in trying to make
> it possible for new code - which we know is new because it uses Unicode
> character types - to be written against a code page-based model.
I disagree. I think new character types can be particularly useful in
environments that cannot move to UTF-8.
>
> Our target should be code that does
>
> input encoding -> (program uses intermediate UTF-8 throughout) -> output encoding
I strongly agree that we should enable and promote that model.
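A minimal sketch of the input-side boundary conversion on Windows (the
helper name is my own; MultiByteToWideChar/WideCharToMultiByte are the
relevant Win32 calls, and error handling is omitted):

    #include <windows.h>
    #include <string>

    // Hypothetical helper: bytes in the active ANSI code page in,
    // UTF-8 out, pivoting through UTF-16.
    std::string acp_to_utf8(const std::string& in) {
        int wlen = MultiByteToWideChar(CP_ACP, 0, in.data(),
                                       (int)in.size(), nullptr, 0);
        std::wstring w(wlen, L'\0');
        MultiByteToWideChar(CP_ACP, 0, in.data(), (int)in.size(),
                            w.data(), wlen);
        int ulen = WideCharToMultiByte(CP_UTF8, 0, w.data(), wlen,
                                       nullptr, 0, nullptr, nullptr);
        std::string out(ulen, '\0');
        WideCharToMultiByte(CP_UTF8, 0, w.data(), wlen,
                            out.data(), ulen, nullptr, nullptr);
        return out;
    }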
>
> and, for compatibility's sake, we should specify the intermediate
> encoding as the narrow literal encoding, with the expectation that people
> who want to have reliably working programs will set their narrow literal
> encoding to UTF-8.
The "narrow literal encoding" (informally since it isn't a defined term)
encompasses both "" and u8"" these days. See [lex.charset]p8
<http://eel.is/c++draft/lex.charset#8> and [basic.fundamental]p7
<http://eel.is/c++draft/basic.fundamental#7> and
This keeps neglecting the basic fact that there are implementations and
ecosystems that cannot adopt what you are suggesting. Not now, not in
the near term, probably never.
>
> Under Windows, this enables code page-based code subject to the
> constraints that the narrow literal encoding, the ANSI code page,
> and the locale all agree.
Agreed.
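For completeness, one way to satisfy those constraints on recent
Windows (untested sketch; assumes compiling with /utf-8 and a UCRT new
enough to support the ".UTF-8" locale):

    #include <windows.h>
    #include <clocale>
    #include <cstdio>

    int main() {
        // /utf-8 makes the narrow literal encoding UTF-8; align the
        // locale and the console code page with it at startup.
        std::setlocale(LC_ALL, ".UTF-8");
        SetConsoleOutputCP(CP_UTF8);
        std::puts("grüß dich, 世界");
    }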
Tom.
Received on 2024-05-08 18:50:49