During the recent SG16 reviews of P2093 in the 2020-11-11
and 2020-12-09
telecons and on the SG16 mailing list (here,
continued here),
I expressed a preference for the proposed std::print()
to transcode as necessary to match the system/run-time encoding
when output is not directly connected to a terminal/console. I also
noted that, on Windows, this would mean converting output to match the
Active Code Page (ACP). The following tests indicate that the
situation is more complicated on Windows.
My preference for matching the system/run-time encoding is motivated by process interactions such as those that occur in command pipelines. The following experiments were conducted to investigate current behavior and existing practice.
All code examples were compiled using the Microsoft Visual C++
2019 x64 compiler. All experiments were performed on a Windows 10
system with the region system locale set to
"English (United States)" (ACP of Windows-1252, default console
encoding of CP437). Note that these experiments are not concerned
with the proper display of text in the Windows console, but rather
with encoding expectations in piped text.
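For readers who want to check their own configuration, the code pages described above can be queried programmatically. The following is a minimal sketch and was not part of the experiments; the CodePages struct and the query_code_pages() and describe() helpers are hypothetical names introduced here. GetACP(), GetConsoleCP(), and GetConsoleOutputCP() are the Win32 calls assumed to correspond to the ACP and console encodings discussed in this paper; the non-Windows branch is only a stub so that the snippet compiles elsewhere.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper type: the code pages relevant to these experiments.
struct CodePages {
    unsigned acp;             // Active Code Page, e.g. 1252
    unsigned console_input;   // console input code page, e.g. 437
    unsigned console_output;  // console output code page, e.g. 437
};

// Query the current settings. On non-Windows systems this is a stub
// that returns zeros, since the concept does not apply there.
CodePages query_code_pages() {
#ifdef _WIN32
    return { GetACP(), GetConsoleCP(), GetConsoleOutputCP() };
#else
    return { 0, 0, 0 };
#endif
}

// Format the settings for display.
std::string describe(const CodePages &cp) {
    return "ACP=" + std::to_string(cp.acp) +
           " console-in=" + std::to_string(cp.console_input) +
           " console-out=" + std::to_string(cp.console_output);
}
```

On a system configured as described above, describe(query_code_pages()) would be expected to report 1252 for the ACP and 437 for both console code pages.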
The first thing I tried was writing a simple program that writes
"téxt" in each of the UTF-8, Windows-1252, and CP437 encodings and
then checking to see which encoding was matched by the Windows find
and findstr utilities. The following code was saved as
print-text.cpp and compiled with cl
/Feprint-text.exe /EHsc print-text.cpp.
#include <iostream>

int main() {
    std::cout << "UTF-8: t\xC3\xA9xt\n";
    std::cout << "W1252: t\xE9xt\n";
    std::cout << "CP437: t\x82xt\n";
}
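The escape sequences in the program above are the encodings of é (U+00E9) in each of the three encodings. As a minimal sketch of where the bytes come from, the following derives the UTF-8 form and records the single-byte forms as constants. The names utf8_encode_small, kEAcuteW1252, and kEAcuteCP437 are hypothetical and were not part of the experiment; the encoder deliberately handles only code points below U+0800.

```cpp
#include <string>

// é in the two single-byte code pages used in the experiment.
const unsigned char kEAcuteW1252 = 0xE9;  // Windows-1252
const unsigned char kEAcuteCP437 = 0x82;  // CP437

// Encode a code point below U+0800 as UTF-8 (one or two bytes).
// Larger code points are out of scope for this sketch.
std::string utf8_encode_small(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                  // 0xxxxxxx
    } else {
        out += static_cast<char>(0xC0 | (cp >> 6));    // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));  // 10xxxxxx
    }
    return out;
}
```

For U+00E9 the two-byte branch yields 0xC3 0xA9, which is the sequence written on the "UTF-8:" line.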
Here are the results with the console encoding set to the default
CP437. Both find and findstr match the CP437
encoded text.
>chcp
Active code page: 437
>print-text | find "téxt"
CP437: téxt
>print-text | findstr "téxt"
CP437: téxt
Changing the console encoding to Windows-1252 produced the first
surprising result. find now matched the Windows-1252
encoded text, but findstr continued to match the CP437
text. The incorrect display of the CP437 encoded text for the findstr
case is expected due to the console encoding change.
>chcp 1252
Active code page: 1252
>print-text | find "téxt"
W1252: téxt
>print-text | findstr "téxt"
CP437: t‚xt
Changing the console encoding to UTF-8 produced a similar result.
>chcp 65001
Active code page: 65001
>print-text | find "téxt"
UTF-8: téxt
>print-text | findstr "téxt"
CP437: txt
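The observations above can be modeled as a byte-wise substring search in which the pattern "téxt" is encoded in some code page before matching. The following is only a sketch of that model, not a description of how find or findstr are implemented; print_text_lines(), pattern_bytes(), and matching_lines() are hypothetical names.

```cpp
#include <string>
#include <vector>

// The three lines written by print-text, as raw bytes.
std::vector<std::string> print_text_lines() {
    return {
        "UTF-8: t\xC3\xA9xt",
        "W1252: t\xE9xt",
        "CP437: t\x82xt",
    };
}

// The pattern "téxt" as bytes in the given code page.
// Only the three code pages from the experiment are modeled.
std::string pattern_bytes(unsigned code_page) {
    switch (code_page) {
        case 65001: return "t\xC3\xA9xt";  // UTF-8
        case 1252:  return "t\xE9xt";      // Windows-1252
        case 437:   return "t\x82xt";      // CP437
        default:    return "";             // not modeled
    }
}

// Return the lines that contain the pattern bytes for that code page.
std::vector<std::string> matching_lines(unsigned code_page) {
    std::vector<std::string> matches;
    const std::string pattern = pattern_bytes(code_page);
    if (pattern.empty()) return matches;
    for (const std::string &line : print_text_lines()) {
        if (line.find(pattern) != std::string::npos) matches.push_back(line);
    }
    return matches;
}
```

In this model, the find results correspond to matching_lines() called with the current console code page (437, 1252, or 65001), while the findstr results correspond to matching_lines(437) regardless of the console setting.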
Conclusions:
The previous experiment probed what encoding is expected for input by the find and findstr utilities. This next one probed what encoding is used when producing output that will be input to another process.
Consider a utility that displays a message provided either on the command line or via stdin. The following source code implements such a utility:
#include <windows.h>
#include <iostream>
#include <string>

void usage(const char *program_name) {
    std::cerr << "Usage: " << program_name << " [<message>]\n";
}

int main(int argc, char **argv) {
    if (argc > 2) {
        usage(argv[0]);
        return 1;
    }
    std::string message_text;
    if (argc > 1) {
        message_text = argv[1];
    } else {
        std::getline(std::cin, message_text);
    }
    MessageBox(NULL, message_text.c_str(), "Message", MB_OK);
}
This code was saved as display-message.cpp and compiled with cl /Fedisplay-message.exe /EHsc display-message.cpp user32.lib to produce an executable named display-message.exe. Note that the source code contains no non-ASCII characters, so compilation with the /utf-8 option would currently have no effect.
The console encoding was reset to the default of CP437 before running the utility.
>chcp 437
Active code page: 437
The utility was then run with the following two commands with the
expectation that each would produce the same observed behavior.
> display-message "I like pâté"
> echo "I like pâté" | display-message
I was surprised to find that different behavior was exhibited.
In the first case, the message was displayed as expected. What
happens in this case is that a UTF-16 encoded command line is
constructed and provided to the new process. That command line is
then transcoded (presumably by mainCRTStartup()) to the
ACP as part of preparing the parsed arguments passed to main().
The result is an ACP encoded message being passed to MessageBox()
(which expects ACP encoded text).
In the second case, the non-ASCII characters are not displayed as
expected. What appears to happen in this case is that the Unicode
text entered at the console is transcoded to the console encoding
when the output is redirected. The result is that the program
receives console encoded input in stdin which is then passed to MessageBox()
(which expects ACP encoded text).
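The mismatch in the second case can be illustrated by reinterpreting CP437 bytes as Windows-1252. The following is a minimal, portable sketch, assuming the configuration used in these experiments (console encoding CP437, ACP Windows-1252). The function w1252_bytes_to_utf8() is a hypothetical helper that handles only the two non-ASCII bytes needed here and produces UTF-8 so the result can be inspected anywhere.

```cpp
#include <string>

// Interpret each byte as Windows-1252 and emit UTF-8.
// Only the bytes needed for this example are mapped; any other
// non-ASCII byte is replaced with '?'.
std::string w1252_bytes_to_utf8(const std::string &bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        if (b < 0x80) {
            out += static_cast<char>(b);  // ASCII is identical in both
        } else if (b == 0x82) {
            out += "\xE2\x80\x9A";        // U+201A, single low-9 quotation mark
        } else if (b == 0x83) {
            out += "\xC6\x92";            // U+0192, Latin small letter f with hook
        } else {
            out += '?';
        }
    }
    return out;
}
```

In CP437, "pâté" is the byte sequence p 0x83 t 0x82. Passing those bytes through w1252_bytes_to_utf8() yields "pƒt‚", which is the kind of substitution that would be expected when console encoded input is handed to an API that assumes the ACP.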
In order to validate that the use of the console encoding was not
somehow specific to the echo command, I retried the
experiment using dir and files named résumé.txt and þeta.txt
(þ lacks representation in CP437). For both files, dir
was able to display the correct name to the console regardless
of whether the current console encoding was CP437 or
Windows-1252 (dir appears to use the Unicode console
APIs when output is sent directly to the console,
analogous to what is proposed in P2093). However, when
the output was piped to display-message, the output
was produced in the console encoding, analogous to echo
above.
Conclusions:
The previous experiment prompted me to question what encoding is used for .bat files. To answer that question, I created CP437 and Windows-1252 encoded .bat files with the two commands from the prior experiment:
@echo off
display-message "I like pâté"
echo "I like pâté" | display-message
Each of these was then run with matched and mismatched console encodings.
>chcp 437
Active code page: 437
>test-display-message-cp437.bat
>test-display-message-w1252.bat
>chcp 1252
Active code page: 1252
>test-display-message-cp437.bat
>test-display-message-w1252.bat
Conclusions: