I ran an experiment today that I have been meaning to do for a while. It is relevant to this paper and is perhaps a worthwhile addition to it.

Microsoft code page 932 (Microsoft's Shift JIS variant) defines a number of code points that do not round trip through Unicode due to having duplicate code point assignments that are (quite reasonably) not duplicated in Unicode. One of them is:

0x8795 -> U+221a -> 0x81e3 Square Root
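To make the asymmetry concrete, here is a minimal sketch of the mapping above. The function names (cp932_to_unicode, unicode_to_cp932, round_trips) are hypothetical, and the tables contain only the two code points quoted in this message, not a real CP932 conversion table. It is compile-only, like the test case further down.

```cpp
#include <cstdint>

// Hypothetical two-entry excerpt of the CP932 <-> Unicode mapping.
// Only the assignments quoted in this message are modeled.

// CP932 double-byte code -> Unicode code point (0 means "not in this excerpt").
constexpr std::uint32_t cp932_to_unicode(std::uint16_t code) {
    switch (code) {
        case 0x81e3: return 0x221a; // Square Root
        case 0x8795: return 0x221a; // Square Root (duplicate assignment)
        default:     return 0;
    }
}

// Unicode code point -> CP932 code. A code point has a single target,
// so only one of the two duplicates can be produced.
constexpr std::uint16_t unicode_to_cp932(std::uint32_t cp) {
    switch (cp) {
        case 0x221a: return 0x81e3;
        default:     return 0;
    }
}

// True if converting to Unicode and back yields the original code.
constexpr bool round_trips(std::uint16_t code) {
    return unicode_to_cp932(cp932_to_unicode(code)) == code;
}

static_assert(round_trips(0x81e3), "0x81e3 is preserved");
static_assert(!round_trips(0x8795), "0x8795 comes back as 0x81e3");
```

Any conversion that goes through Unicode behaves like round_trips here: the duplicate is folded into the single code that U+221A maps back to.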

The following test case demonstrates existing behavior as exhibited by gcc (11.0.0 snapshot) and Visual C++ (2019). Both of these compilers accept the test case when compiled with the command lines shown. Repeating the experiment will require substituting the replacement characters in the string literal with the indicated Shift JIS double byte sequence (or using the attached file if it survives transmission).

constexpr char sx8795[] = "��"; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932 0x81e3
static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932 0x81e3
static_assert((unsigned char)sx8795[2] == 0);
constexpr char sx81e3[] = "��"; // 0x81 0xe3 => CP932 0x81e3 == U+221A (Square Root)
static_assert((unsigned char)sx81e3[0] == 0x81); // Preserved
static_assert((unsigned char)sx81e3[1] == 0xe3); // Preserved
static_assert((unsigned char)sx81e3[2] == 0);

$ g++ -c -finput-charset=cp932 -fexec-charset=cp932 -std=c++17 t.cpp

$ cl /c /std:c++17 /source-charset:.932 /execution-charset:.932 t.cpp

Note that both compilers converted the source 0x8795 double byte sequence to 0x81e3; the original source bytes were not preserved.

However, both compilers fail the test case if character set conversions are not specified:

$ g++ -c -std=c++17 t.cpp
t.cpp:2:40: error: static assertion failed
    2 | static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932 0x81e3
      |               ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
t.cpp:3:40: error: static assertion failed
    3 | static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932 0x81e3
      |               ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~

$ cl /nologo /diagnostics:caret /c /std:c++17 t.cpp
t.cpp
t.cpp(2,40): error C2607: static assertion failed
static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932 0x81e3
                                       ^
t.cpp(3,40): error C2607: static assertion failed
static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932 0x81e3

Note that these double byte sequences are not valid UTF-8 sequences, so the pass-the-bytes mode exhibited by gcc is not an artifact of normal UTF-8 handling. The sequences are valid for Windows-1252 (which is the default encoding used by Visual C++ on the system I tested on), so this test is not indicative of a pass-the-bytes mode for Visual C++.
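The claim that neither sequence is valid UTF-8 can be checked from the lead bytes alone. The sketch below uses hypothetical helper names (is_utf8_continuation, can_start_utf8_sequence) and relies only on the standard UTF-8 byte ranges: a well-formed sequence starts with a byte in 0x00-0x7F or 0xC2-0xF4, and bytes of the form 10xxxxxx are continuation bytes.

```cpp
// Hypothetical helpers for classifying a byte by its possible role in
// well-formed UTF-8; this is not a full validator.

// Continuation bytes have the bit pattern 10xxxxxx (0x80-0xBF).
constexpr bool is_utf8_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// A well-formed sequence starts with an ASCII byte or a lead byte in 0xC2-0xF4.
constexpr bool can_start_utf8_sequence(unsigned char b) {
    return b <= 0x7F || (b >= 0xC2 && b <= 0xF4);
}

// Both Shift JIS sequences in the test case begin with a continuation byte,
// so neither can be the start of a well-formed UTF-8 sequence.
static_assert(is_utf8_continuation(0x87), "first byte of 0x87 0x95");
static_assert(!can_start_utf8_sequence(0x87), "0x87 0x95 is ill-formed UTF-8");
static_assert(is_utf8_continuation(0x81), "first byte of 0x81 0xe3");
static_assert(!can_start_utf8_sequence(0x81), "0x81 0xe3 is ill-formed UTF-8");
```

Since each string literal's first byte is a stray continuation byte, a UTF-8 decoder has to reject or replace it; it cannot pass the bytes through as decoded text.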

Finally, it is worth noting that explicitly treating the source as UTF-8 does not cause any additional errors for gcc, but does for Visual C++ (gcc does not diagnose the ill-formed UTF-8 sequences in the string literals, but Visual C++ does).

$ g++ -c -finput-charset=utf-8 -std=c++17 t.cpp
t.cpp:2:40: error: static assertion failed
    2 | static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932 0x81e3
      |               ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
t.cpp:3:40: error: static assertion failed
    3 | static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932 0x81e3
      |               ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~

$ cl /nologo /diagnostics:caret /c /std:c++17 /utf-8 t.cpp
t.cpp
t.cpp(1,1): warning C4828: The file contains a character starting at offset 0x1b that is illegal in the current source character set (codepage 65001).
constexpr char sx8795[] = ""; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
^
t.cpp(1,1): warning C4828: The file contains a character starting at offset 0x1c that is illegal in the current source character set (codepage 65001).
constexpr char sx8795[] = ""; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
^
t.cpp(1,1): warning C4828: The file contains a character starting at offset 0x13e that is illegal in the current source character set (codepage 65001).
constexpr char sx8795[] = ""; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
^
t.cpp(1,1): warning C4828: The file contains a character starting at offset 0x13f that is illegal in the current source character set (codepage 65001).
constexpr char sx8795[] = ""; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
^
t.cpp(2,40): error C2607: static assertion failed
static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932 0x81e3
                                       ^
t.cpp(3,40): error C2607: static assertion failed
static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932 0x81e3
                                       ^
t.cpp(5,27): error C2001: newline in constant
constexpr char sx81e3[] = ""; // 0x81 0xe3 => CP932 0x81e3 == U+221A (Square Root)
                          ^
t.cpp(6,1): error C2143: syntax error: missing ';' before 'static_assert'
static_assert((unsigned char)sx81e3[0] == 0x81); // Preserved
^
t.cpp(8,40): error C2607: static assertion failed
static_assert((unsigned char)sx81e3[2] == 0);
                                       ^

On 9/9/20 11:40 AM, Tom Honermann via SG16 wrote:

On 8/24/20 8:31 AM, Peter Brett via SG16 wrote: