sg16: Re: [SG16] P2194R0 The character set of C++ source code is Unicode

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 9 Sep 2020 12:42:13 -0400

I conducted an experiment today that I've been meaning to do for a while
now. It is relevant to this paper and is perhaps a worthwhile addition
to it.

Microsoft code page 932 (Microsoft's Shift JIS variant) defines a number
of code points that do not round trip through Unicode due to having
duplicate code point assignments that are (quite reasonably) not
duplicated in Unicode. One of them is:

    0x8795 -> U+221A -> 0x81e3 (Square Root)
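As an illustration only, the mapping above can be modeled with a
two-entry table. The function names (cp932_to_unicode, unicode_to_cp932,
round_trip) are hypothetical and cover just the code points discussed
here; this is a sketch of the round trip problem, not of how either
compiler implements its conversions.

```cpp
#include <cstdint>

// Hypothetical two-entry excerpt of the CP932 <-> Unicode mapping.
// Only the code points discussed in this message are modeled.
constexpr char32_t kSquareRoot = 0x221A;   // U+221A SQUARE ROOT
constexpr char32_t kReplacement = 0xFFFD;  // stand-in for "not modeled"

constexpr char32_t cp932_to_unicode(std::uint16_t cp) {
    switch (cp) {
        case 0x8795: return kSquareRoot;   // duplicate assignment
        case 0x81E3: return kSquareRoot;   // the assignment Unicode maps back to
        default:     return kReplacement;
    }
}

constexpr std::uint16_t unicode_to_cp932(char32_t c) {
    // U+221A has a single reverse mapping, so 0x8795 cannot be recovered.
    return c == kSquareRoot ? 0x81E3 : 0x0000;
}

constexpr std::uint16_t round_trip(std::uint16_t cp) {
    return unicode_to_cp932(cp932_to_unicode(cp));
}

static_assert(round_trip(0x81E3) == 0x81E3); // preserved
static_assert(round_trip(0x8795) == 0x81E3); // not preserved
```

Both inputs reach the same Unicode code point, so any conversion that
goes through Unicode has to pick one CP932 value on the way back.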

The following test case demonstrates existing behavior as exhibited by
gcc (11.0.0 snapshot) and Visual C++ (2019). Both of these compilers
accept the test case when compiled with the command lines shown.
Repeating the experiment will require substituting the replacement
characters in the string literal with the indicated Shift JIS double
byte sequence (or using the attached file if it survives transmission).

    $ cat t.cpp
    constexpr char sx8795[] = "��"; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
    static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932 0x81e3
    static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932 0x81e3
    static_assert((unsigned char)sx8795[2] == 0);
    constexpr char sx81e3[] = "��"; // 0x81 0xe3 => CP932 0x81e3 == U+221A (Square Root)
    static_assert((unsigned char)sx81e3[0] == 0x81); // Preserved
    static_assert((unsigned char)sx81e3[1] == 0xe3); // Preserved
    static_assert((unsigned char)sx81e3[2] == 0);

    $ g++ -c -finput-charset=cp932 -fexec-charset=cp932 -std=c++17 t.cpp
    <no errors>

    $ cl /c /std:c++17 /source-charset:.932 /execution-charset:.932 t.cpp
    <no errors>

Note that both compilers converted the source 0x8795 double byte
sequence to 0x81e3; the original source bytes were not preserved.

However, both compilers fail the test case if character set conversions
are not specified:

    $ g++ -c -std=c++17 t.cpp
    t.cpp:2:40: error: static assertion failed
         2 | static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932 0x81e3
           | ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
    t.cpp:3:40: error: static assertion failed
         3 | static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932 0x81e3
           | ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~

    $ cl /nologo /diagnostics:caret /c /std:c++17 t.cpp
    t.cpp
    t.cpp(2,40): error C2607: static assertion failed
    static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932 0x81e3
                                            ^
    t.cpp(3,40): error C2607: static assertion failed
    static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932 0x81e3
                                            ^

Note that these double byte sequences are not valid UTF-8 sequences, so
the pass-the-bytes mode exhibited by gcc is not an artifact of normal
UTF-8 handling. The sequences are valid for Windows-1252 (which is the
default encoding used by Visual C++ on the system I tested on), so this
test is not indicative of a pass-the-bytes mode for Visual C++.
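To make the first point concrete, here is a minimal sketch of a
structural UTF-8 check applied to the two byte pairs. The helper names
(is_continuation, utf8_structure_ok) are hypothetical. The check looks
only at lead and continuation byte patterns and deliberately omits the
overlong and surrogate checks a full validator needs, which is enough
to reject both sequences.

```cpp
#include <cstddef>

// A continuation byte has the bit pattern 10xxxxxx.
constexpr bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Simplified structural check: every lead byte must be followed by the
// right number of continuation bytes. Overlong encodings and surrogates
// are NOT rejected here.
constexpr bool utf8_structure_ok(const unsigned char* p, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = p[i];
        std::size_t extra = 0;
        if (b < 0x80)                    extra = 0; // ASCII
        else if (b >= 0xC2 && b <= 0xDF) extra = 1; // 2-byte lead
        else if (b >= 0xE0 && b <= 0xEF) extra = 2; // 3-byte lead
        else if (b >= 0xF0 && b <= 0xF4) extra = 3; // 4-byte lead
        else return false;               // stray continuation or invalid lead
        if (extra > n - i - 1) return false;        // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k)
            if (!is_continuation(p[i + k])) return false;
        i += extra + 1;
    }
    return true;
}

constexpr unsigned char x8795[]     = {0x87, 0x95};       // CP932 0x8795
constexpr unsigned char x81e3[]     = {0x81, 0xE3};       // CP932 0x81e3
constexpr unsigned char sqrt_utf8[] = {0xE2, 0x88, 0x9A}; // U+221A in UTF-8

static_assert(!utf8_structure_ok(x8795, 2)); // 0x87 is a stray continuation byte
static_assert(!utf8_structure_ok(x81e3, 2)); // 0x81 is a stray continuation byte
static_assert(utf8_structure_ok(sqrt_utf8, 3));
```

Both CP932 pairs start with a byte in the 0x80-0xBF range, which can
never begin a UTF-8 sequence.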

Finally, it is worth noting that explicitly treating the source as UTF-8
does not cause any additional errors for gcc, but does for Visual C++:
gcc does not diagnose the ill-formed UTF-8 sequences in the string
literals, whereas Visual C++ does.

    $ g++ -c -finput-charset=utf-8 -std=c++17 t.cpp
    t.cpp:2:40: error: static assertion failed
         2 | static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932 0x81e3
           | ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
    t.cpp:3:40: error: static assertion failed
         3 | static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932 0x81e3
           | ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~

    $ cl /nologo /diagnostics:caret /c /std:c++17 /utf-8 t.cpp
    t.cpp
    t.cpp(1,1): warning C4828: The file contains a character starting at offset 0x1b that is illegal in the current source character set (codepage 65001).
    constexpr char sx8795[] = ""; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
    ^
    t.cpp(1,1): warning C4828: The file contains a character starting at offset 0x1c that is illegal in the current source character set (codepage 65001).
    constexpr char sx8795[] = ""; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
    ^
    t.cpp(1,1): warning C4828: The file contains a character starting at offset 0x13e that is illegal in the current source character set (codepage 65001).
    constexpr char sx8795[] = ""; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
    ^
    t.cpp(1,1): warning C4828: The file contains a character starting at offset 0x13f that is illegal in the current source character set (codepage 65001).
    constexpr char sx8795[] = ""; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
    ^
    t.cpp(2,40): error C2607: static assertion failed
    static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932 0x81e3
                                            ^
    t.cpp(3,40): error C2607: static assertion failed
    static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932 0x81e3
                                            ^
    t.cpp(5,27): error C2001: newline in constant
    constexpr char sx81e3[] = ""; // 0x81 0xe3 => CP932 0x81e3 == U+221A (Square Root)
                               ^
    t.cpp(6,1): error C2143: syntax error: missing ';' before 'static_assert'
    static_assert((unsigned char)sx81e3[0] == 0x81); // Preserved
    ^
    t.cpp(8,40): error C2607: static assertion failed
    static_assert((unsigned char)sx81e3[2] == 0);
                                            ^

Tom.

On 9/9/20 11:40 AM, Tom Honermann via SG16 wrote:
> On 8/24/20 8:31 AM, Peter Brett via SG16 wrote:
>> Hi all,
>>
>> In this week's meeting, we are going to discuss the remaining
>> proposals from P2178R1 "Misc lexing and string handling improvements".
>> In particular, we will discuss proposal 9:
>>
>> Proposal 9: Reaffirming Unicode as the character set of the
>> internal representation
>>
>> In anticipation of a lively discussion, Corentin and I have written a
>> short new paper which will be appearing in the September mailing.
>>
>> P2194R0 The character set of C++ source code is Unicode
>> https://isocpp.org/files/papers/P2194R0.pdf
>
> In preparation for this discussion, please also (re-)read section
> 5.2.1 of the C99 Rationale document
> <http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf>; in
> particular the "UCN Models" section on pages 20 and 21.
>
> Tom.
>
>> We hope that the study group finds this contribution helpful and
>> informative.
>>
>> Best regards,
>>
>> Peter
>>
>
>

Received on 2020-09-09 11:45:44