sg16: Re: [SG16] P2194R0 The character set of C++ source code is Unicode

From: Corentin <corentin.jabot_at_[hidden]>
Date: Wed, 9 Sep 2020 19:26:30 +0200

On Wed, Sep 9, 2020, 18:42 Tom Honermann <tom_at_[hidden]> wrote:

> I conducted an experiment today that I've been meaning to do for a while
> now and that is relevant for this paper (and perhaps a worthwhile addition
> to the paper).
>

Thanks Tom, this is very interesting.
But it is only marginally relevant to the paper which, if I remember
correctly, goes into some detail about round-tripping.
Round-tripping is not something the paper tries to either prevent or mandate.
We only seek to mandate that there exists a transformation from source to
Unicode, which neither implies nor requires that the transformation be
reversible.
Supporting round-tripping, or not, is possible within these constraints
and is never observable from the program. And as you observed, round-tripping
doesn't match existing practice, except on some EDG-derived compilers.
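
To make the "not reversible" point concrete, here is a minimal sketch of
such a transformation. The two-entry table and the function names are
hypothetical and only cover the Square Root case from Tom's experiment; this
is not how any compiler actually implements the conversion.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical two-entry excerpt of a CP932 -> Unicode mapping (illustration only).
constexpr std::optional<char32_t> cp932_to_unicode(std::uint16_t cp) {
    switch (cp) {
        case 0x8795: return char32_t{0x221A}; // duplicate assignment of SQUARE ROOT
        case 0x81E3: return char32_t{0x221A}; // the assignment chosen on the way back
        default:     return std::nullopt;     // everything else is outside this sketch
    }
}

// Unicode -> CP932 has to pick one of the two source code points.
constexpr std::optional<std::uint16_t> unicode_to_cp932(char32_t c) {
    if (c == char32_t{0x221A}) return std::uint16_t{0x81E3};
    return std::nullopt;
}

// Source -> Unicode -> execution encoding, as in the cp932/cp932 command lines.
constexpr std::optional<std::uint16_t> round_trip(std::uint16_t cp) {
    const auto u = cp932_to_unicode(cp);
    if (!u) return std::nullopt;
    return unicode_to_cp932(*u);
}

// The forward mapping exists for both inputs, but it is not injective,
// so 0x8795 cannot be recovered once it has been mapped to U+221A.
static_assert(round_trip(0x81E3) == 0x81E3);
static_assert(round_trip(0x8795) == 0x81E3);

// Runtime version of the same checks.
inline void check_round_trip() {
    assert(round_trip(0x81E3).value() == 0x81E3);
    assert(round_trip(0x8795).value() == 0x81E3);
}
```

A mapping like this satisfies "there exists a transformation from source to
Unicode" without saying anything about whether the original bytes survive.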

> Microsoft code page 932 (Microsoft's Shift JIS variant) defines a number
> of code points that do not round trip through Unicode due to having
> duplicate code point assignments that are (quite reasonably) not duplicated
> in Unicode. One of them is:
>
> 0x8795 -> U+221a -> 0x81e3 Square Root
>
> The following test case demonstrates existing behavior as exhibited by gcc
> (11.0.0 snapshot) and Visual C++ (2019). Both of these compilers accept
> the test case when compiled with the command lines shown. Repeating the
> experiment will require substituting the replacement characters in the
> string literal with the indicated Shift JIS double byte sequence (or using
> the attached file if it survives transmission).
>
> $ cat t.cpp
> constexpr char sx8795[] = "��"; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
> static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932 0x81e3
> static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932 0x81e3
> static_assert((unsigned char)sx8795[2] == 0);
> constexpr char sx81e3[] = "��"; // 0x81 0xe3 => CP932 0x81e3 == U+221A (Square Root)
> static_assert((unsigned char)sx81e3[0] == 0x81); // Preserved
> static_assert((unsigned char)sx81e3[1] == 0xe3); // Preserved
> static_assert((unsigned char)sx81e3[2] == 0);
>
> $ g++ -c -finput-charset=cp932 -fexec-charset=cp932 -std=c++17 t.cpp
> <no errors>
>
> $ cl /c /std:c++17 /source-charset:.932 /execution-charset:.932 t.cpp
> <no errors>
>
> Note that both compilers converted the source 0x8795 double byte sequence
> to 0x81e3; the original source bytes were not preserved.
>
> However, both compilers fail the test case if character set conversions
> are not specified:
>
> $ g++ -c -std=c++17 t.cpp
> t.cpp:2:40: error: static assertion failed
> 2 | static_assert((unsigned char)sx8795[0] == 0x81); // Converted to
> CP932 0x81e3
> | ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
> t.cpp:3:40: error: static assertion failed
> 3 | static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to
> CP932 0x81e3
> | ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
>
> $ cl /nologo /diagnostics:caret /c /std:c++17 t.cpp
> t.cpp
> t.cpp(2,40): error C2607: static assertion failed
> static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932
> 0x81e3
> ^
> t.cpp(3,40): error C2607: static assertion failed
> static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932
> 0x81e3
> ^
>
> Note that these double byte sequences are not valid UTF-8 sequences, so
> the pass-the-bytes mode exhibited by gcc is not an artifact of normal UTF-8
> handling. The sequences are valid for Windows-1252 (which is the default
> encoding used by Visual C++ on the system I tested on), so this test is not
> indicative of a pass-the-bytes mode for Visual C++.
>

When gcc assumes the source is UTF-8, iconv is not called and no check or
conversion is performed. I believe that's what you are seeing here.
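
For reference, a validating front end would reject both byte pairs. The
following is a rough sketch of a UTF-8 well-formedness check, assuming the
byte ranges from the Unicode Standard's table of well-formed UTF-8 byte
sequences. The function name is made up for illustration and this is not
code from gcc or Visual C++.

```cpp
#include <cstddef>

// Returns true if the n bytes at s are well-formed UTF-8.
constexpr bool is_well_formed_utf8(const unsigned char* s, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = s[i];
        if (b <= 0x7F) { ++i; continue; }   // ASCII, one byte
        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF; // allowed range for the second byte
        if (b >= 0xC2 && b <= 0xDF) {
            len = 2;
        } else if (b >= 0xE0 && b <= 0xEF) {
            len = 3;
            if (b == 0xE0) lo = 0xA0;       // excludes overlong forms
            if (b == 0xED) hi = 0x9F;       // excludes surrogates
        } else if (b >= 0xF0 && b <= 0xF4) {
            len = 4;
            if (b == 0xF0) lo = 0x90;       // excludes overlong forms
            if (b == 0xF4) hi = 0x8F;       // excludes values above U+10FFFF
        } else {
            return false;                   // 0x80..0xC1 and 0xF5..0xFF cannot lead
        }
        if (n - i < len) return false;      // truncated sequence
        if (s[i + 1] < lo || s[i + 1] > hi) return false;
        for (std::size_t k = 2; k < len; ++k)
            if (s[i + k] < 0x80 || s[i + k] > 0xBF) return false;
        i += len;
    }
    return true;
}

constexpr unsigned char cp932_8795[] = {0x87, 0x95};       // first literal in t.cpp
constexpr unsigned char cp932_81e3[] = {0x81, 0xE3};       // second literal in t.cpp
constexpr unsigned char utf8_221a[]  = {0xE2, 0x88, 0x9A}; // U+221A encoded as UTF-8

static_assert(!is_well_formed_utf8(cp932_8795, 2)); // 0x87 is a continuation byte
static_assert(!is_well_formed_utf8(cp932_81e3, 2)); // 0x81 is a continuation byte
static_assert(is_well_formed_utf8(utf8_221a, 3));
```

Both CP932 sequences start with a byte that can only appear as a
continuation byte, which is consistent with the C4828 warnings Visual C++
reports under /utf-8 below.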

> Finally, it is worth noting that explicitly treating the source as UTF-8
> does not cause any additional errors for gcc, but does for Visual C++ (Gcc
> does not diagnose the ill-formed UTF-8 sequences in the string literals,
> but Visual C++ does).
>
> $ g++ -c -finput-charset=utf-8 -std=c++17 t.cpp
> t.cpp:2:40: error: static assertion failed
> 2 | static_assert((unsigned char)sx8795[0] == 0x81); // Converted to
> CP932 0x81e3
> | ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
> t.cpp:3:40: error: static assertion failed
> 3 | static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to
> CP932 0x81e3
> | ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
>
> $ cl /nologo /diagnostics:caret /c /std:c++17 /utf-8 t.cpp
> t.cpp
> t.cpp(1,1): warning C4828: The file contains a character starting at
> offset 0x1b that is illegal in the current source character set (codepage
> 65001).
> constexpr char sx8795[] = ""; // 0x87 0x95 => CP932 0x8795 == U+221A
> (Square Root)
> ^
> t.cpp(1,1): warning C4828: The file contains a character starting at
> offset 0x1c that is illegal in the current source character set (codepage
> 65001).
> constexpr char sx8795[] = ""; // 0x87 0x95 => CP932 0x8795 == U+221A
> (Square Root)
> ^
> t.cpp(1,1): warning C4828: The file contains a character starting at
> offset 0x13e that is illegal in the current source character set (codepage
> 65001).
> constexpr char sx8795[] = ""; // 0x87 0x95 => CP932 0x8795 == U+221A
> (Square Root)
> ^
> t.cpp(1,1): warning C4828: The file contains a character starting at
> offset 0x13f that is illegal in the current source character set (codepage
> 65001).
> constexpr char sx8795[] = ""; // 0x87 0x95 => CP932 0x8795 == U+221A
> (Square Root)
> ^
> t.cpp(2,40): error C2607: static assertion failed
> static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932
> 0x81e3
> ^
> t.cpp(3,40): error C2607: static assertion failed
> static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932
> 0x81e3
> ^
> t.cpp(5,27): error C2001: newline in constant
> constexpr char sx81e3[] = ""; // 0x81 0xe3 => CP932 0x81e3 == U+221A
> (Square Root)
> ^
> t.cpp(6,1): error C2143: syntax error: missing ';' before 'static_assert'
> static_assert((unsigned char)sx81e3[0] == 0x81); // Preserved
> ^
> t.cpp(8,40): error C2607: static assertion failed
> static_assert((unsigned char)sx81e3[2] == 0);
> ^
>
> Tom.
>
> On 9/9/20 11:40 AM, Tom Honermann via SG16 wrote:
>
> On 8/24/20 8:31 AM, Peter Brett via SG16 wrote:
>
> Hi all,
>
> In this week's meeting, we are going to discuss the remaining
> proposals from P2178R1 "Misc lexing and string handling improvements".
> In particular, we will discuss proposal 9:
>
> Proposal 9: Reaffirming Unicode as the character set of the
> internal representation
>
> In anticipation of a lively discussion, Corentin and I have written a
> short new paper which will be appearing in the September mailing.
>
> P2194R0 The character set of C++ source code is Unicode
> https://isocpp.org/files/papers/P2194R0.pdf
>
> In preparation for this discussion, please also (re-)read section 5.2.1 of
> the C99 Rationale document
> <http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf>; in
> particular the "UCN Models" section on pages 20 and 21.
>
> Tom.
>
> We hope that the study group finds this contribution helpful and
> informative.
>
> Best regards,
>
> Peter
>

Received on 2020-09-09 12:30:16