On Wed, 9 Sep 2020 at 20:39, Tom Honermann <tom@honermann.net> wrote:

On 9/9/20 1:26 PM, Corentin wrote:

On Wed, Sep 9, 2020, 18:42 Tom Honermann <tom@honermann.net> wrote:

I conducted an experiment today that I've been meaning to do for a while now and that is relevant for this paper (and perhaps a worthwhile addition to the paper).

Thanks Tom, this is very interesting.

But it is only marginally relevant to the paper, which, if I remember correctly, goes into some detail about round-tripping.

It is not something the paper is trying to either prevent or mandate.

We only seek to mandate that there exists a transformation from source to Unicode, which neither implies nor requires that the transformation be reversible.

Supporting round-tripping (or not) is possible within these constraints and is never observable from the program. And, as you observed, it doesn't match existing practice, except on some EDG-derived compilers.

Discussion is not limited to what is proposed in the paper and is encouraged in order to probe the suitability of a proposal within the full complexity of the C++ ecosystem. This experiment and other discussion on the mailing list are intended to probe the full solution space. The goal of such questioning is to identify solutions that increase consensus.

Are you proposing that a specific round-tripping behavior for Shift JIS should be mandated?

Microsoft code page 932 (Microsoft's Shift JIS variant) defines a number of code points that do not round trip through Unicode due to having duplicate code point assignments that are (quite reasonably) not duplicated in Unicode. One of them is:

0x8795   -> U+221a   -> 0x81e3   Square Root
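As a rough illustration of why such a code point cannot round trip, here is a minimal sketch that models only the two CP932 entries above. The function names (cp932_to_unicode, unicode_to_cp932, round_trip) and the use of 0 and U+FFFD for "no mapping" are hypothetical, chosen for this example; a real converter (iconv, the Windows conversion APIs) uses the complete table.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical two-entry excerpt of the CP932 -> Unicode mapping.
// Both double byte codes map to the same Unicode code point.
constexpr char32_t cp932_to_unicode(std::uint16_t dbcs) {
    switch (dbcs) {
        case 0x8795: return U'\u221A'; // duplicate assignment of Square Root
        case 0x81E3: return U'\u221A'; // Square Root
        default:     return U'\uFFFD'; // assumption: anything else is unmapped here
    }
}

// Unicode -> CP932 can only pick one of the duplicates; per the mapping
// quoted above, U+221A goes back to 0x81e3.
constexpr std::uint16_t unicode_to_cp932(char32_t c) {
    return c == U'\u221A' ? 0x81E3 : 0x0000; // 0 stands for "no mapping" in this sketch
}

constexpr std::uint16_t round_trip(std::uint16_t dbcs) {
    return unicode_to_cp932(cp932_to_unicode(dbcs));
}

static_assert(round_trip(0x81E3) == 0x81E3); // preserved
static_assert(round_trip(0x8795) == 0x81E3); // not preserved: comes back as 0x81e3
```

Because the forward mapping is many-to-one, no choice of reverse mapping can restore both inputs; at most one of the duplicates survives.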

The following test case demonstrates existing behavior as exhibited by gcc (11.0.0 snapshot) and Visual C++ (2019). Both of these compilers accept the test case when compiled with the command lines shown. Repeating the experiment will require substituting the replacement characters in the string literal with the indicated Shift JIS double byte sequence (or using the attached file if it survives transmission).

$ cat t.cpp

constexpr char sx8795[] = "��"; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932 0x81e3
static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932 0x81e3
static_assert((unsigned char)sx8795[2] == 0);
constexpr char sx81e3[] = "��"; // 0x81 0xe3 => CP932 0x81e3 == U+221A (Square Root)
static_assert((unsigned char)sx81e3[0] == 0x81); // Preserved
static_assert((unsigned char)sx81e3[1] == 0xe3); // Preserved
static_assert((unsigned char)sx81e3[2] == 0);

$ g++ -c -finput-charset=cp932 -fexec-charset=cp932 -std=c++17 t.cpp

<no errors>

$ cl /c /std:c++17 /source-charset:.932 /execution-charset:.932 t.cpp

<no errors>

Note that both compilers converted the source 0x8795 double byte sequence to 0x81e3; the original source bytes were not preserved.

However, both compilers fail the test case if character set conversions are not specified:

$ g++ -c -std=c++17 t.cpp
t.cpp:2:40: error: static assertion failed
    2 | static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932 0x81e3
      |               ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
t.cpp:3:40: error: static assertion failed
    3 | static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932 0x81e3
      |               ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~

$ cl /nologo /diagnostics:caret /c /std:c++17 t.cpp
t.cpp
t.cpp(2,40): error C2607: static assertion failed
static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932 0x81e3
                                       ^
t.cpp(3,40): error C2607: static assertion failed
static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932 0x81e3
                                       ^

Note that these double byte sequences are not valid UTF-8 sequences, so the pass-the-bytes mode exhibited by gcc is not an artifact of normal UTF-8 handling. The sequences are valid for Windows-1252 (which is the default encoding used by Visual C++ on the system I tested on), so this test is not indicative of a pass-the-bytes mode for Visual C++.
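To make the "not valid UTF-8" point concrete, the following is a small sketch that classifies the individual bytes of the two sequences according to the UTF-8 byte ranges. The helper names (is_utf8_continuation, can_start_utf8_sequence) are made up for this illustration and are not taken from either compiler.

```cpp
#include <cassert>

// A UTF-8 continuation byte has the bit pattern 10xxxxxx (0x80..0xBF).
constexpr bool is_utf8_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// A well-formed UTF-8 sequence starts with an ASCII byte (0x00..0x7F)
// or a lead byte in the range 0xC2..0xF4.
constexpr bool can_start_utf8_sequence(unsigned char b) {
    return b <= 0x7F || (b >= 0xC2 && b <= 0xF4);
}

// 0x87 0x95: the first byte is a continuation byte with no lead byte before it.
static_assert(is_utf8_continuation(0x87) && !can_start_utf8_sequence(0x87));
static_assert(is_utf8_continuation(0x95));

// 0x81 0xe3: again, the first byte is a stray continuation byte.
static_assert(is_utf8_continuation(0x81) && !can_start_utf8_sequence(0x81));
// 0xe3 could begin a three byte sequence, but only if two continuation bytes follow.
static_assert(can_start_utf8_sequence(0xE3));
```

Both sequences therefore begin with a byte that can never open a UTF-8 sequence, so a decoder that checks well-formedness has to reject them at the first byte.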

When gcc assumes the source is UTF-8, iconv is not called and no check or conversion is performed. I believe that's what you are seeing here.

Gcc exhibits the same behavior whether the source encoding is assumed to be UTF-8 or explicitly indicated as such. I think the more complicated answer is that gcc doesn't use iconv for string literal contents when the source and execution character sets are the same (or it ignores conversion errors in that case). Errors are (necessarily) issued when the source and execution character sets differ:

$ g++ -c -std=c++17 -finput-charset=utf-8 -fexec-charset=cp932 t.cpp
t.cpp:1:27: error: converting to execution character set: Invalid or incomplete multibyte or wide character
    1 | constexpr char sx8795[] = "��"; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
      |                           ^~~~
...
t.cpp:5:27: error: converting to execution character set: Invalid or incomplete multibyte or wide character
    5 | constexpr char sx81e3[] = "��"; // 0x81 0xe3 => CP932 0x81e3 == U+221A (Square Root)
      |                           ^~~~
...

This example, as well as the preceding and following quoted ones, is an example of mojibake and is used in this context to illustrate behavior with ill-formed UTF-8 input. I failed to point that out previously.

Tom.
Finally, it is worth noting that explicitly treating the source as UTF-8 does not cause any additional errors for gcc, but does for Visual C++ (Gcc does not diagnose the ill-formed UTF-8 sequences in the string literals, but Visual C++ does).

$ g++ -c -finput-charset=utf-8 -std=c++17 t.cpp
t.cpp:2:40: error: static assertion failed
    2 | static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932 0x81e3
      |               ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
t.cpp:3:40: error: static assertion failed
    3 | static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932 0x81e3
      |               ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~

$ cl /nologo /diagnostics:caret /c /std:c++17 /utf-8 t.cpp
t.cpp
t.cpp(1,1): warning C4828: The file contains a character starting at offset 0x1b that is illegal in the current source character set (codepage 65001).
constexpr char sx8795[] = ""; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
^
t.cpp(1,1): warning C4828: The file contains a character starting at offset 0x1c that is illegal in the current source character set (codepage 65001).
constexpr char sx8795[] = ""; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
^
t.cpp(1,1): warning C4828: The file contains a character starting at offset 0x13e that is illegal in the current source character set (codepage 65001).
constexpr char sx8795[] = ""; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
^
t.cpp(1,1): warning C4828: The file contains a character starting at offset 0x13f that is illegal in the current source character set (codepage 65001).
constexpr char sx8795[] = ""; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
^
t.cpp(2,40): error C2607: static assertion failed
static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932 0x81e3
                                       ^
t.cpp(3,40): error C2607: static assertion failed
static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932 0x81e3
                                       ^
t.cpp(5,27): error C2001: newline in constant
constexpr char sx81e3[] = ""; // 0x81 0xe3 => CP932 0x81e3 == U+221A (Square Root)
                          ^
t.cpp(6,1): error C2143: syntax error: missing ';' before 'static_assert'
static_assert((unsigned char)sx81e3[0] == 0x81); // Preserved
^
t.cpp(8,40): error C2607: static assertion failed
static_assert((unsigned char)sx81e3[2] == 0);
                                       ^

Tom.

On 9/9/20 11:40 AM, Tom Honermann via SG16 wrote:
On 8/24/20 8:31 AM, Peter Brett via SG16 wrote:
Hi all,

In this week's meeting, we are going to discuss the remaining
proposals from P2178R1 "Misc lexing and string handling improvements".
In particular, we will discuss proposal 9:

    Proposal 9: Reaffirming Unicode as the character set of the
    internal representation

In anticipation of a lively discussion, Corentin and I have written a
short new paper which will be appearing in the September mailing.

    P2194R0 The character set of C++ source code is Unicode
    https://isocpp.org/files/papers/P2194R0.pdf
In preparation for this discussion, please also (re-)read section 5.2.1 of the C99 Rationale document; in particular the "UCN Models" section on pages 20 and 21.

Tom.
We hope that the study group finds this contribution helpful and
informative.

Best regards,

                       Peter