sg16: Re: [SG16] P2194R0 The character set of C++ source code is Unicode

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 9 Sep 2020 14:39:00 -0400

On 9/9/20 1:26 PM, Corentin wrote:
>
>
> On Wed, Sep 9, 2020, 18:42 Tom Honermann <tom_at_[hidden]> wrote:
>
> I conducted an experiment today that I've been meaning to do for a
> while now and that is relevant for this paper (and perhaps a
> worthwhile addition to the paper).
>
>
> Thanks Tom, this is very interesting.
> But it is only marginally relevant to the paper which, if I remember
> correctly, goes into some detail about round-tripping.
> Round-tripping is not something the paper is trying to either prevent
> or mandate. We only seek to mandate that there exists a transformation
> from source to Unicode, which neither implies nor requires that the
> transformation be reversible.
> Supporting round-tripping (or not) is possible within these
> constraints and is never observable from the program. And as you
> observed, it doesn't match existing practice, except on some
> EDG-derived compilers.
Discussion is not limited to what the paper proposes; it is encouraged
as a way to probe the suitability of a proposal within the full
complexity of the C++ ecosystem. This experiment and the other discussion
on the mailing list are intended to explore the full solution space. The
goal of such questioning is to identify solutions that increase consensus.
>
>
>
> Microsoft code page 932 (Microsoft's Shift JIS variant) defines a
> number of code points that do not round trip through Unicode due
> to having duplicate code point assignments that are (quite
> reasonably) not duplicated in Unicode. One of them is:
>
> 0x8795 -> U+221a -> 0x81e3 Square Root
>
> The following test case demonstrates existing behavior as
> exhibited by gcc (11.0.0 snapshot) and Visual C++ (2019). Both of
> these compilers accept the test case when compiled with the
> command lines shown. Repeating the experiment will require
> substituting the replacement characters in the string literal with
> the indicated Shift JIS double byte sequence (or using the
> attached file if it survives transmission).
>
> $ cat t.cpp
> constexpr char sx8795[] = "��"; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
> static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932 0x81e3
> static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932 0x81e3
> static_assert((unsigned char)sx8795[2] == 0);
> constexpr char sx81e3[] = "��"; // 0x81 0xe3 => CP932 0x81e3 == U+221A (Square Root)
> static_assert((unsigned char)sx81e3[0] == 0x81); // Preserved
> static_assert((unsigned char)sx81e3[1] == 0xe3); // Preserved
> static_assert((unsigned char)sx81e3[2] == 0);
>
> $ g++ -c -finput-charset=cp932 -fexec-charset=cp932 -std=c++17 t.cpp
> <no errors>
>
> $ cl /c /std:c++17 /source-charset:.932 /execution-charset:.932 t.cpp
> <no errors>
>
> Note that both compilers converted the source 0x8795 double byte
> sequence to 0x81e3; the original source bytes were not preserved.
>
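
For readers following along, the conversion noted above can be modeled without any compiler charset machinery. This is only a minimal sketch, not how gcc or Visual C++ implement conversion: the function names are hypothetical, and the mapping covers just the single CP932 pair discussed in this thread.

```cpp
#include <cstdint>

// Hypothetical two-entry excerpt of the CP932 -> Unicode mapping.
// Both double-byte codes map to U+221A (Square Root).
constexpr char32_t cp932_to_unicode(std::uint16_t dbcs) {
    switch (dbcs) {
        case 0x8795: return 0x221A;  // duplicate assignment
        case 0x81E3: return 0x221A;  // the code the reverse mapping selects
        default:     return 0xFFFD;  // everything else is not modeled here
    }
}

// Reverse direction: Unicode has a single Square Root code point, so only
// one of the two CP932 codes can be produced.
constexpr std::uint16_t unicode_to_cp932(char32_t cp) {
    return cp == 0x221A ? 0x81E3 : 0x0000;  // 0 means "not modeled"
}

// Source bytes -> Unicode -> execution bytes.
constexpr std::uint16_t round_trip(std::uint16_t dbcs) {
    return unicode_to_cp932(cp932_to_unicode(dbcs));
}

static_assert(round_trip(0x81E3) == 0x81E3);  // preserved
static_assert(round_trip(0x8795) == 0x81E3);  // converted, not preserved
static_assert(round_trip(0x8795) != 0x8795);  // hence no round trip
```

Like t.cpp, this is meant to be compiled with `-c -std=c++17`; the static assertions are the whole test.
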
> However, both compilers fail the test case if character set
> conversions are not specified:
>
> $ g++ -c -std=c++17 t.cpp
> t.cpp:2:40: error: static assertion failed
> 2 | static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932 0x81e3
> | ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
> t.cpp:3:40: error: static assertion failed
> 3 | static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932 0x81e3
> | ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
>
> $ cl /nologo /diagnostics:caret /c /std:c++17 t.cpp
> t.cpp
> t.cpp(2,40): error C2607: static assertion failed
> static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932 0x81e3
> ^
> t.cpp(3,40): error C2607: static assertion failed
> static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932 0x81e3
> ^
>
> Note that these double byte sequences are not valid UTF-8
> sequences, so the pass-the-bytes mode exhibited by gcc is not an
> artifact of normal UTF-8 handling. The sequences are valid for
> Windows-1252 (which is the default encoding used by Visual C++ on
> the system I tested on), so this test is not indicative of a
> pass-the-bytes mode for Visual C++.
>
>
> When gcc assumes the source is UTF-8, iconv is not called and no check
> or conversion is performed; I believe that's what you are seeing here.

Gcc exhibits the same behavior whether the source encoding is assumed to
be UTF-8 or explicitly indicated as such. I think the more complicated
answer is that gcc doesn't use iconv for string literal contents when
the source and execution character sets are the same (or it ignores
conversion errors in that case). Errors are (necessarily) issued when
the source and execution character sets differ:

    $ g++ -c -std=c++17 -finput-charset=utf-8 -fexec-charset=cp932 t.cpp
    t.cpp:1:27: error: converting to execution character set: Invalid or incomplete multibyte or wide character
         1 | constexpr char sx8795[] = "��"; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
           |                           ^~~~

    ...

    t.cpp:5:27: error: converting to execution character set: Invalid or incomplete multibyte or wide character
         5 | constexpr char sx81e3[] = "��"; // 0x81 0xe3 => CP932 0x81e3 == U+221A (Square Root)
           |                           ^~~~
    ...

This example, like the quoted ones that precede and follow it, is an
instance of mojibake; all of them are used here to illustrate behavior
with ill-formed UTF-8 input. I failed to point that out previously.
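
To make "ill-formed" concrete, here is a minimal sketch of a UTF-8 well-formedness check that follows the byte ranges in Table 3-7 of the Unicode Standard. The function and array names are hypothetical, and this is not the check that gcc or Visual C++ actually perform; it only illustrates why neither 0x87 0x95 nor 0x81 0xe3 can be decoded as UTF-8, since both begin with a continuation byte.

```cpp
#include <cstddef>

// True if s[0..n) is well-formed UTF-8 (Unicode Standard, Table 3-7).
constexpr bool is_well_formed_utf8(const unsigned char* s, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = s[i];
        if (b <= 0x7F) { ++i; continue; }    // ASCII, single byte
        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the second byte
        if      (b >= 0xC2 && b <= 0xDF) { len = 2; }
        else if (b == 0xE0)              { len = 3; lo = 0xA0; }  // no overlongs
        else if (b >= 0xE1 && b <= 0xEC) { len = 3; }
        else if (b == 0xED)              { len = 3; hi = 0x9F; }  // no surrogates
        else if (b >= 0xEE && b <= 0xEF) { len = 3; }
        else if (b == 0xF0)              { len = 4; lo = 0x90; }  // no overlongs
        else if (b >= 0xF1 && b <= 0xF3) { len = 4; }
        else if (b == 0xF4)              { len = 4; hi = 0x8F; }  // <= U+10FFFF
        else return false;  // 0x80..0xC1 and 0xF5..0xFF cannot start a sequence
        if (i + len > n) return false;       // truncated sequence
        if (s[i + 1] < lo || s[i + 1] > hi) return false;
        for (std::size_t k = 2; k < len; ++k)
            if (s[i + k] < 0x80 || s[i + k] > 0xBF) return false;
        i += len;
    }
    return true;
}

constexpr unsigned char bytes_8795[] = {0x87, 0x95};        // CP932 0x8795
constexpr unsigned char bytes_81e3[] = {0x81, 0xE3};        // CP932 0x81e3
constexpr unsigned char sqrt_utf8[]  = {0xE2, 0x88, 0x9A};  // U+221A in UTF-8

static_assert(!is_well_formed_utf8(bytes_8795, 2));
static_assert(!is_well_formed_utf8(bytes_81e3, 2));
static_assert(is_well_formed_utf8(sqrt_utf8, 3));
```
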

Tom.

>
>
> Finally, it is worth noting that explicitly treating the source as
> UTF-8 does not cause any additional errors for gcc, but does for
> Visual C++ (Gcc does not diagnose the ill-formed UTF-8 sequences
> in the string literals, but Visual C++ does).
>
> $ g++ -c -finput-charset=utf-8 -std=c++17 t.cpp
> t.cpp:2:40: error: static assertion failed
> 2 | static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932 0x81e3
> | ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
> t.cpp:3:40: error: static assertion failed
> 3 | static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932 0x81e3
> | ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
>
> $ cl /nologo /diagnostics:caret /c /std:c++17 /utf-8 t.cpp
> t.cpp
> t.cpp(1,1): warning C4828: The file contains a character starting at offset 0x1b that is illegal in the current source character set (codepage 65001).
> constexpr char sx8795[] = ""; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
> ^
> t.cpp(1,1): warning C4828: The file contains a character starting at offset 0x1c that is illegal in the current source character set (codepage 65001).
> constexpr char sx8795[] = ""; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
> ^
> t.cpp(1,1): warning C4828: The file contains a character starting at offset 0x13e that is illegal in the current source character set (codepage 65001).
> constexpr char sx8795[] = ""; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
> ^
> t.cpp(1,1): warning C4828: The file contains a character starting at offset 0x13f that is illegal in the current source character set (codepage 65001).
> constexpr char sx8795[] = ""; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
> ^
> t.cpp(2,40): error C2607: static assertion failed
> static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932 0x81e3
> ^
> t.cpp(3,40): error C2607: static assertion failed
> static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932 0x81e3
> ^
> t.cpp(5,27): error C2001: newline in constant
> constexpr char sx81e3[] = ""; // 0x81 0xe3 => CP932 0x81e3 == U+221A (Square Root)
> ^
> t.cpp(6,1): error C2143: syntax error: missing ';' before 'static_assert'
> static_assert((unsigned char)sx81e3[0] == 0x81); // Preserved
> ^
> t.cpp(8,40): error C2607: static assertion failed
> static_assert((unsigned char)sx81e3[2] == 0);
> ^
>
> Tom.
>
> On 9/9/20 11:40 AM, Tom Honermann via SG16 wrote:
>> On 8/24/20 8:31 AM, Peter Brett via SG16 wrote:
>>> Hi all,
>>>
>>> In this week's meeting, we are going to discuss the remaining
>>> proposals from P2178R1 "Misc lexing and string handling improvements".
>>> In particular, we will discuss proposal 9:
>>>
>>> Proposal 9: Reaffirming Unicode as the character set of the
>>> internal representation
>>>
>>> In anticipation of a lively discussion, Corentin and I have written a
>>> short new paper which will be appearing in the September mailing.
>>>
>>> P2194R0 The character set of C++ source code is Unicode
>>> https://isocpp.org/files/papers/P2194R0.pdf
>>
>> In preparation for this discussion, please also (re-)read section
>> 5.2.1 of the C99 Rationale document
>> <http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf>;
>> in particular the "UCN Models" section on pages 20 and 21.
>>
>> Tom.
>>
>>> We hope that the study group finds this contribution helpful and
>>> informative.
>>>
>>> Best regards,
>>>
>>> Peter
>>>
>>
>>
>

Received on 2020-09-09 13:42:32