Re: [SG16] P2194R0 The character set of C++ source code is Unicode

From: Corentin <corentin.jabot_at_[hidden]>
Date: Wed, 9 Sep 2020 20:54:27 +0200
On Wed, 9 Sep 2020 at 20:39, Tom Honermann <tom_at_[hidden]> wrote:

> On 9/9/20 1:26 PM, Corentin wrote:
>
>
>
> On Wed, Sep 9, 2020, 18:42 Tom Honermann <tom_at_[hidden]> wrote:
>
>> I conducted an experiment today that I've been meaning to do for a while
>> now and that is relevant for this paper (and perhaps a worthwhile addition
>> to the paper).
>>
>
> Thanks Tom, this is very interesting.
> But it is only marginally relevant to the paper, which, if I remember
> correctly, goes into some detail about round-tripping.
> Round-tripping is not something the paper is trying to either prevent or
> mandate.
> We only seek to mandate that there exists a transformation from source to
> Unicode, which neither implies nor requires that the transformation be
> reversible.
> Support (or lack thereof) for round-tripping is possible within these
> constraints and is never observable from the program. And as you observed,
> it doesn't match existing practice, except on some EDG-derived compilers.
>
> Discussion is not limited to what is proposed in the paper and is
> encouraged in order to probe suitability of a proposal within the full
> complexity of the C++ ecosystem. This experiment and other discussion on
> the mailing list is intended to probe the full solution space. The goal of
> such questioning is to identify solutions that increase consensus.
>

Are you proposing that a specific round-tripping behavior for Shift JIS
should be mandated?
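
To make the question concrete, here is a minimal sketch of the loss in question. It assumes a hypothetical two-entry excerpt of the CP932 mapping, built only from the 0x8795 -> U+221A -> 0x81e3 example quoted below. The function names are made up for illustration and say nothing about how any compiler implements the conversion:

```cpp
#include <cstdint>
#include <optional>

// Hypothetical excerpt of the CP932 -> Unicode mapping: both double-byte
// codes from the quoted example decode to U+221A (Square Root).
constexpr std::optional<char32_t> cp932_to_unicode(std::uint16_t dbcs) {
    switch (dbcs) {
        case 0x8795: return U'\u221A';
        case 0x81E3: return U'\u221A';
        default:     return std::nullopt;  // not covered by this excerpt
    }
}

// Hypothetical excerpt of the reverse mapping: U+221A has a single
// encoding, 0x81e3, as in the quoted example.
constexpr std::optional<std::uint16_t> unicode_to_cp932(char32_t c) {
    if (c == U'\u221A') return std::uint16_t{0x81E3};
    return std::nullopt;
}

// True if converting to Unicode and back yields the original code.
constexpr bool round_trips(std::uint16_t dbcs) {
    const auto u = cp932_to_unicode(dbcs);
    if (!u) return false;
    const auto back = unicode_to_cp932(*u);
    return back && *back == dbcs;
}

static_assert(round_trips(0x81E3));   // preserved
static_assert(!round_trips(0x8795));  // comes back as 0x81e3
```

Nothing in a source-to-Unicode transformation alone distinguishes the two inputs once they are both U+221A, which is why reversibility would have to be a separate requirement.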

>
>
>
>> Microsoft code page 932 (Microsoft's Shift JIS variant) defines a number
>> of code points that do not round trip through Unicode due to having
>> duplicate code point assignments that are (quite reasonably) not duplicated
>> in Unicode. One of them is:
>>
>> 0x8795 -> U+221a -> 0x81e3 Square Root
>>
>> The following test case demonstrates existing behavior as exhibited by
>> gcc (11.0.0 snapshot) and Visual C++ (2019). Both of these compilers
>> accept the test case when compiled with the command lines shown. Repeating
>> the experiment will require substituting the replacement characters in the
>> string literal with the indicated Shift JIS double byte sequence (or using
>> the attached file if it survives transmission).
>>
>> $ cat t.cpp
>> constexpr char sx8795[] = "��"; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
>> static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932 0x81e3
>> static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932 0x81e3
>> static_assert((unsigned char)sx8795[2] == 0);
>> constexpr char sx81e3[] = "��"; // 0x81 0xe3 => CP932 0x81e3 == U+221A (Square Root)
>> static_assert((unsigned char)sx81e3[0] == 0x81); // Preserved
>> static_assert((unsigned char)sx81e3[1] == 0xe3); // Preserved
>> static_assert((unsigned char)sx81e3[2] == 0);
>>
>> $ g++ -c -finput-charset=cp932 -fexec-charset=cp932 -std=c++17 t.cpp
>> <no errors>
>>
>> $ cl /c /std:c++17 /source-charset:.932 /execution-charset:.932 t.cpp
>> <no errors>
>>
>> Note that both compilers converted the source 0x8795 double byte sequence
>> to 0x81e3; the original source bytes were not preserved.
>>
>> However, both compilers fail the test case if character set conversions
>> are not specified:
>>
>> $ g++ -c -std=c++17 t.cpp
>> t.cpp:2:40: error: static assertion failed
>> 2 | static_assert((unsigned char)sx8795[0] == 0x81); // Converted to
>> CP932 0x81e3
>> | ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
>> t.cpp:3:40: error: static assertion failed
>> 3 | static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to
>> CP932 0x81e3
>> | ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
>>
>> $ cl /nologo /diagnostics:caret /c /std:c++17 t.cpp
>> t.cpp
>> t.cpp(2,40): error C2607: static assertion failed
>> static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932
>> 0x81e3
>> ^
>> t.cpp(3,40): error C2607: static assertion failed
>> static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932
>> 0x81e3
>> ^
>>
>> Note that these double byte sequences are not valid UTF-8 sequences, so
>> the pass-the-bytes mode exhibited by gcc is not an artifact of normal UTF-8
>> handling. The sequences are valid for Windows-1252 (which is the default
>> encoding used by Visual C++ on the system I tested on), so this test is not
>> indicative of a pass-the-bytes mode for Visual C++.
>>
>
> When gcc assumes the source is UTF-8, iconv is not called and no check or
> conversion is performed; I believe that's what you are seeing here.
>
> Gcc exhibits the same behavior whether the source encoding is assumed to
> be UTF-8 or explicitly indicated as such. I think the more complicated
> answer is that gcc doesn't use iconv for string literal contents when the
> source and execution character sets are the same (or it ignores conversion
> errors in that case). Errors are (necessarily) issued when the source and
> execution character sets differ:
>
> $ g++ -c -std=c++17 -finput-charset=utf-8 -fexec-charset=cp932 t.cpp
> t.cpp:1:27: error: converting to execution character set: Invalid or
> incomplete multibyte or wide character
> 1 | constexpr char sx8795[] = "��"; // 0x87 0x95 => CP932 0x8795 ==
> U+221A (Square Root)
> | ^~~~
>
> ...
>
> t.cpp:5:27: error: converting to execution character set: Invalid or
> incomplete multibyte or wide character
> 5 | constexpr char sx81e3[] = "��"; // 0x81 0xe3 => CP932 0x81e3 ==
> U+221A (Square Root)
> | ^~~~
> ...
>
> This example, as well as the preceding and following quoted ones, is an
> example of mojibake and is used in this context to illustrate behavior
> with ill-formed UTF-8 input. I failed to point that out previously.
>
> Tom.
>
>
>
>> Finally, it is worth noting that explicitly treating the source as UTF-8
>> does not cause any additional errors for gcc, but does for Visual C++ (Gcc
>> does not diagnose the ill-formed UTF-8 sequences in the string literals,
>> but Visual C++ does).
>>
>> $ g++ -c -finput-charset=utf-8 -std=c++17 t.cpp
>> t.cpp:2:40: error: static assertion failed
>> 2 | static_assert((unsigned char)sx8795[0] == 0x81); // Converted to
>> CP932 0x81e3
>> | ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
>> t.cpp:3:40: error: static assertion failed
>> 3 | static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to
>> CP932 0x81e3
>> | ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
>>
>> $ cl /nologo /diagnostics:caret /c /std:c++17 /utf-8 t.cpp
>> t.cpp
>> t.cpp(1,1): warning C4828: The file contains a character starting at
>> offset 0x1b that is illegal in the current source character set (codepage
>> 65001).
>> constexpr char sx8795[] = ""; // 0x87 0x95 => CP932 0x8795 == U+221A
>> (Square Root)
>> ^
>> t.cpp(1,1): warning C4828: The file contains a character starting at
>> offset 0x1c that is illegal in the current source character set (codepage
>> 65001).
>> constexpr char sx8795[] = ""; // 0x87 0x95 => CP932 0x8795 == U+221A
>> (Square Root)
>> ^
>> t.cpp(1,1): warning C4828: The file contains a character starting at
>> offset 0x13e that is illegal in the current source character set (codepage
>> 65001).
>> constexpr char sx8795[] = ""; // 0x87 0x95 => CP932 0x8795 == U+221A
>> (Square Root)
>> ^
>> t.cpp(1,1): warning C4828: The file contains a character starting at
>> offset 0x13f that is illegal in the current source character set (codepage
>> 65001).
>> constexpr char sx8795[] = ""; // 0x87 0x95 => CP932 0x8795 == U+221A
>> (Square Root)
>> ^
>> t.cpp(2,40): error C2607: static assertion failed
>> static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932
>> 0x81e3
>> ^
>> t.cpp(3,40): error C2607: static assertion failed
>> static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932
>> 0x81e3
>> ^
>> t.cpp(5,27): error C2001: newline in constant
>> constexpr char sx81e3[] = ""; // 0x81 0xe3 => CP932 0x81e3 == U+221A
>> (Square Root)
>> ^
>> t.cpp(6,1): error C2143: syntax error: missing ';' before 'static_assert'
>> static_assert((unsigned char)sx81e3[0] == 0x81); // Preserved
>> ^
>> t.cpp(8,40): error C2607: static assertion failed
>> static_assert((unsigned char)sx81e3[2] == 0);
>> ^
>>
>> Tom.
>>
>> On 9/9/20 11:40 AM, Tom Honermann via SG16 wrote:
>>
>> On 8/24/20 8:31 AM, Peter Brett via SG16 wrote:
>>
>> Hi all,
>>
>> In this week's meeting, we are going to discuss the remaining
>> proposals from P2178R1 "Misc lexing and string handling improvements".
>> In particular, we will discuss proposal 9:
>>
>> Proposal 9: Reaffirming Unicode as the character set of the
>> internal representation
>>
>> In anticipation of a lively discussion, Corentin and I have written a
>> short new paper which will be appearing in the September mailing.
>>
>> P2194R0 The character set of C++ source code is Unicode
>> https://isocpp.org/files/papers/P2194R0.pdf
>>
>> In preparation for this discussion, please also (re-)read section 5.2.1
>> of the C99 Rationale document
>> <http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf>; in
>> particular the "UCN Models" section on pages 20 and 21.
>>
>> Tom.
>>
>> We hope that the study group finds this contribution helpful and
>> informative.
>>
>> Best regards,
>>
>> Peter
>>
>>
>>
>>
>>
>>
>
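
As an aside on the quoted observation that these byte pairs are not valid UTF-8: a minimal sketch, assuming only the lead-byte ranges of well-formed UTF-8 (0x00-0x7F and 0xC2-0xF4). `can_start_utf8` is a hypothetical helper, not something gcc or Visual C++ exposes. Both 0x87 and 0x81 fall in the continuation-byte range, so neither pair can begin a well-formed sequence:

```cpp
// Returns true if `b` may be the first byte of a well-formed UTF-8
// sequence: ASCII (0x00-0x7F) or a multi-byte lead byte (0xC2-0xF4).
// Continuation bytes (0x80-0xBF) and 0xC0, 0xC1, 0xF5-0xFF are rejected.
constexpr bool can_start_utf8(unsigned char b) {
    return b <= 0x7F || (b >= 0xC2 && b <= 0xF4);
}

static_assert(!can_start_utf8(0x87));  // first byte of 0x87 0x95
static_assert(!can_start_utf8(0x81));  // first byte of 0x81 0xe3
static_assert(can_start_utf8(0xE3));   // 0xe3 alone would start a 3-byte sequence
```

A full validator would also check continuation bytes and overlong forms; the lead byte alone is enough to reject both literals.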

Received on 2020-09-09 13:58:09