Re: [SG16] P2194R0 The character set of C++ source code is Unicode

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 9 Sep 2020 17:39:21 -0400
On 9/9/20 2:54 PM, Corentin wrote:
>
>
> On Wed, 9 Sep 2020 at 20:39, Tom Honermann <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
>
> On 9/9/20 1:26 PM, Corentin wrote:
>>
>>
>> On Wed, Sep 9, 2020, 18:42 Tom Honermann <tom_at_[hidden]
>> <mailto:tom_at_[hidden]>> wrote:
>>
>> I conducted an experiment today that I've been meaning to do
>> for a while now and that is relevant for this paper (and
>> perhaps a worthwhile addition to the paper).
>>
>>
>> Thanks Tom, this is very interesting.
>> But it is only marginally relevant to the paper, which, if I
>> remember correctly, goes into some detail about round-tripping.
>> Round-tripping is not something the paper tries to either prevent
>> or mandate.
>> We only seek to mandate that there exists a transformation from
>> source to Unicode, which neither implies nor requires that the
>> transformation be reversible.
>> Support (or lack thereof) for round-tripping is possible within
>> these constraints and is never observable from the program. And as
>> you observed, it doesn't match existing practice, except on some
>> EDG-derived compilers.
> Discussion is not limited to what is proposed in the paper and is
> encouraged in order to probe suitability of a proposal within the
> full complexity of the C++ ecosystem. This experiment and other
> discussion on the mailing list is intended to probe the full
> solution space. The goal of such questioning is to identify
> solutions that increase consensus.
>
>
> Are you proposing that a specific round-tripping behavior for Shift
> JIS should be mandated?

No, this was intended to inform discussion, specifically as a data
point for evaluating conformance and consistency among the standard,
existing implementations, and proposals.

Tom.

>>
>>
>>
>> Microsoft code page 932 (Microsoft's Shift JIS variant)
>> defines a number of code points that do not round trip
>> through Unicode due to having duplicate code point
>> assignments that are (quite reasonably) not duplicated in
>> Unicode. One of them is:
>>
>> 0x8795 -> U+221a -> 0x81e3 Square Root
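
The mapping above can be sketched in C++. This is only an illustration, not how either compiler is implemented: the two-entry table and the names Cp932Entry, to_unicode, to_cp932, and round_trip are hypothetical, and the sketch assumes the reverse mapping selects 0x81E3, as observed in this experiment. Like t.cpp, it has no main and is meant to be compiled with -c (for example, g++ -c -std=c++17).

```cpp
#include <cstdint>

// Hypothetical two-entry excerpt of a CP932 <-> Unicode table.
// A real conversion table has thousands of entries.
struct Cp932Entry {
    std::uint16_t cp932;   // double byte CP932 code
    char32_t      unicode; // Unicode code point it maps to
};

constexpr Cp932Entry kTable[] = {
    {0x81E3, U'\u221A'}, // assumed preferred encoding of SQUARE ROOT
    {0x8795, U'\u221A'}, // duplicate assignment for the same character
};

// CP932 -> Unicode: both duplicates map to the same code point.
constexpr char32_t to_unicode(std::uint16_t cp932) {
    for (const auto& e : kTable)
        if (e.cp932 == cp932) return e.unicode;
    return 0xFFFD; // REPLACEMENT CHARACTER for codes not in this excerpt
}

// Unicode -> CP932: the first matching entry wins, so only one of the
// duplicates can ever be produced.
constexpr std::uint16_t to_cp932(char32_t c) {
    for (const auto& e : kTable)
        if (e.unicode == c) return e.cp932;
    return 0; // not representable in this excerpt
}

constexpr std::uint16_t round_trip(std::uint16_t cp932) {
    return to_cp932(to_unicode(cp932));
}

static_assert(to_unicode(0x8795) == to_unicode(0x81E3),
              "both codes denote U+221A");
static_assert(round_trip(0x81E3) == 0x81E3,
              "0x81E3 survives the round trip");
static_assert(round_trip(0x8795) == 0x81E3,
              "0x8795 comes back as 0x81E3");
```

Once two source codes collapse onto one Unicode code point, no reverse mapping can recover which of them was originally written.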
>>
>> The following test case demonstrates existing behavior as
>> exhibited by gcc (11.0.0 snapshot) and Visual C++ (2019).
>> Both of these compilers accept the test case when compiled
>> with the command lines shown. Repeating the experiment will
>> require substituting the replacement characters in the string
>> literal with the indicated Shift JIS double byte sequence (or
>> using the attached file if it survives transmission).
>>
>> $ cat t.cpp
>> constexpr char sx8795[] = "��"; // 0x87 0x95 => CP932 0x8795 == U+221A (Square Root)
>> static_assert((unsigned char)sx8795[0] == 0x81); // Converted to CP932 0x81e3
>> static_assert((unsigned char)sx8795[1] == 0xe3); // Converted to CP932 0x81e3
>> static_assert((unsigned char)sx8795[2] == 0);
>> constexpr char sx81e3[] = "��"; // 0x81 0xe3 => CP932 0x81e3 == U+221A (Square Root)
>> static_assert((unsigned char)sx81e3[0] == 0x81); // Preserved
>> static_assert((unsigned char)sx81e3[1] == 0xe3); // Preserved
>> static_assert((unsigned char)sx81e3[2] == 0);
>>
>> $ g++ -c -finput-charset=cp932 -fexec-charset=cp932 -std=c++17 t.cpp
>> <no errors>
>>
>> $ cl /c /std:c++17 /source-charset:.932 /execution-charset:.932 t.cpp
>> <no errors>
>>
>> Note that both compilers converted the source 0x8795 double
>> byte sequence to 0x81e3; the original source bytes were not
>> preserved.
>>
>> However, both compilers fail the test case if character set
>> conversions are not specified:
>>
>> $ g++ -c -std=c++17 t.cpp
>> t.cpp:2:40: error: static assertion failed
>> 2 | static_assert((unsigned char)sx8795[0] == 0x81);
>> // Converted to CP932 0x81e3
>> | ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
>> t.cpp:3:40: error: static assertion failed
>> 3 | static_assert((unsigned char)sx8795[1] == 0xe3);
>> // Converted to CP932 0x81e3
>> | ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
>>
>> $ cl /nologo /diagnostics:caret /c /std:c++17 t.cpp
>> t.cpp
>> t.cpp(2,40): error C2607: static assertion failed
>> static_assert((unsigned char)sx8795[0] == 0x81); //
>> Converted to CP932 0x81e3
>> ^
>> t.cpp(3,40): error C2607: static assertion failed
>> static_assert((unsigned char)sx8795[1] == 0xe3); //
>> Converted to CP932 0x81e3
>> ^
>>
>> Note that these double byte sequences are not valid UTF-8
>> sequences, so the pass-the-bytes mode exhibited by gcc is not
>> an artifact of normal UTF-8 handling. The sequences are
>> valid for Windows-1252 (which is the default encoding used by
>> Visual C++ on the system I tested on), so this test is not
>> indicative of a pass-the-bytes mode for Visual C++.
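
The claim that these bytes are not valid UTF-8 can be checked from the UTF-8 byte ranges alone. The sketch below is illustrative; Utf8Role and utf8_role are hypothetical names, not part of any compiler. Bytes 0x80 through 0xBF can only appear as continuation bytes, so a sequence that starts with 0x87 or 0x81 is ill-formed, and the 0xE3 lead byte in the second literal is followed by the closing quote (0x22) rather than the two continuation bytes it requires.

```cpp
// Role a single byte can play in well-formed UTF-8.
enum class Utf8Role { Ascii, Continuation, Lead2, Lead3, Lead4, Invalid };

constexpr Utf8Role utf8_role(unsigned char b) {
    return b <= 0x7F ? Utf8Role::Ascii
         : b <= 0xBF ? Utf8Role::Continuation // 0x80..0xBF never start a sequence
         : b <= 0xC1 ? Utf8Role::Invalid      // 0xC0, 0xC1 would only encode overlong forms
         : b <= 0xDF ? Utf8Role::Lead2
         : b <= 0xEF ? Utf8Role::Lead3
         : b <= 0xF4 ? Utf8Role::Lead4
         :             Utf8Role::Invalid;     // 0xF5..0xFF are never used
}

// First literal: 0x87 0x95 is two stray continuation bytes.
static_assert(utf8_role(0x87) == Utf8Role::Continuation, "0x87 cannot start a sequence");
static_assert(utf8_role(0x95) == Utf8Role::Continuation, "0x95 cannot start a sequence");

// Second literal: 0x81 is a stray continuation byte, and 0xE3 starts a
// three byte sequence that the following '"' (0x22) cannot continue.
static_assert(utf8_role(0x81) == Utf8Role::Continuation, "0x81 cannot start a sequence");
static_assert(utf8_role(0xE3) == Utf8Role::Lead3, "0xE3 needs two continuation bytes");
static_assert(utf8_role(0x22) == Utf8Role::Ascii, "the closing quote is not a continuation byte");
```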
>>
>>
>> When gcc assumes the source is UTF-8, iconv is not called and no
>> check or conversion is performed; I believe that's what you
>> are seeing here.
>
> Gcc exhibits the same behavior whether the source encoding is
> assumed to be UTF-8 or explicitly indicated as such. I think the
> more complicated answer is that gcc doesn't use iconv for string
> literal contents when the source and execution character sets are
> the same (or it ignores conversion errors in that case). Errors
> are (necessarily) issued when the source and execution character
> sets differ:
>
> $ g++ -c -std=c++17 -finput-charset=utf-8 -fexec-charset=cp932 t.cpp
> t.cpp:1:27: error: converting to execution character set:
> Invalid or incomplete multibyte or wide character
> 1 | constexpr char sx8795[] = "��"; // 0x87 0x95 => CP932
> 0x8795 == U+221A (Square Root)
> | ^~~~
>
> ...
>
> t.cpp:5:27: error: converting to execution character set:
> Invalid or incomplete multibyte or wide character
> 5 | constexpr char sx81e3[] = "��"; // 0x81 0xe3 => CP932
> 0x81e3 == U+221A (Square Root)
> | ^~~~
> ...
>
> This example, like the quoted ones before and after it, is an
> example of mojibake and is used in this context to illustrate
> behavior with ill-formed UTF-8 input. I failed to point that out
> previously.
>
> Tom.
>
>>
>>
>> Finally, it is worth noting that explicitly treating the
>> source as UTF-8 does not cause any additional errors for gcc,
>> but does for Visual C++ (Gcc does not diagnose the ill-formed
>> UTF-8 sequences in the string literals, but Visual C++ does).
>>
>> $ g++ -c -finput-charset=utf-8 -std=c++17 t.cpp
>> t.cpp:2:40: error: static assertion failed
>> 2 | static_assert((unsigned char)sx8795[0] == 0x81);
>> // Converted to CP932 0x81e3
>> | ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
>> t.cpp:3:40: error: static assertion failed
>> 3 | static_assert((unsigned char)sx8795[1] == 0xe3);
>> // Converted to CP932 0x81e3
>> | ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
>>
>> $ cl /nologo /diagnostics:caret /c /std:c++17 /utf-8 t.cpp
>> t.cpp
>> t.cpp(1,1): warning C4828: The file contains a character
>> starting at offset 0x1b that is illegal in the current
>> source character set (codepage 65001).
>> constexpr char sx8795[] = ""; // 0x87 0x95 => CP932
>> 0x8795 == U+221A (Square Root)
>> ^
>> t.cpp(1,1): warning C4828: The file contains a character
>> starting at offset 0x1c that is illegal in the current
>> source character set (codepage 65001).
>> constexpr char sx8795[] = ""; // 0x87 0x95 => CP932
>> 0x8795 == U+221A (Square Root)
>> ^
>> t.cpp(1,1): warning C4828: The file contains a character
>> starting at offset 0x13e that is illegal in the current
>> source character set (codepage 65001).
>> constexpr char sx8795[] = ""; // 0x87 0x95 => CP932
>> 0x8795 == U+221A (Square Root)
>> ^
>> t.cpp(1,1): warning C4828: The file contains a character
>> starting at offset 0x13f that is illegal in the current
>> source character set (codepage 65001).
>> constexpr char sx8795[] = ""; // 0x87 0x95 => CP932
>> 0x8795 == U+221A (Square Root)
>> ^
>> t.cpp(2,40): error C2607: static assertion failed
>> static_assert((unsigned char)sx8795[0] == 0x81); //
>> Converted to CP932 0x81e3
>> ^
>> t.cpp(3,40): error C2607: static assertion failed
>> static_assert((unsigned char)sx8795[1] == 0xe3); //
>> Converted to CP932 0x81e3
>> ^
>> t.cpp(5,27): error C2001: newline in constant
>> constexpr char sx81e3[] = ""; // 0x81 0xe3 => CP932
>> 0x81e3 == U+221A (Square Root)
>> ^
>> t.cpp(6,1): error C2143: syntax error: missing ';' before
>> 'static_assert'
>> static_assert((unsigned char)sx81e3[0] == 0x81); // Preserved
>> ^
>> t.cpp(8,40): error C2607: static assertion failed
>> static_assert((unsigned char)sx81e3[2] == 0);
>> ^
>>
>> Tom.
>>
>> On 9/9/20 11:40 AM, Tom Honermann via SG16 wrote:
>>> On 8/24/20 8:31 AM, Peter Brett via SG16 wrote:
>>>> Hi all,
>>>>
>>>> In this week's meeting, we are going to discuss the remaining
>>>> proposals from P2178R1 "Misc lexing and string handling improvements".
>>>> In particular, we will discuss proposal 9:
>>>>
>>>> Proposal 9: Reaffirming Unicode as the character set of the
>>>> internal representation
>>>>
>>>> In anticipation of a lively discussion, Corentin and I have written a
>>>> short new paper which will be appearing in the September mailing.
>>>>
>>>> P2194R0 The character set of C++ source code is Unicode
>>>> https://isocpp.org/files/papers/P2194R0.pdf
>>>
>>> In preparation for this discussion, please also (re-)read
>>> section 5.2.1 of the C99 Rationale document
>>> <http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf>;
>>> in particular the "UCN Models" section on pages 20 and 21.
>>>
>>> Tom.
>>>
>>>> We hope that the study group finds this contribution helpful and
>>>> informative.
>>>>
>>>> Best regards,
>>>>
>>>> Peter
>>>>
>>>
>>>
>>
>


Received on 2020-09-09 16:42:53