Re: [SG16] Are ill-formed UTF literals well-formed?

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Sat, 24 Apr 2021 22:21:53 +0200
On 24/04/2021 21.57, Tom Honermann via SG16 wrote:
> On 4/24/21 7:25 AM, Corentin via SG16 wrote:
>>
>> Hello,
>>
>> Consider the following:
>>
>> auto x = u8"\xC0";
>>
>> In [dcl.init.string], we say
>>
>> > An array of ordinary character type ([basic.fundamental]), char8_t array, char16_t array, char32_t array, or wchar_t array can be initialized by an ordinary string literal, UTF-8 string literal, UTF-16 string literal, UTF-32 string literal, or wide string literal, respectively [..]
>
> The definition of "UTF-8 string literal" used here corresponds to [lex.string]p1 table 12 <http://eel.is/c++draft/lex.string#tab:lex.string.literal-row-5> and that definition does not impose a well-formed encoding requirement. How a string literal containing an escape sequence is encoded is specified in [lex.string]p10 <http://eel.is/c++draft/lex.string#10>, and no such requirements are imposed there either. [dcl.init.string]p1 <http://eel.is/c++draft/dcl.init.string#1> states "Successive characters of the value of the string-literal initialize the elements of the array". We can quibble over the use of the word "character" there, but I think the intent is clear; the elements of the string literal are copied.
>
> I think the reference to ISO 10646 only serves to provide a definition of "UTF-8" and to describe how /basic-s-chars/, /r-chars/, /simple-escape-sequences/, and /universal-character-names/ are encoded; I don't see how it can be interpreted to apply to /numeric-escape-sequences/.

Agreed.

"Successive characters of the value of the string-literal initialize the elements of the array."

should be changed to something closer to [lex.string] p10

"String literal objects are initialized with the sequence of code unit values corresponding to the string-literal’s
sequence of s-chars (for a non-raw string literal) and r-chars (for a raw string literal) in order as follows:"

It's the express intent that a numeric-escape-sequence produces
code units, and that those might not be valid encodings. One
use-case is for testing.
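
As an illustration of the point above, here is a minimal sketch of what a conforming implementation is expected to accept. The helper names code_unit, length_with_nul and check_at_runtime are hypothetical and do not come from the standard or from this thread. The helpers are templated on the element type so the sketch should compile whether u8"" yields char (before C++20) or char8_t (C++20 and later).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: code unit i of a string literal as an unsigned value.
template <typename CharT, std::size_t N>
constexpr unsigned code_unit(const CharT (&lit)[N], std::size_t i) {
    return static_cast<unsigned char>(lit[i]);
}

// Hypothetical helper: number of array elements, including the terminating NUL.
template <typename CharT, std::size_t N>
constexpr std::size_t length_with_nul(const CharT (&)[N]) {
    return N;
}

// numeric-escape-sequence: 0xC0 never occurs in well-formed UTF-8, yet the
// literal holds exactly that code unit followed by the terminator.
static_assert(length_with_nul(u8"\xC0") == 2, "one code unit plus NUL");
static_assert(code_unit(u8"\xC0", 0) == 0xC0, "code unit stored as written");

// universal-character-name: U+00C0 is encoded as UTF-8, giving two code units.
static_assert(length_with_nul(u8"\u00C0") == 3, "two code units plus NUL");
static_assert(code_unit(u8"\u00C0", 0) == 0xC3, "lead byte");
static_assert(code_unit(u8"\u00C0", 1) == 0x80, "continuation byte");

// The same observation at run time, e.g. as test input for a UTF-8 validator.
inline void check_at_runtime() {
    const auto& ill_formed = u8"\xC0";
    assert(code_unit(ill_formed, 0) == 0xC0);
    assert(code_unit(ill_formed, 1) == 0x00);
}
```

The contrast between the two literals is the distinction drawn above: the \x escape supplies a code unit directly, while the \u escape names a character that the implementation then encodes.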

Jens

Received on 2021-04-24 15:22:00