Subject: Re: Are ill-formed UTF literals well-formed?
From: Jens Maurer (Jens.Maurer_at_[hidden])
Date: 2021-04-24 15:21:53
On 24/04/2021 21.57, Tom Honermann via SG16 wrote:
> On 4/24/21 7:25 AM, Corentin via SG16 wrote:
>> Consider the following:
>> auto x = u8"\xC0";
>> In [dcl.init.string], we say
>> > An array of ordinary character type ([basic.fundamental]), char8_t array, char16_t array, char32_t array, or wchar_t array can be initialized by an ordinary string literal, UTF-8 string literal, UTF-16 string literal, UTF-32 string literal, or wide string literal, respectively [..]
> The definition of "UTF-8 string literal" used here corresponds to [lex.string]p1 table 12 <http://eel.is/c++draft/lex.string#tab:lex.string.literal-row-5> and that definition does not impose a well-formed encoding requirement. How a string literal containing an escape sequence is encoded is specified in [lex.string]p10 <http://eel.is/c++draft/lex.string#10>, and no such requirements are imposed there either. [dcl.init.string]p1 <http://eel.is/c++draft/dcl.init.string#1> states "Successive characters of the value of the string-literal initialize the elements of the array". We can quibble over the use of the word "character" there, but I think the intent is clear; the elements of the string literal are copied.
> I think the reference to ISO 10646 only serves to provide a definition of "UTF-8" and to describe how /basic-s-chars/, /r-chars/, /simple-escape-sequences/, and /universal-character-names/ are encoded; I don't see how it can be interpreted to apply to /numeric-escape-sequences/.
"Successive characters of the value of the string-literal initialize the elements of the array."
should be changed to something closer to [lex.string] p10
"String literal objects are initialized with the sequence of code unit values corresponding to the string-literal's
sequence of s-chars (for a non-raw string literal) and r-chars (for a raw string literal) in order as follows:"
It's the express intent that a numeric-escape-sequence produces
code units, and that those might not be valid encodings. One
use-case is for testing.
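
As a minimal sketch of that behavior (the helper can_start_utf8_sequence and the name first_unit are made up for illustration and are not from the standard or the thread), the following checks at compile time that the literal from the original question holds the single code unit 0xC0 plus the terminating null, and that 0xC0 cannot begin a well-formed UTF-8 sequence. It assumes a C++17 or later compiler that accepts the literal, which is the intent described above.

```cpp
// Hypothetical helper, not part of any standard library: true if the byte can
// be the first code unit of a well-formed UTF-8 sequence. 0x80..0xBF are
// continuation bytes; 0xC0, 0xC1 and 0xF5..0xFF never occur in valid UTF-8.
constexpr bool can_start_utf8_sequence(unsigned char b) {
    return b <= 0x7F || (b >= 0xC2 && b <= 0xF4);
}

// First code unit of the literal from the original question. The cast makes
// the comparison independent of whether the element type is char (C++17)
// or char8_t (C++20).
constexpr unsigned char first_unit = static_cast<unsigned char>(u8"\xC0"[0]);

// The numeric escape yields exactly one code unit, followed by the null.
static_assert(sizeof(u8"\xC0") == 2, "one code unit plus terminating null");

// The code unit is stored as written, not transcoded or rejected.
static_assert(first_unit == 0xC0, "numeric escape produces the code unit 0xC0");

// That code unit is not a valid UTF-8 lead byte, so the literal's value is
// not well-formed UTF-8 even though the program is accepted.
static_assert(!can_start_utf8_sequence(first_unit), "0xC0 is never valid in UTF-8");
```

A test suite for a UTF-8 validator could use such literals to feed deliberately malformed input without building byte arrays by hand.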