On Jan 24, 2022, at 5:04 AM, Corentin Jabot via SG16 <sg16@lists.isocpp.org> wrote:


The paper makes an excellent case for

unsigned char foo [] = u8""; and char foo [] = u8"";

However, the paper gives no justification for signed char. UTF-8 code units are numbers in the range [0, 255]. As such, the paper is proposing a conversion from unsigned char to char and does not specify how these signed char elements should be initialized in cases of overflow.

An important point was that u8 literals would always be valid, because all code points in the sequence would have a representation in the storage. And this is the case in phase 5.
Neither
  • making the program ill-formed if there is an overflow (in effect only allowing ASCII), nor
  • just copying the bits over, so that there are now negative UTF-8 code units,
seems a satisfactory solution.
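
To make the two options concrete, here is a minimal sketch. The helper names fits_in_signed_char, copy_bits and as_code_unit are hypothetical and do not come from the paper, and the sketch assumes an 8-bit, two's complement signed char (which C++20 guarantees for the representation; before C++20 the converted value is implementation-defined).

```cpp
#include <climits>

// Hypothetical helper: is this UTF-8 code unit value representable in
// signed char? Only values up to SCHAR_MAX (the ASCII range when
// CHAR_BIT == 8) qualify, which is what the "ill-formed on overflow"
// option would amount to.
constexpr bool fits_in_signed_char(unsigned char code_unit) {
    return code_unit <= SCHAR_MAX;
}

// Hypothetical helper: the "just copy the bits over" option. Code units
// of 0x80 and above come out negative.
constexpr signed char copy_bits(unsigned char code_unit) {
    return static_cast<signed char>(code_unit);
}

// Hypothetical helper: read the element back as an unsigned code unit.
constexpr unsigned char as_code_unit(signed char element) {
    return static_cast<unsigned char>(element);
}
```

For example, 0xC3 (the lead byte of the UTF-8 encoding of U+00E9) does not fit, and copying its bits yields a negative element that only round-trips through an explicit cast back to unsigned char.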

And while it is easy to find many examples to motivate the paper in general, the case for signed char isn't motivated *at all*.

Thank you for pointing this out; we should update the paper to discuss this.

The allowance for signed char was added at my suggestion and was motivated by compatibility with C and consistency with ordinary string literals.

Since char might be a signed type, the concerns regarding overflow and negative code units apply there as well. These concerns also apply to ordinary string literals when the ordinary literal encoding is UTF-8.

Both C and C++ allow arrays of signed char to be initialized with an ordinary string literal. I think continuing to allow the same for UTF-8 string literals (assuming acceptance as a C++20 DR) makes sense, just to avoid a (minor) compatibility issue.
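
As an illustration of the existing behavior referred to here, the following sketch initializes a signed char array from an ordinary string literal. The function names are hypothetical, and hex escapes are used so that the bytes do not depend on the source or literal encoding; the sketch assumes an 8-bit signed char.

```cpp
// Hypothetical example: an ordinary string literal initializing an array
// of signed char, which both C and C++ already accept. The two hex
// escapes spell the UTF-8 encoding of U+00E9.
bool first_element_is_negative() {
    signed char text[] = "\xC3\xA9";
    return text[0] < 0;
}

// Reading the same element back as an unsigned code unit.
unsigned first_code_unit() {
    signed char text[] = "\xC3\xA9";
    return static_cast<unsigned char>(text[0]);
}

// The array holds two code units plus the terminating null.
unsigned long element_count() {
    signed char text[] = "\xC3\xA9";
    return sizeof(text);
}
```

So the negative code unit concern already exists for ordinary literals stored in signed char arrays today.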

This is also consistent with initialization of signed char with a UTF-8 character literal; that is well-formed thanks to implicit integer conversions. 
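
A small sketch of that scalar case, with a hypothetical function name: a UTF-8 character literal has type char in C++17 and char8_t in C++20, and in both cases it converts implicitly to signed char.

```cpp
// Hypothetical example: initializing a signed char from a UTF-8
// character literal relies only on an implicit integer conversion.
signed char from_utf8_char_literal() {
    signed char c = u8'a';  // char (C++17) or char8_t (C++20) to signed char
    return c;
}
```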

I think use of signed char for text is quite rare, so I don’t have strongly held opinions on this. 

Tom. 


I would very much be in favor of the paper if the wording was changed to

>  Additionally, an array of char or unsigned char may be initialized by a UTF-8 string literal, or by such a string literal enclosed in braces

Thanks,

Corentin




On Sat, Jan 22, 2022 at 10:31 PM Jens Maurer via SG16 <sg16@lists.isocpp.org> wrote:
Hi,

Here are my comments:

 - Section 3.2, fifth to last word should not have an apostrophe.

 - The title promises "fixes", but I can see only a single fix in the
wording: Allow initialization of an ordinary character array with
a UTF-8 string literal.  Where are the several fixes?

 - Wording:

"Additionally, an array of ordinary character type may be initialized by a UTF-8
string literal, or by a char8_t-typed string-literal enclosed in braces."

I agree that "may" (giving permission) is the better verb here compared to
"can" in the preceding, existing text.

However, we discuss here "UTF-8 string literal", and a few words later we talk
about a "char8_t-typed string-literal".  Is there any intended difference between
these?   If so, I need help in seeing the difference.  If not, just say
", or by such a string literal enclosed in braces."

Jens
--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16