The paper makes an excellent case for
unsigned char foo[] = u8""; and char foo[] = u8"";
However, the paper gives no justification for signed char. UTF-8 code units are values in the range [0, 255]. As such, the paper is proposing a conversion from unsigned values to signed char, and it does not specify how these signed char elements should be initialized when a code unit exceeds 127 and overflows.
An important property of u8 literals is that they are always valid, because every code unit in the sequence has a representation in the storage type. And this is the case in phase 5.
Neither of the possible resolutions
- making the program ill-formed on overflow (in effect only allowing ASCII), or
- copying the bits over, so that there are now negative UTF-8 code units
seems like a satisfactory solution.
And while it is easy to find many examples to motivate the paper in general, the case for signed char isn't motivated *at all*.
I would very much be in favor of the paper if the wording were changed to
> Additionally, an array of char or unsigned char may be initialized by a UTF-8 string literal, or by such a string literal enclosed in braces
Thanks,
Corentin