
Re: [SG16-Unicode] Draft string literal issues

From: Tom Honermann <tom_at_[hidden]>
Date: Sun, 17 Mar 2019 00:42:11 -0400
On 3/16/19 1:46 PM, Steve Downey wrote:
> Study Group 16 has recently noticed two issues with string literals in
> the current WP.
>
> The first is in lex.phases#1.5 ( http://eel.is/c++draft/lex.phases#1.5
> ) where characters in all string literals are converted into the
> execution character set, which should be true only for unprefixed
> literals. U, u, and u8 string literals should be converted to UTF-32,
> UTF-16, and UTF-8, respectively, and wide literals to the wide
> execution encoding.
Sounds good. I'm struggling with the standard talking only about
character sets here and not encodings, but that is a different
pre-existing problem.
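
For reference, a minimal sketch (mine, not from Steve's report) of what
phase 5 should produce for each encoding-prefix, using U+0102 LATIN
CAPITAL LETTER A WITH BREVE; the char8_t line assumes the C++20 WP:

    char     n[]   =   "\u0102"; // execution encoding; lossy if U+0102
                                 // is not representable (e.g. '?')
    wchar_t  w[]   =  L"\u0102"; // wide execution encoding
    char8_t  s8[]  = u8"\u0102"; // UTF-8:  { 0xC4, 0x82, 0x00 }
    char16_t s16[] =  u"\u0102"; // UTF-16: { 0x0102, 0x0000 }
    char32_t s32[] =  U"\u0102"; // UTF-32: { 0x0102, 0x0000 }
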
>
> The second, however, follows from the first, where string literals are
> concatenated after being translated. lex.string#12 (
> http://eel.is/c++draft/lex.string#12.note-1 ) teaches that "If one
> string-literal has no encoding-prefix, it is treated as a
> string-literal of the same encoding-prefix as the other operand. "
> However, since this happens _after_ encoding, there is no sensible way
> to achieve this. The execution encoding will not, in general, be a
> valid Unicode encoding, and even if it happens to be, it will not
> encode the same source characters. The conversion from
> universal-character-name to execution encoding will also, in general,
> be lossy, leading to replacement characters, like '?', in the strings.
I think it would be helpful to explicitly mention translation phases 5
and 6 here.
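
To illustrate the ordering problem (my example, assuming a Windows-1252
execution encoding in which U+0102 is not representable):

    // Phase 5 converts each string-literal separately; phase 6 then
    // concatenates the already-encoded results:
    char16_t text[] = u"\u0102" "\u0102";
    //   phase 5: u"\u0102" -> UTF-16 { 0x0102 }
    //            "\u0102"  -> execution encoding; no A-with-breve in
    //                         Windows-1252, so e.g. { '?' }
    //   phase 6: the note says to treat the unprefixed literal as
    //            u-prefixed, but its bytes no longer identify the source
    //            characters, so the intent cannot be recovered.
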
>
> SG16 has not reached consensus on how the issue should be resolved,
> except that creating mojibake, as happens in practice now, is undesirable.
I don't think it is necessary to mention SG16 when reporting the issue
since we don't (yet) have a proposed resolution to offer.
>
> MSVC 19.16, for example, when processing `char16_t text1[] = u""
> "\u0102";` with the /utf-8 option, encodes the string literal as {0xC4
> 0x82}, then treats that pair of bytes as Windows-1252, the normal
> execution encoding, before re-encoding as UTF-16, {0x00C4 0x201A},
> where the first character is U+00C4 LATIN CAPITAL LETTER A WITH
> DIAERESIS and the second is U+201A SINGLE LOW-9 QUOTATION MARK,
> which corresponds to 0x82 in Windows-1252.
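
Restating the reported bytes compactly (my summary, no additional data):

    // MSVC 19.16 with /utf-8, per the report above:
    char16_t text1[] = u"" "\u0102";
    //   phase 5: "\u0102" -> UTF-8 bytes { 0xC4, 0x82 }
    //   bytes reinterpreted as Windows-1252:
    //            0xC4 -> U+00C4, 0x82 -> U+201A
    //   re-encoded to UTF-16: { 0x00C4, 0x201A }  (mojibake)
    //   expected:             { 0x0102 }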

I think it would be useful to present the current implementation
divergence between gcc/clang and MSVC (without use of the /utf-8 option,
since that mode is clearly buggy). This would demonstrate that gcc and
clang behave the same for `u"" "x"` vs `u"" u"x"`, whereas MSVC does
not.
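
A minimal test case one might use for that comparison (the gcc/clang
result follows from my reading of the above; MSVC's actual output
should be captured, not assumed):

    char16_t a[] = u"" u"\u0102"; // all u-prefixed: UTF-16 { 0x0102 }
    char16_t b[] = u""  "\u0102"; // mixed: gcc/clang give { 0x0102 } as
                                  // well; MSVC (without /utf-8) differs,
                                  // and its output belongs in the issue.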

Richard offered a potential resolution on the std-discussion list; it
may be worth submitting his suggestion (with appropriate attribution, of
course) with the issue:

https://groups.google.com/a/isocpp.org/d/msg/std-discussion/qYf6treuLmY/dljWwyawCwAJ

Tom.

Received on 2019-03-17 05:42:17