[SG16-Unicode] Draft string literal issues

From: Steve Downey <sdowney_at_[hidden]>
Date: Sat, 16 Mar 2019 13:46:12 -0400
Study Group 16 has recently noticed two issues with string literals in the
current WP.

The first is in lex.phases#1.5 ( http://eel.is/c++draft/lex.phases#1.5 ),
where characters in all string literals are converted into the execution
character set. That should be true only for unprefixed literals: U, u, and
u8 string literals should be converted to UTF-32, UTF-16, and UTF-8
respectively, and wide literals into the wide encoding.

The second follows from the first: string literals are concatenated after
being translated. lex.string#12 (
http://eel.is/c++draft/lex.string#12.note-1 ) says that "If one
string-literal has no encoding-prefix, it is treated as a string-literal of
the same encoding-prefix as the other operand." However, since this
happens _after_ encoding, there is no sensible way to achieve it. The
execution encoding will not, in general, be a valid Unicode encoding, and
even where the result happens to be valid, it will not encode the same
source characters. The conversion from universal-character-name to
execution encoding will also, in general, be lossy, leading to replacement
characters, like '?', in the strings.
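The concatenation rule quoted above can be stated as a small check: an
unprefixed operand next to a u"" operand should yield the same code units
as writing the whole literal with the u prefix. The sketch below assumes
C++14 or later for the constexpr loop; `same_code_units` and
`concatenation_preserves_encoding` are hypothetical names used only here.

```cpp
#include <cstddef>

// Hypothetical helper: compare two UTF-16 literals code unit by code
// unit, including the terminating null.
template <std::size_t N, std::size_t M>
constexpr bool same_code_units(const char16_t (&a)[N],
                               const char16_t (&b)[M]) {
    if (N != M) return false;
    for (std::size_t i = 0; i < N; ++i) {
        if (a[i] != b[i]) return false;
    }
    return true;
}

// True when the unprefixed operand is treated as if it carried the
// u prefix, which is what the quoted note describes.
constexpr bool concatenation_preserves_encoding() {
    return same_code_units(u"" "\u0102", u"\u0102");
}
```

An implementation that encodes the unprefixed operand into the execution
character set before concatenating would be expected to return false
here.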

SG16 has not reached consensus on how the issue should be resolved, except
that creating mojibake, as happens in practice now, is undesirable.

For example, MSVC 19.16, when processing `char16_t text1[] = u"" "\u0102";`
with the utf-8 option, encodes the string literal as {0xC4 0x82}, then
treats that pair of bytes as Windows-1252, the normal execution encoding,
before re-encoding as UTF-16. The result is {0x00C4 0x201A} rather than the
expected {0x0102}: the first character is U+00C4 LATIN CAPITAL LETTER A
WITH DIAERESIS, and the second is U+201A SINGLE LOW-9 QUOTATION MARK,
which is what byte 0x82 maps to in Windows-1252.
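The sequence of conversions described above can be reproduced outside the
compiler. The following is a minimal simulation, not MSVC's actual code
path: `to_utf8`, `cp1252_to_utf16`, and `mojibake` are hypothetical names,
and the Windows-1252 bytes that have no assigned character are passed
through unchanged as an assumption.

```cpp
#include <string>
#include <vector>

// Encode a single Unicode code point as UTF-8 bytes.
std::vector<unsigned char> to_utf8(char32_t cp) {
    std::vector<unsigned char> out;
    if (cp < 0x80) {
        out.push_back(static_cast<unsigned char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<unsigned char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<unsigned char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<unsigned char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<unsigned char>(0x80 | (cp & 0x3F)));
    }
    return out;
}

// Interpret one byte as Windows-1252 and return its UTF-16 code unit.
// Only 0x80..0x9F differ from Latin-1; unassigned bytes pass through.
char16_t cp1252_to_utf16(unsigned char byte) {
    static const char16_t high[32] = {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178};
    if (byte >= 0x80 && byte <= 0x9F) return high[byte - 0x80];
    return static_cast<char16_t>(byte);
}

// UTF-8 encode, then misread each byte as Windows-1252.
std::u16string mojibake(char32_t cp) {
    std::u16string out;
    for (unsigned char b : to_utf8(cp)) out.push_back(cp1252_to_utf16(b));
    return out;
}
```

Calling `mojibake(0x0102)` yields the two code units 0x00C4 and 0x201A,
matching the output reported for MSVC above.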

Received on 2019-03-16 18:46:26