Date: Mon, 7 May 2018 13:31:38 +0200
Make char16_t/char32_t string literals be UTF-16/32
Document Number: P1041R0
Date: 2018-04-24
Audience: Evolution Working Group
Reply-to: cpp_at_[hidden]
Introduction
C++11 introduced character types suitable for code units of
the UTF-16 and UTF-32 encoding forms, namely char16_t and
char32_t. Along with this, it also introduced new string
literals whose types are arrays of those two character types,
prefixed with u and U, respectively. And last but not least,
it also introduced UTF-8 string literals, prefixed with u8,
whose types are arrays of const char. Of these three new
string literal types, only one has a guarantee about the
values that the elements of the array have; in other words,
only one has a guaranteed encoding form, the UTF-8 string
literals.
The standard text hints that the char16_t and char32_t
string literals are intended to be encoded as UTF-16 and
UTF-32, respectively, but, unlike for UTF-8 string literals,
it never explicitly states such a requirement.
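
As a minimal sketch of the situation described above, the following compile-time checks show the element types of the u and U literals and the one encoding guarantee that already exists, for u8 literals. The helper name utf8_e_acute_is_c3_a9 is hypothetical and chosen only for illustration; it is not part of the paper or the standard.

```cpp
#include <type_traits>

// A string literal is an lvalue, so decltype yields a reference to an array.
static_assert(std::is_same<decltype(u"a"), const char16_t (&)[2]>::value,
              "u\"...\" is an array of const char16_t");
static_assert(std::is_same<decltype(U"a"), const char32_t (&)[2]>::value,
              "U\"...\" is an array of const char32_t");

// Hypothetical helper (not from the paper): checks that U+00E9 in a u8
// literal is stored as the UTF-8 bytes C3 A9 followed by the terminator.
// This is the guarantee the standard already gives for UTF-8 literals.
constexpr bool utf8_e_acute_is_c3_a9() {
    return static_cast<unsigned char>(u8"\u00E9"[0]) == 0xC3 &&
           static_cast<unsigned char>(u8"\u00E9"[1]) == 0xA9 &&
           u8"\u00E9"[2] == 0;
}
static_assert(utf8_e_acute_is_c3_a9(), "u8 literals are UTF-8 encoded");
```

No comparable check on the element values of u or U literals is backed by explicit standard wording, which is the gap this paper addresses.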
Motivation
In defining char16_t string literals ([lex.string]/10), the
standard makes a mention of “surrogate pairs”:
A string-literal that begins with u, such as u"asdf", is a char16_t string literal. A char16_t string literal has type “array of n const char16_t”, where n is the size of the string as defined below; it is initialized with the given characters. A single c-char may produce more than one char16_t character in the form of surrogate pairs.
Further down, when defining the size of char16_t string
literals ([lex.string]/15), there is another mention of
“surrogate pairs”:
The size of a char16_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u'\0'. [Note: The size of a char16_t string literal is the number of code units, not the number of characters. — end note]
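
The size rule quoted above can be illustrated with a short sketch. The variable names are illustrative only; U+00E9 stands in for a character in the basic multilingual plane and U+1F600 for one that requires a surrogate pair.

```cpp
#include <cstddef>

// Names below are illustrative, not taken from the paper.

// U+00E9 lies in the basic multilingual plane:
// one element for the character, one for the terminating u'\0'.
constexpr std::size_t bmp_size = sizeof(u"\u00E9") / sizeof(char16_t);

// U+1F600 lies outside the basic multilingual plane:
// one element for the character, one more because it requires a
// surrogate pair, and one for the terminating u'\0'.
constexpr std::size_t pair_size = sizeof(u"\U0001F600") / sizeof(char16_t);

static_assert(bmp_size == 2, "one code unit plus terminator");
static_assert(pair_size == 3, "two code units plus terminator");
```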
For char32_t string literals, the definition of their size
([lex.string]/15) essentially limits the encoding form used
to one that doesn’t have more than one code unit per
character:
The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'.
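
By contrast with the char16_t case, the rule just quoted gives every character exactly one element in a char32_t string literal. A small sketch, with illustrative variable names that do not come from the paper:

```cpp
#include <cstddef>

// Names below are illustrative, not taken from the paper.

// U+1F600 needs a surrogate pair in a char16_t literal, but in a char32_t
// literal it counts as a single element, plus the terminating U'\0'.
constexpr std::size_t single_size = sizeof(U"\U0001F600") / sizeof(char32_t);

// Three characters (one ordinary, two universal-character-names)
// plus the terminating U'\0'.
constexpr std::size_t mixed_size = sizeof(U"a\u00E9\U0001F600") / sizeof(char32_t);

static_assert(single_size == 2, "one element per character plus terminator");
static_assert(mixed_size == 4, "three characters plus terminator");
```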
Additionally, the standard constrains the range of universal-character-names to the range that is supported by all of the UTF encoding forms discussed here:
Within char32_t and char16_t string literals, any universal-character-names shall be within the range 0x0 to 0x10FFFF.
All of these requirements, while never explicitly naming the UTF-16 or UTF-32 encoding forms, strongly imply that these are the encoding forms intended. Furthermore, it would be questionable for an implementation to pick any other encoding forms for these string literals: there is no well-known encoding form that uses a concept named “surrogate pair” other than UTF-16, and there is no well-known encoding form that encodes each character as a single 32-bit code unit other than UTF-32.
In practice, all implementations use UTF-16 and UTF-32 for these string literals. C++ should standardize this practice and make these requirements explicit instead of just hinting at them.
Proposal
This proposal renames “char16_t string literals” and
“char32_t string literals” to “UTF-16 string literals” and
“UTF-32 string literals”, to match the existing “UTF-8 string
literals”, and explicitly requires the object representations
of those literals to be the values that correspond to the
UTF-16 and UTF-32 (respectively) encodings of the given
characters.
Technical Specifications
- Change [lex.string]/10, replacing “char16_t string literal” with “UTF-16 string literal”:
  A string-literal that begins with u, such as u"asdf", is a UTF-16 string literal. A UTF-16 string literal has type “array of n const char16_t”, where n is the size of the string as defined below; it is initialized with the given characters. A single c-char may produce more than one char16_t character in the form of surrogate pairs.
- Change [lex.string]/11, replacing “char32_t string literal” with “UTF-32 string literal”:
  A string-literal that begins with U, such as U"asdf", is a UTF-32 string literal. A UTF-32 string literal has type “array of n const char32_t”, where n is the size of the string as defined below; it is initialized with the given characters.
- Insert a paragraph between [lex.string]/10 and /11:
  For a UTF-16 string literal, each successive element of the object representation has the value of the corresponding code unit of the UTF-16 encoding of the string.
- Insert a paragraph between [lex.string]/11 and /12:
  For a UTF-32 string literal, each successive element of the object representation has the value of the corresponding code unit of the UTF-32 encoding of the string.
- Change [lex.ccon]/4, replacing “char16_t character literal” with “UTF-16 character literal”:
  A character literal that begins with the letter u, such as u'x', is a character literal of type char16_t, known as a UTF-16 character literal. The value of a UTF-16 character literal containing a single c-char is equal to its ISO 10646 code point value, provided that the code point value is representable with a single 16-bit code unit (that is, provided it is in the basic multi-lingual plane). If the value is not representable with a single 16-bit code unit, the program is ill-formed. A UTF-16 character literal containing multiple c-chars is ill-formed.
- Change [lex.ccon]/5, replacing “char32_t character literal” with “UTF-32 character literal”:
  A character literal that begins with the letter U, such as U'y', is a character literal of type char32_t. The value of a UTF-32 character literal containing a single c-char is equal to its ISO 10646 code point value. A UTF-32 character literal containing multiple c-chars is ill-formed.
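
The effect of the wording above can be sketched with compile-time checks. This is an illustration only: it assumes an implementation that already encodes these literals as UTF-16 and UTF-32, which the Motivation section says all implementations do in practice. The helper functions high_surrogate and low_surrogate are hypothetical names introduced here; they apply the usual UTF-16 surrogate arithmetic and are not part of the proposed wording.

```cpp
#include <type_traits>

// Hypothetical helpers (not from the paper): compute the UTF-16 surrogate
// pair for a code point above U+FFFF using the standard UTF-16 arithmetic.
constexpr char16_t high_surrogate(char32_t cp) {
    return static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10));
}
constexpr char16_t low_surrogate(char32_t cp) {
    return static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF));
}

// UTF-16 string literal: successive elements are the UTF-16 code units.
// Assumption: the implementation uses UTF-16 for u"..." literals.
static_assert(u"\U0001F600"[0] == high_surrogate(0x1F600), "high surrogate");
static_assert(u"\U0001F600"[1] == low_surrogate(0x1F600), "low surrogate");
static_assert(u"\U0001F600"[2] == 0, "terminating u'\\0'");

// UTF-32 string literal: each element is the code point itself.
// Assumption: the implementation uses UTF-32 for U"..." literals.
static_assert(U"\U0001F600"[0] == 0x1F600, "single UTF-32 code unit");

// Character literals: the value equals the ISO 10646 code point value.
static_assert(std::is_same<decltype(u'x'), char16_t>::value, "type of u'x'");
static_assert(std::is_same<decltype(U'y'), char32_t>::value, "type of U'y'");
static_assert(u'\u00E9' == 0x00E9, "BMP code point fits one 16-bit unit");
static_assert(U'\U0001F600' == 0x1F600, "any code point fits in char32_t");
```

With the proposed wording, the checks on element values would be guaranteed by the standard rather than resting on common implementation practice.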
Interaction with other papers
Currently, the standard lacks a normative reference to UTF-16 and UTF-32; however, it also lacks such a reference for UTF-8. This paper assumes that this problem will be fixed for all three encodings in another paper, potentially D1025R0 (Update The Reference To The Unicode Standard).
This paper was also written so as to not conflict with P0482R2 (char8_t: A type for UTF-8 characters and strings (Revision 2)).
Received on 2018-05-07 13:32:49