[SG16-Unicode] P1859R0: Standard terminology for execution character set encodings

From: Steve Downey <sdowney_at_[hidden]>
Date: Tue, 8 Oct 2019 23:14:08 -0400
Standard terminology character sets and encodings
Document #: P1859R0
Date: 2019-10-06
Project: Programming Language C++
SG16
EWG
CWG
Reply-to: Steve Downey
<sdowney_at_[hidden], sdowney2_at_[hidden]>

Abstract: This document proposes new standard terms for the various
encodings for character and string literals, and the encodings associated
with some character types. It also proposes that the wording used for
[lex.charset], [lex.ccon], [lex.string], and [basic.fundamental] 8 be
modified to reflect the new terminology. This paper does not intend to
propose any changes that would require changes in any currently conforming
implementation.
1 Introduction

In discussions around understanding the current capabilities of C++ and
proposing new capabilities and facilities, SG16 has found that the current
standard wording is often unclear and does not match the terminology
used in ISO/IEC 10646 and the Unicode Standard. This makes technical
discussions difficult. For example, the phrases “execution encoding” and
“presumed execution encoding” often come up in attempts to describe the
encodings of char literals and strings as interpreted by the character
classification functions. This usage conflates several concepts and is
not actually standard terminology. It would be useful to have standard
terminology that does cover these concepts.

Execution character set is a standard term; however, it defines which *abstract
characters* must be included in the *character repertoire* of the character
set used to encode C++, specifically the various kinds of character
literals. That character set is a strict superset of the source character
set, which defines the *abstract characters* that must be in the *character
repertoire* of the character set used to write C++ source code. The
encodings of those character sets are not specified, and in fact several
encodings may be used, depending on the context or kind of literal.

There are five encodings that are associated with the five kinds of
character literals, corresponding to char, wchar_t, char8_t, char16_t, and
char32_t. For char8_t, char16_t, and char32_t, the encodings must be UTF-8,
UTF-16, and UTF-32, respectively. There are no associated encodings for
unsigned char or signed char.
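The correspondence between literal prefix, character type, and encoding can be checked at compile time. The following is a minimal sketch, not part of the paper; the names e_acute_utf16, grinning_utf32, and code_units are hypothetical helpers introduced only for illustration, and the char8_t check is guarded because that type is only available from C++20.

```cpp
#include <cstddef>
#include <type_traits>

// The literal prefix selects the character type.
static_assert(std::is_same<decltype('a'), char>::value, "ordinary literal is char");
static_assert(std::is_same<decltype(L'a'), wchar_t>::value, "L prefix gives wchar_t");
static_assert(std::is_same<decltype(u'a'), char16_t>::value, "u prefix gives char16_t");
static_assert(std::is_same<decltype(U'a'), char32_t>::value, "U prefix gives char32_t");
#ifdef __cpp_char8_t
static_assert(std::is_same<decltype(u8'a'), char8_t>::value, "u8 prefix gives char8_t");
#endif

// UTF-16 and UTF-32 fix the code unit values on every implementation.
constexpr char16_t e_acute_utf16 = u'\u00E9';       // U+00E9 is one UTF-16 code unit
constexpr char32_t grinning_utf32 = U'\U0001F600';  // U+1F600 is one UTF-32 code unit

// Number of code units in a string literal, excluding the terminating null.
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }
```

For instance, code_units(u"\U0001F600") is 2 (a surrogate pair) while code_units(U"\U0001F600") is 1. No comparable guarantee exists for the char and wchar_t forms, whose encodings are implementation defined.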

The encoding used for narrow and wide character and string literals is
implementation defined, and is, of course, fixed at translation time.

At runtime, however, interpretation of character data is usually controlled
by locale, either explicitly or via the global locale set by setlocale().
The encoding implied by the dynamic locale need not be the same as the
literal encoding used at translation time. This is a source of errors in
text processing.
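A small sketch of how the same bytes are interpreted through a locale at runtime. The helper count_alpha is hypothetical and not part of the paper; it uses the two-argument std::isalpha from <locale>, so the locale is passed explicitly rather than taken from setlocale().

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Counts the bytes of `bytes` that `loc` classifies as alphabetic.
// The answer depends on `loc`, not on the encoding the compiler used
// for the literal that produced the bytes.
std::size_t count_alpha(const std::string& bytes, const std::locale& loc) {
    std::size_t n = 0;
    for (char c : bytes) {
        if (std::isalpha(c, loc)) ++n;
    }
    return n;
}
```

With the classic "C" locale, count_alpha("caf\xE9", std::locale::classic()) counts only the three ASCII letters, because the byte 0xE9 is not alphabetic there. A locale whose encoding is ISO 8859-1 would be expected to classify 0xE9 (é) as a letter; locale names are platform specific, so none is shown here.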

Another source of problems is the baked-in assumption that a single wchar_t
can encode any representable character. For ABIs where wchar_t is 16 bits,
this is not true, and many of the NTMBS functions are incomplete, as they
do not allow for stateful wide character encodings.
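To make the 16-bit limitation concrete, here is a sketch of standard UTF-16 encoding arithmetic. The names utf16_units, utf16_length, and encode_utf16 are hypothetical; the input is assumed to be a valid Unicode scalar value (no surrogates, at most U+10FFFF), which the code does not check.

```cpp
#include <cstddef>

// Result of encoding one code point: one or two UTF-16 code units.
struct utf16_units {
    char16_t unit[2];
    std::size_t size;
};

// Code points above U+FFFF do not fit in a single 16-bit code unit.
constexpr std::size_t utf16_length(char32_t cp) { return cp < 0x10000 ? 1 : 2; }

// Precondition (assumed, not checked): cp is a Unicode scalar value.
constexpr utf16_units encode_utf16(char32_t cp) {
    return cp < 0x10000
        ? utf16_units{{static_cast<char16_t>(cp), 0}, 1}
        : utf16_units{{static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10)),
                       static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF))},
                      2};
}
```

U+1F600 encodes as the surrogate pair 0xD83D 0xDE00, so an interface that hands back one 16-bit wchar_t per character cannot represent it as a single value.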
2 Terms

Literal Encoding
    The encoding used for character and wide character and string literals
    in a translation unit.

Dynamic Encoding
    The encoding implied by the LC_CTYPE category of locale.

Character Set [https://unicode.org/glossary/#character_set]
    A collection of elements used to represent textual information.

Abstract Character [https://unicode.org/glossary/#abstract_character]
    A unit of information used for the organization, control, or
    representation of textual data.

Character Repertoire [https://unicode.org/glossary/#character_repertoire]
    The collection of characters included in a character set.

Basic source character set
    The abstract characters that must be representable in the character
    set used for source code.

Basic execution character set
    The abstract characters the character repertoire of the character set
    used for literals must include. A superset of the abstract characters
    in the basic source character set.

Execution character set
    The set of abstract characters representable by a char or char string
    literal.

Execution wide-character set
    The set of abstract characters representable by a wchar_t or wchar_t
    string literal.

3 Example of use (not an actual proposal, yet)

3.1 Proposal Dnnnn

3.1.1 literal_encoding

Returns an *unspecified* callable taking a range of elements of type char and
returning a view of code points decoded from the input range, treating
them as being in the *literal encoding* used for the current translation
unit.
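A rough sketch of what such a facility might look like from the user's side. Everything here is an assumption made for illustration: the function name mirrors the heading, the literal encoding is assumed to be UTF-8 (a real implementation would use whatever encoding the compiler applied), and the result is an eager std::u32string rather than a lazy view. The decoder replaces malformed input with U+FFFD but does not reject every ill-formed sequence (for example, overlong three-byte forms).

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the literal_encoding facility described above.
// ASSUMPTION: the literal encoding of this translation unit is UTF-8.
// Simplification: returns an eager std::u32string, not a view.
inline auto literal_encoding() {
    return [](const std::string& units) {
        std::u32string out;
        std::size_t i = 0;
        while (i < units.size()) {
            const unsigned char lead = static_cast<unsigned char>(units[i]);
            std::size_t len = 0;  // 0 means "not a valid lead byte"
            char32_t cp = 0;
            if (lead < 0x80)                       { len = 1; cp = lead; }
            else if (lead >= 0xC2 && lead <= 0xDF) { len = 2; cp = lead & 0x1F; }
            else if (lead >= 0xE0 && lead <= 0xEF) { len = 3; cp = lead & 0x0F; }
            else if (lead >= 0xF0 && lead <= 0xF4) { len = 4; cp = lead & 0x07; }

            // Every trailing byte must look like 10xxxxxx.
            bool ok = len != 0 && i + len <= units.size();
            for (std::size_t k = 1; ok && k < len; ++k) {
                const unsigned char cont = static_cast<unsigned char>(units[i + k]);
                ok = (cont & 0xC0) == 0x80;
                cp = (cp << 6) | (cont & 0x3F);
            }
            if (ok) { out.push_back(cp); i += len; }
            else    { out.push_back(0xFFFD); i += 1; }  // replacement character
        }
        return out;
    };
}
```

Under the UTF-8 assumption, literal_encoding()("a\xC3\xA9") yields the two code points U+0061 and U+00E9. The point of the sketch is the shape of the interface: the decoding is tied to the translation unit, not to the runtime locale.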
3.1.2 wide_literal_encoding

Returns an *unspecified* callable taking a range of elements of type wchar_t
and returning a view of code points decoded from the input range, treating
them as being in the *wide literal encoding* used for the current
translation unit.
3.2 Discussion of proposal Dnnnn

Although the proposal is still woefully underspecified, it is at least
clear what is being discussed, and how it might be something a compiler
could implement. Without the terms *literal encoding* and *wide literal
encoding*, discussion gets bogged down quickly around the difference between
what the compiler does and what locale and the *dynamic encoding* imply for
character conversions.
4 Wording

(lex.charset.1) The basic source character set consists of 96 abstract
characters:
the space character, the control characters representing horizontal tab,
vertical tab, form feed, and new-line, plus the following 91 graphical
characters:

    a b c d e f g h i j k l m n o p q r s t u v w x y z
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    0 1 2 3 4 5 6 7 8 9
    _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '

[Editorial Note: This should really be a list of Unicode names or universal
names, aka code points, e.g. LATIN CAPITAL LETTER A, LATIN CAPITAL LETTER B]

(lex.charset.3) The basic execution character set and the basic execution
wide-character set shall each contain all the abstract characters of
the basic source character set, plus control characters representing alert,
backspace, and carriage return, plus a null character (respectively, null
wide character), whose value is 0. For each element in the basic execution
character set, the encoded values of the members shall be non-negative and
distinct from one another. In both the source and execution basic character
sets, the value of each character after 0 in the above list of decimal
digits shall be one greater than the value of the previous. The execution
character set and the execution wide-character set are
implementation-defined supersets of the basic execution character set and
the basic execution wide-character set, respectively. The encoded values of
the members of the execution character sets and the sets of additional
members are implementation-defined.
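The guarantees in this paragraph are what make the familiar digit arithmetic portable across literal encodings. A small sketch follows; is_digit and digit_value are hypothetical helper names, not part of the proposed wording.

```cpp
// Guaranteed by the paragraph above, whatever the literal encoding is.
static_assert('1' - '0' == 1 && '9' - '0' == 9, "decimal digits are consecutive");
static_assert('\0' == 0, "the null character has value 0");
static_assert('a' > 0 && '~' > 0, "basic execution characters are non-negative");

// True only for the ten decimal digits, since they occupy a contiguous range.
constexpr bool is_digit(char c) { return c >= '0' && c <= '9'; }

// Numeric value of a decimal digit, or -1 if c is not a digit.
constexpr int digit_value(char c) { return is_digit(c) ? c - '0' : -1; }
```

No such guarantee is made for letters: in EBCDIC-based encodings the letters 'a' through 'z' are not contiguous, so the analogous c - 'a' trick is not portable.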

[lex.ccon.2] A character literal that does not begin with u8, u, U, or L is
an ordinary character literal. An ordinary character literal that contains
a single c-char representable in the execution character set has type char,
with value equal to the numerical value of the encoding of the c-char in
the literal encoding. An ordinary character literal that contains more than
one c-char is a multicharacter literal. A multicharacter literal, or an
ordinary character literal containing a single c-char not representable in
the execution character set, is conditionally-supported, has type int, and
has an implementation-defined value.

[lex.ccon.6] A character literal that begins with the letter L, such as
L'z', is a wide-character literal. A wide-character literal has type wchar_t.
The value of a wide-character literal containing a single c-char has value
equal to the numerical value of the encoding of the c-char in the
wide literal encoding, unless the c-char has no
representation in the execution wide-character set, in which case the value
is implementation-defined. [ Note: The type wchar_t is able to represent
all members of the execution wide-character set (see [basic.fundamental]).
— end note ] The value of a wide-character literal containing multiple
c-chars is implementation-defined.

Received on 2019-10-09 05:14:24