[SG16-Unicode] Abstract and notes for D1859R0: Standard terminology for execution character set encodings

From: Steve Downey <sdowney_at_[hidden]>
Date: Thu, 5 Sep 2019 21:41:42 -0400
Because I needed to circulate what I'm doing for Belfast, I've thrown
together an abstract for the paper we've peripherally discussed about
modernizing and tightening the specification around encodings of characters
generally, and the source and execution character sets.

This document proposes new standard terms for the various encodings for
character and string literals, and the encodings associated with some
character types. It also proposes that the wording used for [lex.charset],
[lex.ccon], [lex.string], and [basic.fundamental]/8 be modified to reflect
the new terminology. This paper does not intend to propose any changes that
would require changes in any currently conforming implementation.

I'm hoping to have some preliminary work by the next telecon. The direction
I'm thinking of is that both Source and Execution Character Set are
descriptions of the abstract characters, selected from ISO/IEC 10646, that
must be present to support C++. Encodings, both source and execution, are
implementation-defined. I would like to introduce terminology to describe
the encoding used when translating narrow and wide character and string
literals. I'd also like to make it explicit somewhere up front that there
are associated encodings for some, but not all, character types. This is
mentioned now in filesystem, but should be moved to a section with wider
scope. The encoding for `char` and `wchar_t` is controlled by `locale`. The
encoding for the Unicode character types is fixed. The encoding used for
literals is chosen at compile time, and is implementation-defined. If the
locale and that encoding conflict, behavior is unspecified. Combining TUs
with different encodings is in general unspecified, unless it results in an
ODR violation.
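
To make the distinction concrete, here is a rough sketch (not proposed
wording) of how the three kinds of encoding show up in a program. The names
`narrow_units`, `utf8_units`, `utf16_units`, `utf32_units` and
`dynamic_locale_name` are made up for illustration. It assumes the
implementation's narrow literal encoding can represent U+00E9 at all, which
is true for a UTF-8 or Latin-1 literal encoding but not guaranteed.

```cpp
#include <cassert>   // runtime check in dynamic_locale_name()
#include <clocale>
#include <cstddef>
#include <string>

// Code units needed for U+00E9 in each literal form, excluding the terminator.

// Narrow literal encoding: implementation-defined, fixed when the TU is
// compiled (2 with a UTF-8 literal encoding, 1 with Latin-1, ...).
constexpr std::size_t narrow_units = sizeof("\u00e9") - 1;

// Unicode character types: the encoding is fixed by the standard.
constexpr std::size_t utf8_units  = sizeof(u8"\u00e9") - 1;
constexpr std::size_t utf16_units = sizeof(u"\u00e9") / sizeof(char16_t) - 1;
constexpr std::size_t utf32_units = sizeof(U"\u00e9") / sizeof(char32_t) - 1;

static_assert(utf8_units == 2,  "U+00E9 is two UTF-8 code units");
static_assert(utf16_units == 1, "U+00E9 is one UTF-16 code unit");
static_assert(utf32_units == 1, "U+00E9 is one UTF-32 code unit");

// Dynamic encoding: implied by the locale in effect at run time. Passing a
// null pointer only queries the current LC_CTYPE locale; nothing is changed.
inline std::string dynamic_locale_name() {
    const char* name = std::setlocale(LC_CTYPE, nullptr);
    assert(name != nullptr);
    return name ? std::string(name) : std::string();
}
```

Nothing ties `narrow_units` to whatever `dynamic_locale_name()` reports: the
first is baked in by the compiler, the second can change with every call to
`std::setlocale`. That gap is the conflict described above.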

Some possible terms:
{"",Narrow,Wide} Literal Encoding - encoding of character and string literals
Dynamic Encoding - encoding implied by locale
*Character Set - A set of abstract characters (Latin Capital Letter A,
Digit Zero, Left Parenthesis ...)
*Basic Character Set - minimum required to be encoded
*Extended Character Set - what can be encoded
*Source Character Set - must be encodable in C++ source
*Execution Character Set - Source + control characters

* Current terms, with what I think the actual meanings are today.
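
As an illustration of treating a character set as a set of abstract
characters rather than as an encoding, here is a sketch of membership tests
for the basic source and basic execution character sets as C++17
[lex.charset] lists them. The function names are invented for this example.
Code points are compared as `char32_t`, so the result does not depend on the
narrow literal encoding.

```cpp
#include <cassert>   // runtime check in check_basic_source_set_size()
#include <string>

// The 96 members of the C++17 basic source character set, as code points:
// space, HT, VT, FF, new-line, plus 91 graphical characters.
inline const std::u32string& basic_source_set() {
    static const std::u32string members =
        U" \t\v\f\n"
        U"abcdefghijklmnopqrstuvwxyz"
        U"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        U"0123456789"
        U"_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";
    return members;
}

inline bool in_basic_source_set(char32_t c) {
    return basic_source_set().find(c) != std::u32string::npos;
}

// Basic execution character set: the basic source set plus alert,
// backspace, carriage return, and the null character.
inline bool in_basic_execution_set(char32_t c) {
    return in_basic_source_set(c) ||
           c == U'\a' || c == U'\b' || c == U'\r' || c == U'\0';
}

// Sanity check on the table itself.
inline void check_basic_source_set_size() {
    assert(basic_source_set().size() == 96);
}
```

For example, `in_basic_source_set(U'A')` is true and
`in_basic_source_set(U'$')` is false, while carriage return is a member only
of the execution set. How any of these characters is represented in bytes is
a separate question, which is what the encoding terms are for.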

Received on 2019-09-06 03:41:55