sg16: Re: [SG16-Unicode] Abstract and notes for D1859R0: Standard terminology for execution character set encodings

From: Tom Honermann <tom_at_[hidden]>
Date: Sat, 7 Sep 2019 23:46:20 -0400

On 9/5/19 9:41 PM, Steve Downey wrote:
> Because I needed to circulate what I'm doing for Belfast, I've thrown
> together an abstract for the paper we've peripherally discussed about
> modernizing and tightening the specification around encodings of
> characters generally, and the source and execution character sets.
>
> "
> This document proposes new standard terms for the various encodings
> for character and string literals, and the encodings associated with
> some character types. It also proposes that the wording used for
> [lex.charset], [lex.ccon], [lex.string], and [basic.fundamental] 8 be
> modified to reflect the new terminology. This paper does not intend to
> propose any changes that would require changes in any currently
> conforming implementation.
> "
>
> I'm hoping to have some preliminary work by the next telecon. The
> direction I'm thinking is that both Source and Execution Character Set
> are descriptions of the abstract characters, selected from 10646, that
> must be present to support C++. Encodings, both source and execution,
> are implementation defined. I would like to introduce terminology to
> describe the encoding used when translating narrow and wide character
> and string literals. I'd also like to make it explicit somewhere up
> front that there are associated encodings for some, but not all,
> character types. This is mentioned now in filesystem, but should be
> moved to a section with wider scope. The encoding for `char` and
> `wchar_t` is controlled by `locale`. The encoding for the Unicode
> character types is fixed. The encoding used for literals is chosen at
> compile time, and is implementation-defined. If locale and that
> encoding conflict, behavior is unspecified. Combining TUs with
> different encodings is in general unspecified, unless it results in an
> ODR violation.
This all sounds great. My only question is whether the behavior should
be unspecified or undefined. It seems challenging to get away with
making it only unspecified.
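
To make the locale/literal mismatch concrete, here is a minimal sketch. The
helper names (`literal_bytes`, `dynamic_bytes`, `encodings_agree_on`) are
hypothetical and not proposed wording, and it assumes an ASCII-compatible
literal encoding and the default "C" locale at startup.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cwchar>
#include <string>

// Unicode character types: the encoding is fixed by the standard.
// U+00E9 takes two UTF-8 code units and one UTF-16 code unit (plus the terminator).
static_assert(sizeof(u8"\u00e9") == 3, "UTF-8: 2 code units + NUL");
static_assert(sizeof(u"\u00e9") == 2 * sizeof(char16_t), "UTF-16: 1 code unit + NUL");

// Narrow literal: its bytes are chosen at translation time by the
// implementation-defined literal encoding and never change afterwards.
inline std::string literal_bytes() { return "A"; }

// Run-time conversion: the bytes depend on the encoding implied by the
// current C locale (the "dynamic encoding" in the proposed terminology).
inline std::string dynamic_bytes(wchar_t wc) {
    char buf[MB_LEN_MAX];
    std::mbstate_t state{};
    const std::size_t n = std::wcrtomb(buf, wc, &state);
    return n == static_cast<std::size_t>(-1) ? std::string{} : std::string(buf, n);
}

// Hypothetical check: do the literal encoding and the dynamic encoding
// produce the same bytes for this character?
inline bool encodings_agree_on(wchar_t wc, const std::string& literal) {
    return dynamic_bytes(wc) == literal;
}
```

For basic characters such as `A` the two encodings normally agree. For a
character like U+00E9, a UTF-8 literal encoding combined with a Latin-1
locale would yield different bytes, which is the conflict whose behavior
needs to be classified.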
>
> Some possible terms:
> {"",Narrow,Wide} Literal Encoding - encoding of character and string literals
> Dynamic Encoding - encoding implied by locale
> *Character Set - A set of abstract characters (Latin Capital Letter
> A, Digit Zero, Left Parenthesis, ...)
Unicode uses "character repertoire" for abstract sets of characters. I
favor following suit there.
> *Basic Character Set - minimum required to be encoded
> *Extended Character Set - what can be encoded
> *Source Character Set - must be encodable in C++ source
I don't think "source character set" is defined today. The closest we
get is "Physical source file characters" in [lex.phases]p1
<http://eel.is/c++draft/lex.phases#1.1>.
> *Execution Character Set - Source + control characters
>
> * Current terms, with what I think the actual meanings are today.
>
>
I think these are good. With these, there is no need for a term like
"execution encoding", correct? At compile-time, "literal encoding"
encodes "execution character set" characters, and at run-time, "dynamic
encoding" encodes "extended character set" characters, yes?

I like that this doesn't stray far from the existing terms.

Tom.

Received on 2019-09-08 05:46:26