sg16: Re: [SG16-Unicode] Abstract and notes for D1859R0: Standard terminology for execution character set encodings

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Sun, 8 Sep 2019 11:39:22 +0200

On Sun, 8 Sep 2019 at 05:46, Tom Honermann <tom_at_[hidden]> wrote:

> On 9/5/19 9:41 PM, Steve Downey wrote:
>
> Because I needed to circulate what I'm doing for Belfast, I've thrown
> together an abstract for the paper we've peripherally discussed about
> modernizing and tightening the specification around encodings of characters
> generally, and the source and execution character sets.
>
> "
> This document proposes new standard terms for the various encodings for
> character and string literals, and the encodings associated with some
> character types. It also proposes that the wording used for [lex.charset],
> [lex.ccon], [lex.string], and [basic.fundamental] 8 be modified to reflect
> the new terminology. This paper does not intend to propose any changes that
> would require changes in any currently conforming implementation.
> "
>
> I'm hoping to have some preliminary work by the next telecon. The
> direction I'm thinking is that both Source and Execution Character Set are
> descriptions of the abstract characters, selected from 10646, that must be
> present to support C++. Encodings, both source and execution, are
> implementation defined. I would like to introduce terminology to describe
> the encoding used when translating narrow and wide character and string
> literals. I'd also like to make it explicit somewhere up front that there
> are associated encodings for some, but not all, character types. This is
> mentioned now in filesystem, but should be moved to a section with wider
> scope. The encoding for `char` and `wchar_t` is controlled by `locale`. The
> encoding for the Unicode character types is fixed. The encoding used for
> literals was chosen at compile time, and is implementation defined. If
> locale and that encoding conflict, behavior is unspecified. Combining TUs
> with different encodings is in general unspecified, unless it results in an
> ODR violation.
>
> This all sounds great. My only question is behavior being unspecified vs
> undefined. It seems challenging to get away with making it only
> unspecified.
>

Specifically, I'd like something along the lines of:
If a character literal contains a c-char that does not have the same
representation in the character literal encoding (aka the *presumed* execution
encoding) and the execution encoding, the behavior is undefined.
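
To make the condition concrete, here is a minimal sketch, assuming the
literal encoding is UTF-8 and the execution encoding is ISO 8859-1. Both
choices, and the helper names encode_utf8, encode_latin1 and
same_representation, are hypothetical and only for illustration; real
implementations pick their own encodings. It shows which c-chars would fall
under the proposed rule.

```cpp
#include <string>

// Hypothetical stand-in for the literal encoding chosen at compile time:
// encode one code point as UTF-8. Surrogates and out-of-range values are
// not validated in this sketch.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical stand-in for the execution encoding implied by the locale:
// encode one code point as ISO 8859-1. Returns an empty string when the
// code point is not representable.
std::string encode_latin1(char32_t cp) {
    if (cp > 0xFF) return {};
    return std::string(1, static_cast<char>(cp));
}

// True when the c-char has the same code unit sequence in both encodings,
// i.e. when the proposed wording would NOT make the behavior undefined.
bool same_representation(char32_t cp) {
    const std::string exec = encode_latin1(cp);
    return !exec.empty() && exec == encode_utf8(cp);
}
```

Under these assumptions, 'A' is safe (one byte, 0x41, in both encodings),
while U+00E9 is not: the literal would hold 0xC3 0xA9 but the execution
encoding expects 0xE9.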

>
> Some possible terms:
> {"",Narrow,Wide} Literal Encoding - encoding on char and string literals
> Dynamic Encoding - encoding implied by locale
> *Character Set - A set of abstract characters ( Latin Capital letter A,
> Digit Zero, Left Parenthesis ...)
>
> Unicode uses "character repertoire" for abstract sets of characters. I
> favor following suit there.
>

+1 to sticking to Unicode terms

> *Basic Character Set - minimum required to be encoded
> *Extended Character Set - what can be encoded
> *Source Character Set - must be encodable in C++ source
>
> I don't think "source character set" is defined today. The closest we get
> is "Physical source file characters" in [lex.phases]p1
> <http://eel.is/c++draft/lex.phases#1.1>.
>
> *Execution Character Set - Source + control characters
>
>
Be careful not to break this code:
https://stackoverflow.com/questions/5508110/why-is-this-program-erroneously-rejected-by-three-c-compilers
More seriously, I think it would be beneficial (necessary, even) to have a
source character encoding / character repertoire.

I wonder if we could specify that the internal character repertoire is
Unicode. It kinda has to be already, so we should make that clearer.

I would also propose

Universal Character Name -> Unicode Code Point
("character name" should be reserved for the \N proposal)
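
A small sketch of why "code point" seems like the more accurate term: the
same \u or \U escape designates one code point, and only the literal prefix
decides how many code units it becomes. The variable names below are
hypothetical; the Unicode literal encodings (UTF-8, UTF-16, UTF-32) are
the fixed ones for these prefixes.

```cpp
#include <cstddef>

// The escape designates the code point itself, independent of any encoding.
constexpr char32_t e_acute = U'\u00E9';
static_assert(e_acute == 0xE9, "\\u00E9 designates code point U+00E9");
static_assert(U'\U0001F600' == 0x1F600, "\\U0001F600 designates U+1F600");

// Code unit counts per encoding (sizeof includes the terminator, hence -1).
constexpr std::size_t utf8_units  = sizeof(u8"\u00E9") - 1;
constexpr std::size_t utf16_units = sizeof(u"\U0001F600") / sizeof(char16_t) - 1;
constexpr std::size_t utf32_units = sizeof(U"\U0001F600") / sizeof(char32_t) - 1;

static_assert(utf8_units == 2,  "U+00E9 takes two UTF-8 code units");
static_assert(utf16_units == 2, "U+1F600 takes a UTF-16 surrogate pair");
static_assert(utf32_units == 1, "U+1F600 takes one UTF-32 code unit");
```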

> * Current terms, with what I think the actual meanings are today.
>
>
> I think these are good. With these, there is no need for a term like
> "execution encoding", correct? At compile-time, "literal encoding" encodes
> "execution character set" characters, and at run-time, "dynamic encoding"
> encodes "extended character set" characters, yes?
>
I prefer "execution" to "dynamic".

> I like that this doesn't stray far from the existing terms.
>
> Tom.

Received on 2019-09-08 11:39:35