sg16: Re: [SG16-Unicode] Abstract and notes for D1859R0: Standard terminology for execution character set encodings

From: Tom Honermann <tom_at_[hidden]>
Date: Sun, 8 Sep 2019 23:16:38 -0400

On 9/8/19 12:02 PM, Steve Downey wrote:
> Character repertoire sounds good, and I will eventually learn to spell
> it. Character set is definitely terminology from the pre-Unicode
> times, and unfortunately tends to merge the repertoire and encoding,
> https://www.iana.org/assignments/character-sets/character-sets.xhtml

I think I was a little overzealous earlier in stating that Unicode uses
"character repertoire" as I described. I looked again and don't find
that term formally defined in the standard. However, "repertoire" is
used throughout the standard in ways that I believe are consistent with
my description. I wasn't able to find an alternative formal term.

The way I've been thinking about it is that a "character repertoire"
describes a set of /abstract characters/ (a formal Unicode term) and a
"character set" describes a set of /encoded characters/ (a formal
Unicode term) that associate each /abstract character/ member of a
"character repertoire" with a /code point/ (a formal Unicode term)
within a /codespace/ (a formal Unicode term). See sections 2.4 and 3.4
of Unicode 12 and uses of the word "repertoire" within those chapters.
The Unicode standard does use the term "character set", but I didn't
find a formal definition.
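
To make the distinction concrete, here is a minimal sketch. It is not taken from Unicode or from the C++ draft; the type and function names (Repertoire, CodedCharacterSet, repertoire_of, ascii_subset, ebcdic_subset) are hypothetical, and the EBCDIC values assume code page 037. It shows two "character sets" that assign different code points while sharing one "character repertoire".

```cpp
#include <map>
#include <set>
#include <string>

// A character repertoire: a set of abstract characters, identified here
// by name only. No numbers are involved. (Hypothetical type.)
using Repertoire = std::set<std::string>;

// A character set in the sense described above: each abstract character
// of a repertoire is associated with a code point. (Hypothetical type.)
using CodedCharacterSet = std::map<std::string, char32_t>;

// Recover the repertoire by discarding the code point assignments.
Repertoire repertoire_of(const CodedCharacterSet& cs) {
    Repertoire r;
    for (const auto& entry : cs) {
        r.insert(entry.first);
    }
    return r;
}

// Three characters with their ASCII (and Unicode) code points.
CodedCharacterSet ascii_subset() {
    CodedCharacterSet cs;
    cs["LATIN CAPITAL LETTER A"] = 0x41;
    cs["DIGIT ZERO"] = 0x30;
    cs["LEFT PARENTHESIS"] = 0x28;
    return cs;
}

// The same three characters with EBCDIC values (code page 037 assumed).
CodedCharacterSet ebcdic_subset() {
    CodedCharacterSet cs;
    cs["LATIN CAPITAL LETTER A"] = 0xC1;
    cs["DIGIT ZERO"] = 0xF0;
    cs["LEFT PARENTHESIS"] = 0x4D;
    return cs;
}
```

Comparing repertoire_of(ascii_subset()) with repertoire_of(ebcdic_subset()) yields equal sets, even though every individual code point differs.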

>
> Basic source character set is defined in [lex.charset]
> http://eel.is/c++draft/lex.charset#def:character_set,basic_source
Yes, and it defines a character repertoire. "Physical source file
characters" is the closest I've found to a term that describes the
actual implementation-defined source character set.
>
> I'd like to get away from "execution encoding" because it conflates
> the presumed encoding and the one selected by the current locale. Now,
> admittedly, everyone conflates these and it's a source of error and
> mojibake, but perhaps with better words it would be easier to teach.
I agree. I like "dynamic encoding" because it accurately reflects the
reality that the encoding can be changed dynamically (by calls to
std::setlocale).
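
As a rough illustration of why the conflation produces mojibake, the sketch below decodes the same bytes under two different assumed encodings. It does not call std::setlocale, since which locales are installed varies by system; instead the two decoders stand in for "the encoding the literal was written in" and "the encoding the reader assumes". The function names decode_latin1 and decode_utf8_basic are hypothetical, and the UTF-8 decoder is deliberately incomplete.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Interpret each byte as one ISO-8859-1 character. In that encoding the
// byte value is also the Unicode code point.
std::vector<char32_t> decode_latin1(const std::string& bytes) {
    std::vector<char32_t> out;
    for (unsigned char b : bytes) {
        out.push_back(b);
    }
    return out;
}

// Interpret the bytes as UTF-8. Sketch only: handles one- and two-byte
// sequences, does not reject overlong forms, and substitutes U+FFFD for
// anything it does not recognize.
std::vector<char32_t> decode_utf8_basic(const std::string& bytes) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) {
            out.push_back(lead);
        } else if ((lead & 0xE0) == 0xC0 && i + 1 < bytes.size() &&
                   (static_cast<unsigned char>(bytes[i + 1]) & 0xC0) == 0x80) {
            const unsigned char trail = static_cast<unsigned char>(bytes[++i]);
            out.push_back(static_cast<char32_t>(((lead & 0x1F) << 6) | (trail & 0x3F)));
        } else {
            out.push_back(0xFFFD);
        }
    }
    return out;
}
```

Given the two bytes 0xC3 0xA9, decode_utf8_basic produces the single code point U+00E9 (é), while decode_latin1 produces U+00C3 U+00A9 (Ã©). The bytes are identical; only the assumed encoding differs.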
>
> As to UB. I'd like, if possible, to avoid creating new UB classes.
> Some things should probably be ill-formed, like unencodable
> characters. Others fall into existing UB, like specifying an inline
> string literal with two different encodings. Reading a string with the
> wrong encoding, I think, should be at worst unspecified, unless for
> some reason your decoder has UB, in which case it's the decoder's
> problem, not the incorrect or mixed encoding issue. That said, I'd
> defer to Core on this.
Wherever Core says we can get away with unspecified, I'm all for it.
>
> Internal encoding is required to preserve distinct universal character
> names and treat all representations of the same universal character
> the same. So, the standard effectively requires Unicode, but in terms
> of observables.

Agreed, I don't think anything is accomplished by trying to prescribe
implementation details.

Tom.

>
>
>
> On Sun, Sep 8, 2019 at 5:39 AM Corentin Jabot <corentinjabot_at_[hidden]> wrote:
>
>
>
> On Sun, 8 Sep 2019 at 05:46, Tom Honermann <tom_at_[hidden]> wrote:
>
> On 9/5/19 9:41 PM, Steve Downey wrote:
>> Because I needed to circulate what I'm doing for Belfast,
>> I've thrown together an abstract for the paper we've
>> peripherally discussed about modernizing and tightening the
>> specification around encodings of characters generally, and
>> the source and execution character sets.
>>
>> "
>> This document proposes new standard terms for the various
>> encodings for character and string literals, and the
>> encodings associated with some character types. It also
>> proposes that the wording used for [lex.charset], [lex.ccon],
>> [lex.string], and [basic.fundamental] 8 be modified to
>> reflect the new terminology. This paper does not intend to
>> propose any changes that would require changes in any
>> currently conforming implementation.
>> "
>>
>> I'm hoping to have some preliminary work by the next telecon.
>> The direction I'm thinking is that both Source and Execution
>> Character Set are descriptions of the abstract characters,
>> selected from 10646, that must be present to support C++.
>> Encodings, both source and execution, are
>> implementation-defined. I would like to introduce terminology
>> to describe the encoding used when translating narrow and wide
>> character and string literals. I'd also like to make it explicit
>> somewhere up front that there are associated encodings for
>> some, but not all, character types. This is mentioned now in
>> filesystem, but should be moved to a section with wider
>> scope. The encoding for `char` and `wchar_t` is controlled by
>> `locale`. The encoding for the Unicode character types is
>> fixed. The encoding used for literals was chosen at compile
>> time, and is implementation-defined. If locale and that
>> encoding conflict, behavior is unspecified. Combining TUs
>> with different encodings is in general unspecified, unless it
>> results in an ODR violation.
> This all sounds great. My only question is behavior being
> unspecified vs undefined. It seems challenging to get away
> with making it only unspecified.
>
>
> Specifically, I'd like something along the line of:
> If a character literal contains a c-char that does not have the same
> representation in the character literal encoding (aka "presumed"
> execution encoding) and the execution encoding, the behavior is
> undefined.
>
>
>
>>
>> Some possible terms:
>> {"",Narrow,Wide} Literal Encoding - encoding on char and
>> string literals
>> Dynamic Encoding - encoding implied by locale
>> *Character Set - A set of abstract characters ( Latin Capital
>> letter A, Digit Zero, Left Parenthesis ...)
> Unicode uses "character repertoire" for abstract sets of
> characters. I favor following suit there.
>
>
> +1 to sticking to Unicode terms
>
>> *Basic Character Set - minimum required to be encoded
>> *Extended Character Set - what can be encoded
>> *Source Character Set - must be encodable in C++ source
> I don't think "source character set" is defined today. The
> closest we get is "Physical source file characters" in
> [lex.phases]p1 <http://eel.is/c++draft/lex.phases#1.1>.
>> *Execution Character Set - Source + control characters
>
>
> Be careful not to break that code
> https://stackoverflow.com/questions/5508110/why-is-this-program-erroneously-rejected-by-three-c-compilers
> More seriously, I think it would be beneficial (necessary even) to
> have a source character encoding / character repertoire.
>
>
> I wonder if we could specify that the internal character
> repertoire is Unicode. It kind of has to be already; saying so
> would make that clearer.
>
>
> I would also propose
>
> Universal Character Name -> Unicode Code point
> (character name should be reserved to the \N proposal)
>
>
>>
>> * Current terms, with what I think the actual meanings are today.
>>
>>
> I think these are good. With these, there is no need for a
> term like "execution encoding", correct? At compile-time,
> "literal encoding" encodes "execution character set"
> characters, and at run-time, "dynamic encoding" encodes
> "extended character set" characters, yes?
>
> I prefer "execution" to dynamic
>
> I like that this doesn't stray far from the existing terms.
>
> Tom.
>
>

Received on 2019-09-09 05:16:44