Subject: Re: Updates for D2314R1: Character sets and encodings
From: Corentin Jabot (corentinjabot_at_[hidden])
Date: 2021-02-25 03:21:52
On Thu, Feb 25, 2021 at 9:27 AM Jens Maurer via SG16 <sg16_at_[hidden]> wrote:
> In response to yesterday's discussion:
> - header-names can now contain any character; they are mapped to files
> in an implementation-defined manner, so it's up to the implementation what
> it does with a character sequence that looks like a UCN.
> - The definition of execution (wide) character set was moved to
> including defining what "locale-specific" means.
> - Dropped the (recently added) requirement about encoding consistency
> between the literal encoding and the execution (runtime) encoding
> (reflects existing practice).
> New text:
> "The execution character set and the execution wide-character set are
> supersets of the basic literal character set (5.3 [lex.charset]). The
> encodings of the execution character sets and the sets of additional
> elements (if any) are locale-specific. [ Note: The encoding of the
> execution character sets can be unrelated to any literal encoding.
> -- end note ]"
Would you consider removing the note until we have time to check whether
that is exactly the case?
I think I may have found a way to be a bit more exact than the note (but I
need time to think about it, and I don't think we need to resolve it for
this paper). Everything else is fine by me!
> - Hubert's observation that code unit semantics was changed has been
> addressed; the text now reads
> "A literal encoding encodes each element of the basic literal character
> set as a single code unit with non-negative value, distinct from the
> code unit for any other such element. [ Note: A character not in the
> basic literal character set can be encoded with more than one code unit;
> the value of such a code unit can be the same as that of a code unit
> for an element of the basic literal character set. -- end note ]."
A literal encoding encodes each character as a distinct [Note: or state
of code units. Elements of the basic (literal) character set are encoded as
a single code unit with non-negative value
> The remaining differences with Corentin's P2297R0 are
> - basic literal character set
> There are now four uses of the term in my paper, so it seems to be a useful
> descriptive tool. (Suggestions to unify "basic character set" and
> "basic literal character set" would imply semantic changes to the status
> quo or would use more words, I believe.)
> - translation character set
> We agree there is no difference on the (intended) semantics either way;
> I believe this is simply a question of presentation in the standard.
> My definition (aligned with ISO 10646 terminology) currently reads:
> The translation character set consists of the following elements:
> - each character named by ISO/IEC 10646, as identified by its unique UCS
> scalar value, and
> - a distinct character for each UCS scalar value where no named character
> is assigned.
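As a rough sketch of the definition quoted above: under either branch, each
element of the translation character set corresponds to exactly one UCS
scalar value, that is, a code point that is not a surrogate. The names
`is_ucs_scalar_value` and `translation_set_size` are hypothetical and used
only for illustration.

```cpp
#include <cassert>

// Hypothetical helper: a UCS scalar value is any code point in
// [0, 0x10FFFF] outside the surrogate range [0xD800, 0xDFFF].
constexpr bool is_ucs_scalar_value(char32_t c) {
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

// Number of elements if every scalar value names exactly one element:
// 0x110000 code points minus 0x800 surrogates.
constexpr unsigned long translation_set_size() {
    return 0x110000ul - 0x800ul;
}

static_assert(is_ucs_scalar_value(U'A'));
static_assert(!is_ucs_scalar_value(0xD800));   // surrogate, not an element
static_assert(is_ucs_scalar_value(0x10FFFF));  // last scalar value
static_assert(!is_ucs_scalar_value(0x110000)); // beyond the UCS range
static_assert(translation_set_size() == 1112064ul);
```

Whether a given scalar value has a named character assigned only decides
which of the two bullets covers it; it does not change the count.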