
SG16


Subject: Re: Updates for D2314R1: Character sets and encodings
From: Corentin Jabot (corentinjabot_at_[hidden])
Date: 2021-02-25 03:21:52


On Thu, Feb 25, 2021 at 9:27 AM Jens Maurer via SG16 <sg16_at_[hidden]> wrote:

>
> https://wiki.edg.com/pub/Wg21virtual2021-02/SG16/d2314r1.html
>
> In response to yesterday's discussion:
>
> - header-names can now contain any character; they are mapped to files
> in an implementation-defined manner, so it's up to the implementation what
> it does with a character sequence that looks like a UCN.
>

Great!

>
> - The definition of execution (wide) character set was moved to
> [character.seq], including defining what "locale-specific" means.
>
Great!

>
> - Dropped the (recently added) requirement about encoding consistency
> between the literal encoding and the execution (runtime) encoding
> (reflects existing practice).
>
> New text:
>
> "The execution character set and the execution wide-character set are
> supersets of the basic literal character set (5.3 [lex.charset]). The
> encodings of the execution character sets and the sets of additional
> elements (if any) are locale-specific. [ Note: The encoding of the
> execution character sets can be unrelated to any literal encoding.
> -- end note ]"
>

Would you consider removing the note until we have time to check whether
that is exactly the case?
I think I may have found a way to be a bit more exact than the note (but I
need time to think about it, and I don't think we need to resolve it for
this paper). Everything else is fine by me!

>
> - Hubert's observation that code unit semantics was changed has been fixed;
> the text now reads
>
> "A literal encoding encodes each element of the basic literal character
> set as a single code unit with non-negative value, distinct from the
> code unit for any other such element. [ Note: A character not in the
> basic literal character set can be encoded with more than one code unit;
> the value of such a code unit can be the same as that of a code unit
> for an element of the basic literal character set. -- end note ]."
>

Suggestion:
A literal encoding encodes each character as a distinct [Note: or
state-shifted] sequence of code units. Elements of the basic (literal)
character set are encoded as a single code unit with non-negative value.
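
To make the two properties concrete, here is a rough sketch (not proposed
wording, and not taken from either paper). It checks at compile time that an
illustrative subset of the basic literal character set is encoded as distinct,
non-negative single code units in whatever ordinary literal encoding the
compiler uses. It also shows the situation the note describes, using the
Windows code page 932 bytes for U+30BD written out by hand as an assumption:
the trailing code unit 0x5C has the same value that encoding uses for
backslash. All names below are hypothetical.

```cpp
#include <cstddef>
#include <string_view>

// Illustrative subset of the basic literal character set (hypothetical name).
inline constexpr std::string_view kBasicChars =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"' \t\v\f\n\a\b\r";

// True if every element is a non-negative code unit that differs from the
// code unit of every other element.
constexpr bool single_distinct_nonnegative(std::string_view s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] < 0) return false;
        for (std::size_t j = i + 1; j < s.size(); ++j)
            if (s[i] == s[j]) return false;
    }
    return true;
}

// Holds for the ordinary literal encoding of the compiler building this file.
static_assert(single_distinct_nonnegative(kBasicChars));

// Assumed code page 932 encoding of U+30BD (KATAKANA LETTER SO): two code
// units, the second of which equals the code unit for backslash (0x5C).
inline constexpr unsigned char kCp932So[] = {0x83, 0x5C};

// True if any code unit in [p, p + n) has the given value.
constexpr bool contains_unit(const unsigned char* p, std::size_t n,
                             unsigned char unit) {
    for (std::size_t i = 0; i < n; ++i)
        if (p[i] == unit) return true;
    return false;
}

static_assert(contains_unit(kCp932So, 2, 0x5C));
```

The second assertion is why a byte-wise search for a basic character is not
safe in such an encoding, which is the case the note is meant to permit.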

>
>
> The remaining differences with Corentin's P2297R0 are
>
> - basic literal character set
>
> There are now four uses of the term in my paper, so it seems to be a useful
> descriptive tool. (Suggestions to unify "basic character set" and
> "basic literal character set" would imply semantic changes to the status
> quo or would use more words, I believe.)
>
>
> - translation character set
>
> We agree there is no difference on the (intended) semantics either way;
> I believe this is simply a question of presentation in the standard.
> My definition (aligned with ISO 10646 terminology) currently reads:
>
> The translation character set consists of the following elements:
>
> - each character named by ISO/IEC 10646, as identified by its unique UCS
> scalar value, and
> - a distinct character for each UCS scalar value where no named character
> is assigned.
>
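
For readers less familiar with the ISO 10646 term, a small sketch of what "UCS
scalar value" means in practice may help: any code point from 0 to 0x10FFFF
except the surrogate range 0xD800 to 0xDFFF. Under the definition quoted
above, the translation character set would then have one element per scalar
value, named or not. The function and constant names are hypothetical and
this is only an illustration, not wording.

```cpp
#include <cstdint>

// True if cp is a UCS scalar value: a code point in [0, 0x10FFFF] that is
// not a surrogate code point (0xD800 to 0xDFFF).
constexpr bool is_ucs_scalar_value(std::uint32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Number of scalar values: all code points minus the 2048 surrogates.
inline constexpr std::uint32_t kScalarValueCount = 0x110000 - 0x800;

static_assert(is_ucs_scalar_value(0x41));       // LATIN CAPITAL LETTER A
static_assert(!is_ucs_scalar_value(0xD800));    // first surrogate
static_assert(is_ucs_scalar_value(0x10FFFF));   // last code point
static_assert(!is_ucs_scalar_value(0x110000));  // beyond the code space
static_assert(kScalarValueCount == 1112064);
```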
>
> Jens
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>


