Subject: Re: Updates for D2314R1: Character sets and encodings
From: Jens Maurer (Jens.Maurer_at_[hidden])
Date: 2021-03-02 14:38:28

On 25/02/2021 10.21, Corentin Jabot wrote:
> On Thu, Feb 25, 2021 at 9:27 AM Jens Maurer via SG16 <sg16_at_[hidden]> wrote:

>  - Dropped the (recently added) requirement about encoding consistency
> between the literal encoding and the execution (runtime) encoding
> (reflects existing practice).
> New text:
> "The execution character set and the execution wide-character set are supersets
> of the basic literal character set (5.3 [lex.charset]). The encodings of the
> execution character sets and the sets of additional elements (if any) are
> locale-specific. [ Note: The encoding of the execution character sets can be
> unrelated to any literal encoding. -- end note ]"
> Would you consider removing the note until we get time to see if that is exactly the case?

I think we've meanwhile seen that this is, in practice, the case,
undesirable as that might be. I think the note is important
to highlight this (surprising) state of affairs.
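
A minimal sketch of why the note matters, under stated assumptions: the bytes below are what a compiler with a UTF-8 literal encoding would store for U+00E9, written as hex escapes so the example does not depend on the compiler actually used. The two counting helpers are hypothetical stand-ins for a runtime whose locale encoding is single-byte (such as ISO 8859-1) or UTF-8; no real locale is consulted.

```cpp
#include <cstddef>
#include <string>

// Bytes that a UTF-8 literal encoding would produce for U+00E9.
// Hex escapes keep the sketch independent of the real literal encoding.
std::string literal_bytes() {
    return "\xC3\xA9";
}

// Hypothetical runtime view 1: a single-byte execution encoding,
// where every byte is one character.
std::size_t count_chars_single_byte(const std::string& s) {
    return s.size();
}

// Hypothetical runtime view 2: a UTF-8 execution encoding.
// Counts every byte that is not a continuation byte (10xxxxxx).
std::size_t count_chars_utf8(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

The same literal bytes are two characters under the first view and one character under the second, which is the kind of mismatch that can arise when the execution encoding is unrelated to the literal encoding.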

> I think I may have found a way to be a bit more exact than the note (but I need time to think about it, and I don't think we need to resolve it for this paper). Everything else is fine by me!

Then, let's have your future paper remove or alter that note.

>  - Hubert's observation that code unit semantics was changed has been fixed;
> the text now reads
> "A literal encoding encodes each element of the basic literal character
> set as a single code unit with non-negative value, distinct from the
> code unit for any other such element. [ Note: A character not in the
> basic literal character set can be encoded with more than one code unit;
> the value of such a code unit can be the same as that of a code unit
> for an element of the basic literal character set. -- end note ]."
> Suggestion:
> A literal encoding encodes each character as a distinct [Note: or state shifted] sequence
> of code units. Elements of the basic (literal) character set are encoded as a single code unit with non-negative value

This doesn't say the same thing.
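
The note in the paper's wording quoted above can be illustrated with Shift-JIS, where the second code unit of a two-byte character may have the value 0x5C, the same value an ASCII-compatible encoding uses for the backslash. The sketch below is an illustration only: the function names are hypothetical, and the lead-byte ranges are the commonly documented ones for Shift-JIS, not something taken from the paper.

```cpp
#include <cstddef>
#include <string>

// Code unit value used for the backslash in ASCII-compatible encodings.
constexpr unsigned char kBackslash = 0x5C;

// True for bytes that start a two-byte Shift-JIS character
// (commonly documented ranges; assumed here).
bool is_sjis_lead_byte(unsigned char c) {
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

// Naive search: treats every code unit as a character of its own.
std::size_t find_backslash_naive(const std::string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (static_cast<unsigned char>(s[i]) == kBackslash) {
            return i;
        }
    }
    return std::string::npos;
}

// Encoding-aware search: skips the trail byte of two-byte characters.
std::size_t find_backslash_sjis(const std::string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (is_sjis_lead_byte(c)) {
            ++i;  // step over the trail byte as well
            continue;
        }
        if (c == kBackslash) {
            return i;
        }
    }
    return std::string::npos;
}
```

For the two-byte sequence 0x83 0x5C (KATAKANA LETTER SO in Shift-JIS), the naive search reports a backslash at index 1, while the encoding-aware search reports none. A single code unit for each basic character therefore does not imply that every code unit with that value is that character.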

Ignoring the note (which is non-normative), I think this is
wrong for shift-state encodings. Also, I think both the
status quo formulation and my paper allow encoding
(many) otherwise unencodable non-basic characters as "?" or
similar, which I think happens in practice for some implementations.
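
The shift-state point can be sketched with a small, hypothetical decoder for a subset of ISO-2022-JP. It assumes only two escape sequences (ESC ( B for ASCII, ESC $ B for JIS X 0208), does not handle malformed input, and tags two-byte values with an arbitrary marker purely for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Marker OR-ed into values decoded in two-byte mode so they cannot be
// confused with single-byte values. Arbitrary, for illustration only.
constexpr std::uint32_t kDoubleByteTag = 0x10000;

// Decodes a subset of ISO-2022-JP into tagged code values.
// ESC ( B selects ASCII; ESC $ B selects JIS X 0208 (two bytes per character).
std::vector<std::uint32_t> decode_iso2022jp_subset(const std::string& s) {
    std::vector<std::uint32_t> out;
    bool double_byte = false;  // the shift state; decoding starts in ASCII
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c == 0x1B && i + 2 < s.size()) {
            const unsigned char a = static_cast<unsigned char>(s[i + 1]);
            const unsigned char b = static_cast<unsigned char>(s[i + 2]);
            if (a == 0x28 && b == 0x42) {  // ESC ( B
                double_byte = false;
                i += 3;
                continue;
            }
            if (a == 0x24 && b == 0x42) {  // ESC $ B
                double_byte = true;
                i += 3;
                continue;
            }
        }
        if (double_byte && i + 1 < s.size()) {
            const unsigned char d = static_cast<unsigned char>(s[i + 1]);
            out.push_back(kDoubleByteTag | (static_cast<std::uint32_t>(c) << 8) | d);
            i += 2;
        } else {
            out.push_back(c);
            i += 1;
        }
    }
    return out;
}
```

With this decoder, the code units 0x24 0x22 yield two single-byte characters in the initial state but one two-byte character after ESC $ B. Conversely, prefixing a redundant ESC ( B gives a second code unit sequence for the same two ASCII characters. Both effects show why "each character as a distinct sequence of code units" does not hold for such encodings.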

