Re: [SG16] Updates for D2314R1: Character sets and encodings

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Tue, 2 Mar 2021 21:38:28 +0100
On 25/02/2021 10.21, Corentin Jabot wrote:
>
>
> On Thu, Feb 25, 2021 at 9:27 AM Jens Maurer via SG16 <sg16_at_[hidden]> wrote:

> - Dropped the (recently added) requirement about encoding consistency
> between the literal encoding and the execution (runtime) encoding
> (reflects existing practice).
>
> New text:
>
> "The execution character set and the execution wide-character set are supersets
> of the basic literal character set (5.3 [lex.charset]). The encodings of the
> execution character sets and the sets of additional elements (if any) are
> locale-specific. [ Note: The encoding of the execution character sets can be
> unrelated to any literal encoding. -- end note ]"
>
>
> Would you consider removing the note until we get time to see if that is exactly the case?

I think we've since seen that this is, in practice, the case,
as undesirable as that might be. I think the note is important
to highlight this (surprising) state of affairs.
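
To illustrate the point of the note, here is a minimal C++ sketch (not part of the paper). The bytes of a string literal are fixed at translation time, whereas std::mbrtowc interprets bytes according to whatever locale is active at run time. The helper name count_chars_in_locale is hypothetical.

```cpp
#include <cstddef>
#include <cwchar>

// Counts the characters in bytes[0..n) as decoded by the multibyte encoding
// of the current C locale. Returns -1 if the bytes are invalid or incomplete
// in that encoding. (Hypothetical helper, for illustration only.)
inline long count_chars_in_locale(const char* bytes, std::size_t n) {
    std::mbstate_t state{};
    long count = 0;
    std::size_t i = 0;
    while (i < n) {
        wchar_t wc;
        std::size_t r = std::mbrtowc(&wc, bytes + i, n - i, &state);
        if (r == static_cast<std::size_t>(-1) ||
            r == static_cast<std::size_t>(-2))
            return -1;  // invalid or truncated sequence in this locale
        i += (r == 0) ? 1 : r;  // r == 0 means a null character was decoded
        ++count;
    }
    return count;
}
```

The same two bytes, say 0xC3 0xA9, would be expected to count as one character after std::setlocale selects a UTF-8 locale and as two under a Latin-1 locale, regardless of the literal encoding the compiler used; which locales are available is platform-specific.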

> I think I may have found a way to be a bit more exact than the note (but I need time to think about it, and I don't think we need to resolve it for this paper). Everything else is fine by me!

Then, let's have your future paper remove or alter that note.

> - Hubert's observation that code unit semantics was changed has been fixed;
> the text now reads
>
> "A literal encoding encodes each element of the basic literal character
> set as a single code unit with non-negative value, distinct from the
> code unit for any other such element. [ Note: A character not in the
> basic literal character set can be encoded with more than one code unit;
> the value of such a code unit can be the same as that of a code unit
> for an element of the basic literal character set. -- end note ]."
>
>
> Suggestion:
> A literal encoding encodes each character as a distinct [Note: or state shifted] sequence
> of code units. Elements of the basic (literal) character set are encoded as a single code unit with non-negative value

This doesn't say the same thing.

Ignoring the note (which is non-normative), I think this is
wrong for shift-state encodings. Also, I think both the
status quo formulation and my paper allow an encoding
of (many) otherwise unencodable non-basic characters as "?" or
similar, which I think happens in practice for some implementations.
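
As a rough illustration of the wording from the paper quoted above (a sketch, assuming C++17; it is not taken from the paper), the first half checks at compile time that a sample of basic literal characters is encoded as single, distinct, non-negative code units in the ordinary literal encoding. The second half spells out the Shift-JIS bytes for U+30BD by hand to show a trailing code unit that has the same value as the ASCII code unit for backslash. The names basic_sample, single_distinct_nonnegative, sjis_so and contains_unit are hypothetical.

```cpp
#include <cstddef>
#include <string_view>

// A sample of the basic literal character set (letters, digits, space and
// common punctuation); the full set also contains control characters.
inline constexpr std::string_view basic_sample =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789 _{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";

// True if every code unit is non-negative and no two code units are equal.
constexpr bool single_distinct_nonnegative(std::string_view s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] < 0) return false;
        for (std::size_t j = i + 1; j < s.size(); ++j)
            if (s[i] == s[j]) return false;
    }
    return true;
}

// 92 characters in the sample, so 92 code units if each takes exactly one.
static_assert(basic_sample.size() == 92);
static_assert(single_distinct_nonnegative(basic_sample));

// Shift-JIS encodes KATAKANA LETTER SO (U+30BD) as the bytes 0x83 0x5C.
// The bytes are written out explicitly so that the example does not depend
// on the literal encoding of the compiler.
inline constexpr unsigned char sjis_so[] = {0x83, 0x5C};

// True if the code unit u occurs anywhere in p[0..n).
constexpr bool contains_unit(const unsigned char* p, std::size_t n,
                             unsigned char u) {
    for (std::size_t i = 0; i < n; ++i)
        if (p[i] == u) return true;
    return false;
}

// A code-unit-wise search for 0x5C (backslash in ASCII) matches the trailing
// byte of a non-basic character, as the note in the paper permits.
static_assert(contains_unit(sjis_so, 2, 0x5C));
```

This is why the distinctness requirement only applies among elements of the basic literal character set: code that scans for a basic character one code unit at a time can still hit the interior of a multi-unit character.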

Jens

Received on 2021-03-02 14:38:32