sg16: Re: [SG16] Is the concept of basic execution character sets useful?

From: Jean-Marc Bourguet <jm_at_[hidden]>
Date: Sat, 30 Jan 2021 17:38:06 +0100

Hi all,

On 30/01/2021 at 12:18, Corentin via SG16 wrote:
>
>
> On Sat, Jan 30, 2021 at 5:39 AM Hubert Tong
> <hubert.reinterpretcast_at_[hidden]
> <mailto:hubert.reinterpretcast_at_[hidden]>> wrote:
>
> On Wed, Jan 27, 2021 at 3:57 AM Corentin via SG16
> <sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
>
> Hello,
>
> Very quick reminder, using C++20 terminology
> We have:
>
> - basic source character set, which, while of limited use in
> the core language is used quite a bit in the library as a
> proxy for "displayable characters available in all
> encodings", which removal would then be slightly more involved.
>
> - The execution character set(s) which describe
> actual character sets used during evaluation and are therefore
> necessary.
>
> - The basic execution character set, which is a super set of
> the basic source character set
> and a subset of all execution character sets.
>
> It's strictly basic source character set + alert + backspace
> + carriage return + NULL
>
> Nowhere is it used in the library.
> It is not used in the core language either, except of course
> that we need to prescribe that NULL is encoded as 0 and that
> digits are encoded sequentially.
>
> While alert + backspace + carriage return are mentioned in
> escape sequences, if a theoretical encoding would miss these
> characters, there would be no further ill-effect on the
> behavior of the standard.
>
> The main change on top of the C++20 wording would be as follows
>
> The basic execution character set and the basic execution
> wide-character set shall each contain all the members of the
> basic source character set, plus control characters
> representing alert, backspace, and carriage return, plus
> a null character (respectively, null wide character), whose
> value is 0. For each basic execution character set, the values
> of the members shall be non-negative and distinct from one
> another. In both the source and execution basic character sets,
>
> You missed a "basic" as applied to "execution character set" here.
>
> the value of each character after 0 in the above list of
> decimal digits shall be one greater than the value of the
> previous. The execution character set and the execution
> wide-character set are implementation-defined supersets of the
> basic execution character set and the basic execution
> wide-character set, respectively. The values of the members of
> the execution character sets and the sets of additional
> members are locale-specific.
>
> Any reason why we should not do this?
>
> Because the above does not update [intro.memory] and leaves a
> dangling reference to the meaning of "basic execution character set".
>
>
> Are you talking about 3.35 [defns.multibyte] ?
> > sequence of one or more bytes representing a member of the extended
> character set of either the source or the execution environment
> [Note 1: The extended character set is a superset of the basic
> character set ([lex.charset]). — end note]
>
> If so, sorry I miss that, and yes that would need rewriting, good
> catch, thanks!
>
> Also, the above wording is currently meant to say (in part) that
> the characters required as members of the basic execution
> character sets, when encoded within a "narrow" possibly-multibyte
> string in any execution coded character set supported by the
> implementation, are single bytes whose value as read via a glvalue
> of type `char` is positive. The proposal seems to leave the
> relevant sentence in a sad state.
>
>
> I don't think my proposed change (which I meant to be more
> illustrative) does alter the current meaning significantly. If it
> does, I am not seeing it.
> If you are saying that this could benefit from a more extensive rewrite?
> Because I think I'd agree with that.
> Maybe listing all the requirements more explicitly?
>
> ------
>
> The execution character set
> <http://eel.is/c++draft/lex.charset#def:character_set,basic_execution> and
> wide execution character set
> <http://eel.is/c++draft/lex.charset#def:character_set,basic_execution> are
> implementation-defined character encodings such that:

Years ago (a decade or so) I convinced myself that the terms "narrow
character set" and "wide character set" were misleading. My understanding
was that the intent was to have one character set per locale, with a
narrow encoding (potentially multi-byte, potentially stateful) and a
wide encoding in which each code point is represented by one code unit.
If we are reformulating that area, it may be worthwhile to state what is
desired in this respect (either my interpretation or the intent current
at the time of the reformulation).

And obviously there is a third encoding which is almost never mentioned:
the one used for IO. I think the whole specification was intended to
allow an external representation that respects the constraints of
neither a narrow nor a wide encoding (think about UTF-16: a zero byte
may appear when no NUL is intended, so it is not a valid narrow
encoding, yet it is a multi-code-unit encoding, so it is not a wide
encoding either).
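
To make the UTF-16 remark concrete, here is a minimal sketch that scans the object representation of a UTF-16 string for zero bytes. The helper name `has_zero_byte` is hypothetical, and the sketch assumes 8-bit bytes and a 16-bit `char16_t`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if any byte of the UTF-16 code units
// in `s` (terminator excluded) is zero.
bool has_zero_byte(const std::u16string& s) {
    // Inspecting an object through unsigned char is always permitted.
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t n = s.size() * sizeof(char16_t);
    for (std::size_t i = 0; i < n; ++i) {
        if (p[i] == 0) return true;
    }
    return false;
}
```

Under those assumptions, `has_zero_byte(u"A")` is true regardless of byte order, because U+0041 is the single code unit 0x0041. A consumer that treats a zero byte as a terminator would therefore stop in the middle of the text.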

> * Each code unit is represented by a single char or wchar_t respectively
> * Each codepoint is represented by one or more code units.
>
One code unit for the wide character set.
>
> * Each member of the basic character set is uniquely represented by
> a single byte whose value, as read via a glvalue of type `char`,
> is positive
>
In the initial shift state. I don't remember there being any constraints
on what happens in other shift states, except that a single byte of
value 0 is the null character independent of the shift state.
>
> * The NULL character (U+0000) is represented as a single code unit
> whose value is 0
> * The code units representing each digit in the basic character set
> (U+0030 to U+0039) have consecutive values
>
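
As an illustration only, the last two requirements in the list above can be checked at compile time for the encodings the compiler uses for literals. This is a sketch: the function name `consecutive` is hypothetical, and the checks say nothing about encodings selected by the locale at run time.

```cpp
#include <cassert>

// Hypothetical helper: true if each of the ten digits has a value one
// greater than the previous one. The array holds 10 digits plus the
// terminating null character.
template <typename CharT>
constexpr bool consecutive(const CharT (&digits)[11]) {
    for (int i = 1; i < 10; ++i) {
        if (digits[i] != digits[i - 1] + 1) return false;
    }
    return true;
}

// Checked for the narrow and wide literal encodings of this compiler.
static_assert(consecutive("0123456789"), "narrow digits are consecutive");
static_assert(consecutive(L"0123456789"), "wide digits are consecutive");

// The null character is a single code unit whose value is 0.
static_assert('\0' == 0, "narrow null character is 0");
static_assert(L'\0' == 0, "wide null character is 0");

// A basic character, read as a char, has a positive value.
static_assert('a' > 0, "basic character is positive as char");
```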
> -----
>
> I am very aware that this is extremely clunky, I'm a long way from
> being able to write good core wording, and I am sorry for that.
> Hopefully you get the idea
>
> Reasoning:
>
> * Unfortunately, this paragraph is trying to describe properties of
> the encoded code units rather than the code points. And because I
> don't think we care about the actual code points values anywhere
> (I'd have to double check) it might be better to describe
> encodings rather than character sets (The requirements on the
> value of digits, NULL and basic character sets elements apply to
> the encoded form, not the codepoints - or maybe it needs to apply
> to both, I'm not sure). Of course this change would have further
> modification on the rest of the wording.
>
ISO 2022 (AFAIK equivalent to ECMA-35, which is more easily obtainable)
was pertinent at the time of design (I don't know if it still is) and
doesn't really assign a code point to each character. That may have had
an influence.
>
> * An alternative would be to describe separately the encoding and
> the character set. I am not sure this is useful given there is
> only one encoding associated with each character set, so
> describing the encoding is enough to describe the character set -
> in other words, the existence of an execution character encoding
> admits the existence of an execution character set; the reverse is
> however not the case.
> * I've replaced in the above wording 'locale specific" by
> "implementation-defined", which I think is more accurate of how
> the encoding is determined by compilers during translation, even
> if the encoding may depend on the system derived from locale
> during execution. This may be a longer discussion though :)
>
>
> What do you think ?

Library functions may return a different result for a given value if you
switch locale (on Linux, see the result of wcrtomb for 0x20AC in
fr_FR.ISO-8859-1 and fr_FR.ISO-8859-9_at_[hidden] for instance). Giving the
impression that there is no such dependency on the locale seems more
misleading than clarifying.
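
A minimal sketch of how one might observe this dependency follows. The helper name `encode_in_locale` is hypothetical. The locale names are whatever the caller passes in, and they may not be installed on a given system, in which case the helper reports no result.

```cpp
#include <cassert>
#include <climits>
#include <clocale>
#include <cstddef>
#include <cwchar>
#include <optional>
#include <string>

// Hypothetical helper: encodes one wide character with wcrtomb under the
// named LC_CTYPE locale. Returns std::nullopt if the locale cannot be
// selected or the character has no narrow representation in it.
std::optional<std::string> encode_in_locale(const char* locale_name, wchar_t wc) {
    assert(locale_name != nullptr);

    // Remember the current LC_CTYPE setting so it can be restored.
    const char* current = std::setlocale(LC_CTYPE, nullptr);
    const std::string saved = current ? current : "C";

    if (std::setlocale(LC_CTYPE, locale_name) == nullptr) {
        return std::nullopt;  // locale not available on this system
    }

    char buf[MB_LEN_MAX];
    std::mbstate_t state{};  // start from the initial shift state
    const std::size_t n = std::wcrtomb(buf, wc, &state);

    std::setlocale(LC_CTYPE, saved.c_str());

    if (n == static_cast<std::size_t>(-1)) {
        return std::nullopt;  // wcrtomb reported an encoding error
    }
    return std::string(buf, n);
}
```

Calling the helper with the same `wchar_t` value and two different locale names, for example `encode_in_locale("fr_FR.ISO-8859-1", wchar_t(0x20AC))` and the same call with another installed locale, lets the two results be compared directly. Either call may return no value, depending on what is installed and on whether the character exists in that locale's character set.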

Obviously, the fact that Linux uses a Unicode encoding as the wide
encoding for all character sets, and that Windows and IBM use a 16-bit
wchar_t, puts some stress on the intended model as I understand it.

Yours,

-- Jean-Marc

Received on 2021-01-30 10:38:16