Hi all,

On 30/01/2021 at 12:18, Corentin via SG16 wrote:

On Sat, Jan 30, 2021 at 5:39 AM Hubert Tong <hubert.reinterpretcast@gmail.com> wrote:
On Wed, Jan 27, 2021 at 3:57 AM Corentin via SG16 <sg16@lists.isocpp.org> wrote:

Very quick reminder, using C++20 terminology
We have:

- The basic source character set, which, while of limited use in the core language, is used quite a bit in the library as a proxy for "displayable characters available in all encodings", and whose removal would therefore be slightly more involved.

- The execution character set(s) which describe actual character sets used during evaluation and are therefore necessary.

- The basic execution character set, which is a superset of the basic source character set
and a subset of all execution character sets.

It's strictly basic source character set +  alert +  backspace + carriage return + NULL

Nowhere is it used in the library.
It is not used in the core language either, except of course that we need to prescribe that NULL is encoded as 0 and that digits are encoded sequentially.

While alert, backspace, and carriage return are mentioned in escape sequences, if a theoretical encoding lacked these characters, there would be no further ill effect on the behavior of the standard.

The main change on top of the C++20 wording would be as follows:

The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose value is 0. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets,
You missed a "basic" as applied to "execution character set" here.
the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. The execution character set and the execution wide-character set are implementation-defined supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets and the sets of additional members are locale-specific.

Any reason why we should not do this?
Because the above does not update [intro.memory] and leaves a dangling reference to the meaning of "basic execution character set".

Are you talking about 3.35 [defns.multibyte]?
> sequence of one or more bytes representing a member of the extended character set of either the source or the execution environment
[Note 1: The extended character set is a superset of the basic character set ([lex.charset]). — end note]

If so, sorry I missed that, and yes that would need rewriting, good catch, thanks!
Also, the above wording is currently meant to say (in part) that the characters required as members of the basic execution character sets, when encoded within a "narrow" possibly-multibyte string in any execution coded character set supported by the implementation, are single bytes whose value as read via a glvalue of type `char` is positive. The proposal seems to leave the relevant sentence in a sad state.

I don't think my proposed change (which I meant to be more illustrative) does alter the current meaning significantly. If it does, I am not seeing it.
Are you saying that this could benefit from a more extensive rewrite?
Because I think I'd agree with that.
Maybe listing all the requirements more explicitly?


The execution character set and wide execution character set are implementation-defined character encodings such that:

Years ago -- a decade or so -- I convinced myself that the terms "narrow character set" and "wide character set" were misleading, as my understanding was that the intent was to have one character set per locale, with a narrow encoding (potentially multi-byte, potentially stateful) and a wide encoding where each code point is represented by one code unit. If we are reformulating that area, it may be worthwhile to state what is desired in this respect (either my interpretation or the intent current at the time of the reformulation).

And obviously there is a third encoding which is almost never mentioned: the one used for IO. I think the whole specification was intended to allow an external representation which respects the constraints of neither a narrow nor a wide encoding (think of UTF-16: a zero byte may appear where no NUL is intended, yet it is a multi-code-unit encoding, so not a wide encoding).
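To make the UTF-16 remark concrete, here is a minimal sketch. It assumes little-endian serialisation purely for illustration, and the helper names `utf16le_bytes` and `contains_zero_byte` are hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: serialise UTF-16 code units as bytes.
// Little-endian byte order is an assumption made for this illustration.
inline std::vector<unsigned char> utf16le_bytes(const std::u16string& s) {
    std::vector<unsigned char> out;
    for (char16_t cu : s) {
        out.push_back(static_cast<unsigned char>(cu & 0xFF));        // low byte
        out.push_back(static_cast<unsigned char>((cu >> 8) & 0xFF)); // high byte
    }
    return out;
}

// Hypothetical helper: true if any byte of the external form is zero.
inline bool contains_zero_byte(const std::vector<unsigned char>& bytes) {
    for (unsigned char b : bytes) {
        if (b == 0) return true;
    }
    return false;
}
```

For `u"A"` the external form is the two bytes 0x41 0x00, so a zero byte appears although no NUL was written, which rules it out as a narrow encoding. A character outside the BMP such as `u"\U0001F600"` takes two code units, which rules it out as a wide encoding in the sense above.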

  • Each code unit is represented by a single char or wchar_t respectively
  • Each codepoint is represented by one or more code units.
One code unit for the wide character set.
  • Each member of the basic character set is uniquely represented by a single byte whose value, as read via a glvalue of type `char`, is positive
In the initial shift state. I don't remember there being any constraints on what happens in other shift states, except that a single byte of value 0 is the null character independent of the shift state.
  • The NULL character (U+0000) is represented as a single code unit whose value is 0
  • The code units representing each digit in the basic character set (U+0030 to U+0039) have consecutive values
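Some of the requirements listed above can be checked at compile time against the literal encodings an implementation actually uses. The following is only a sketch: the helper names `digits_are_consecutive` and `all_positive` are hypothetical, and the checks cover the ordinary and wide literal encodings seen during translation, not every locale-dependent execution encoding.

```cpp
#include <cstddef>

// Hypothetical helper: true if the ten characters of a literal such as
// "0123456789" have consecutive code-unit values.
template <typename CharT>
constexpr bool digits_are_consecutive(const CharT (&digits)[11]) {
    for (std::size_t i = 1; i < 10; ++i) {
        if (digits[i] != digits[i - 1] + 1) return false;
    }
    return true;
}

// Hypothetical helper: true if every code unit of the literal, excluding
// the terminating null, has a positive value when read as CharT.
template <typename CharT, std::size_t N>
constexpr bool all_positive(const CharT (&s)[N]) {
    for (std::size_t i = 0; i + 1 < N; ++i) {
        if (!(s[i] > 0)) return false;
    }
    return true;
}

static_assert(digits_are_consecutive("0123456789"), "narrow digits are consecutive");
static_assert(digits_are_consecutive(L"0123456789"), "wide digits are consecutive");
static_assert('\0' == 0 && L'\0' == 0, "the null character has value 0");
static_assert(
    all_positive("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
                 "0123456789 _{}[]#()<>%:;.?*+-/^&|~!=,\\\"'\t\v\f\n"),
    "basic characters are positive single code units when read as char");
```

These assertions say nothing about shift states or about encodings selected at run time through the locale, which is the part of the model that remains hard to state.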

I am very aware that this is extremely clunky, I'm a long way from being able to write good core wording, and I am sorry for that.
Hopefully you get the idea


  • Unfortunately, this paragraph is trying to describe properties of the encoded code units rather than the code points. And because I don't think we care about the actual code point values anywhere (I'd have to double check), it might be better to describe encodings rather than character sets (the requirements on the values of digits, NULL and basic character set elements apply to the encoded form, not the code points - or maybe they need to apply to both, I'm not sure). Of course this change would require further modifications to the rest of the wording.
ISO 2022 (AFAIK equivalent to ECMA-35, which is a more easily obtainable document) was pertinent at the time of design (I don't know if it still is) and doesn't really assign a code point to each character. That may have had an influence.
  • An alternative would be to describe separately the encoding and the character set. I am not sure this is useful given there is only one encoding associated with each character set, so describing the encoding is enough to describe the character set - in other words, the existence of an execution character encoding admits the existence of an execution character set; the reverse is however not the case.
  • In the above wording I've replaced "locale-specific" with "implementation-defined", which I think more accurately reflects how the encoding is determined by compilers during translation, even if the encoding may depend on the system-derived locale during execution. This may be a longer discussion though :)

What do you think?

Library functions may give a different result for a given value if you switch locale (on Linux, see the result of wcrtomb for 0x20AC in fr_FR.ISO-8859-1 and fr_FR.ISO-8859-15@euro, for instance). Giving the impression that there is no such dependency on the locale seems more misleading than clarifying.
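A small sketch of how one might observe this dependency. The helper name `narrow_in_locale` is hypothetical, and whether a given locale name is accepted depends on what is installed on the system, so the function returns an empty string when the locale cannot be selected or the conversion fails.

```cpp
#include <climits>
#include <clocale>
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical helper: convert one wide character with wcrtomb under the
// named locale. Returns the resulting bytes, or an empty string if the
// locale is unavailable or the character cannot be converted.
inline std::string narrow_in_locale(const char* locale_name, wchar_t wc) {
    // Remember the current locale so it can be restored afterwards.
    const char* current = std::setlocale(LC_ALL, nullptr);
    const std::string saved = current ? current : "C";

    std::string result;
    if (std::setlocale(LC_ALL, locale_name) != nullptr) {
        std::mbstate_t state{};  // initial conversion (shift) state
        char buf[MB_LEN_MAX];
        const std::size_t n = std::wcrtomb(buf, wc, &state);
        if (n != static_cast<std::size_t>(-1)) {
            result.assign(buf, n);
        }
    }
    std::setlocale(LC_ALL, saved.c_str());
    return result;
}
```

Calling it with the same wide character value and two different locale names, such as the ones mentioned above, shows whether the conversion result depends on the locale; the outcome is deliberately not stated here since it depends on the platform and the installed locales.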

Obviously, the fact that Linux uses a Unicode encoding as the wide encoding for all character sets, and that Windows and IBM use a 16-bit wchar_t, puts some stress on the intended model as I understand it.


-- Jean-Marc