sg16: Re: [SG16] Is the concept of basic execution character sets useful?

From: Corentin <corentin.jabot_at_[hidden]>
Date: Sat, 30 Jan 2021 12:18:20 +0100

On Sat, Jan 30, 2021 at 5:39 AM Hubert Tong <
hubert.reinterpretcast_at_[hidden]> wrote:

> On Wed, Jan 27, 2021 at 3:57 AM Corentin via SG16 <sg16_at_[hidden]>
> wrote:
>
>> Hello,
>>
>> Very quick reminder, using C++20 terminology
>> We have:
>>
>> - basic source character set, which, while of limited use in the core
>> language, is used quite a bit in the library as a proxy for "displayable
>> characters available in all encodings", and whose removal would therefore
>> be slightly more involved.
>>
>> - The execution character set(s) which describe actual character sets
>> used during evaluation and are therefore necessary.
>>
>> - The basic execution character set, which is a superset of the basic
>> source character set
>> and a subset of all execution character sets.
>>
>> It's strictly basic source character set + alert + backspace + carriage
>> return + NULL
>>
>> Nowhere is it used in the library.
>> It is not used in the core language either, except of course that we need
>> to prescribe that NULL is encoded as 0 and that digits are encoded
>> sequentially.
>>
>> While alert + backspace + carriage return are mentioned in escape
>> sequences, if a theoretical encoding would miss these characters, there
>> would be no further ill-effect on the behavior of the standard.
>>
>> The main change on top of the C++20 wording would be as follows:
>>
>> The basic execution character set and the basic execution wide-character
>> set shall each contain all the members of the basic source character set, plus
>> control characters representing alert, backspace, and carriage return, plus
>> a null character (respectively, null wide character), whose value is 0. For
>> each basic execution character set, the values of the members shall be
>> non-negative and distinct from one another. In both the source and
>> execution basic character sets,
>>
> You missed a "basic" as applied to "execution character set" here.
>
>
>> the value of each character after 0 in the above list of decimal digits
>> shall be one greater than the value of the previous. The execution
>> character set and the execution wide-character
>> set are implementation-defined supersets of the basic execution character
>> set and the basic execution wide-character set, respectively. The values
>> of the members of the execution character sets and the sets of additional
>> members are locale-specific.
>>
>> Any reason why we should not do this?
>>
> Because the above does not update [intro.memory] and leaves a dangling
> reference to the meaning of "basic execution character set".
>

Are you talking about 3.35 [defns.multibyte]?
> sequence of one or more bytes representing a member of the extended
character set of either the source or the execution environment
[Note 1: The extended character set is a superset of the basic character
set ([lex.charset]). — end note]

If so, sorry I missed that, and yes, that would need rewriting. Good catch,
thanks!

> Also, the above wording is currently meant to say (in part) that the
> characters required as members of the basic execution character sets, when
> encoded within a "narrow" possibly-multibyte string in any execution coded
> character set supported by the implementation, are single bytes whose value
> as read via a glvalue of type `char` is positive. The proposal seems to
> leave the relevant sentence in a sad state.
>

I don't think my proposed change (which I meant to be more illustrative)
alters the current meaning significantly. If it does, I am not
seeing it.
Or are you saying that this could benefit from a more extensive rewrite?
If so, I think I'd agree with that.
Maybe by listing all the requirements more explicitly?

------

The execution character set
<http://eel.is/c++draft/lex.charset#def:character_set,basic_execution> and
wide execution character set
<http://eel.is/c++draft/lex.charset#def:character_set,basic_execution> are
implementation-defined character encodings such that:

   - Each code unit is represented by a single char or wchar_t, respectively
   - Each code point is represented by one or more code units
   - Each member of the basic character set is uniquely represented by a
   single byte whose value, as read via a glvalue of type `char`, is
   positive
   - The NULL character (U+0000) is represented as a single code unit whose
   value is 0
   - The code units representing each digit in the basic character set
   (U+0030 to U+0039) have consecutive values

-----

I am very aware that this is extremely clunky. I'm a long way from being
able to write good core wording, and I am sorry for that.
Hopefully you get the idea.

Reasoning:

   - Unfortunately, this paragraph is trying to describe properties of the
   encoded code units rather than of the code points. And because I don't
   think we care about the actual code point values anywhere (I'd have to
   double check), it might be better to describe encodings rather than
   character sets. (The requirements on the values of digits, NULL and the
   basic character set elements apply to the encoded form, not to the code
   points. Or maybe they need to apply to both, I'm not sure.) Of course
   this change would require further modifications to the rest of the wording.
   - An alternative would be to describe separately the encoding and the
   character set. I am not sure this is useful given there is only one
   encoding associated with each character set, so describing the encoding is
   enough to describe the character set - in other words, the existence of an
   execution character encoding admits the existence of an execution character
   set; the reverse is however not the case.
   - In the above wording I've replaced "locale-specific" with
   "implementation-defined", which I think describes more accurately how the
   encoding is determined by compilers during translation, even if the
   encoding may depend on the locale obtained from the system during
   execution. This may be a longer discussion though :)
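
As a small illustration of the code unit / code point distinction the first
point relies on, here is a sketch that counts both for one string. It assumes
UTF-8 as the encoding purely for the sake of the example (the standard does
not require it), and the names is_continuation, count_code_points and sample
are hypothetical. It needs C++14 or later for the constexpr loop.

```cpp
#include <cstddef>

// In UTF-8, continuation code units have the bit pattern 10xxxxxx.
constexpr bool is_continuation(unsigned char unit) {
    return (unit & 0xC0u) == 0x80u;
}

// Counts code points in n UTF-8 code units by counting the code units
// that start a sequence. Assumes the input is well-formed UTF-8.
constexpr std::size_t count_code_points(const char* s, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (!is_continuation(static_cast<unsigned char>(s[i]))) ++count;
    }
    return count;
}

// U+0061, U+00E9 and U+20AC, spelled as explicit UTF-8 code units so the
// example does not depend on the source or literal encoding.
constexpr char sample[] = "a\xC3\xA9\xE2\x82\xAC";
constexpr std::size_t sample_units = sizeof(sample) - 1;  // drop the terminator

static_assert(sample_units == 6, "1 + 2 + 3 code units");
static_assert(count_code_points(sample, sample_units) == 3, "three code points");
```

Only the first character here is a member of the basic character set, and it
is also the only one encoded as a single positive code unit, which is the
property the requirements above are trying to pin down.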

What do you think?

>
> That said, the idea of basic execution character sets (a "narrow" one and
> a wide one) for which the characters have (encoding) values somewhat
> implies but fails to really say certain things that are not true. We are,
> therefore, indeed better off with shifting the talk of encoding values to
> the locale-specific execution narrow/wide coded character sets. In other
> words, I think the homework here is to be better at saying "coded character
> set" when we want to.
>
>
>>
>> (As always, I'm interested in having a simple model with no
>> unnecessary terminology as, as observed these past few months, it has a
>> tendency to hinder our collective understanding)
>>
>> Corentin
>

Received on 2021-01-30 05:18:34