sg16: Re: [SG16] Is the concept of basic execution character sets useful?

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Sat, 30 Jan 2021 11:54:31 -0500

On Sat, Jan 30, 2021 at 6:18 AM Corentin <corentin.jabot_at_[hidden]> wrote:

>
>
> On Sat, Jan 30, 2021 at 5:39 AM Hubert Tong <
> hubert.reinterpretcast_at_[hidden]> wrote:
>
>> On Wed, Jan 27, 2021 at 3:57 AM Corentin via SG16 <sg16_at_[hidden]>
>> wrote:
>>
>>>
>>> The main change on top of the C++20 wording would be as follows:
>>>
>>> The basic execution character set and the basic execution
>>> wide-character set shall each contain all the members of the basic source
>>> character set, plus control characters representing alert, backspace,
>>> and carriage return, plus a null character (respectively, null wide
>>> character), whose value is 0. For each basic execution character set,
>>> the values of the members shall be non-negative and distinct from one
>>> another. In both the source and execution basic character sets,
>>>
>> You missed a "basic" as applied to "execution character set" here.
>>
>>
>>> the value of each character after 0 in the above list of decimal digits
>>> shall be one greater than the value of the previous. The execution
>>> character set and the execution wide-character
>>> set are implementation-defined supersets of the basic execution character
>>> set and the basic execution wide-character set, respectively. The
>>> values of the members of the execution character sets and the sets of
>>> additional members are locale-specific.
>>>
>>> Any reason why we should not do this?
>>>
>> Because the above does not update [intro.memory] and leaves a dangling
>> reference to the meaning of "basic execution character set".
>>
>
> Are you talking about 3.35 [defns.multibyte]?
> > sequence of one or more bytes representing a member of the extended
> character set of either the source or the execution environment
> [Note 1: The extended character set is a superset of the basic character
> set ([lex.charset]). — end note]
>
> If so, sorry, I missed that, and yes that would need rewriting, good catch,
> thanks!
>
That wasn't the reference I meant, although I guess there is some action to
be taken with that note too. I doubt that we have any need to talk about a
character being multibyte in the source code encoding. As for the actual
term of "multibyte" itself: I think this particular term is not just a
wording detail, so I am not inclined to rename it.

The reference I had meant was [intro.memory] p1:
A byte is at least large enough to contain any member of the basic
execution character set [ ... ]
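The [intro.memory] requirement can be spot-checked at compile time. The following is only a sketch: the name `basic_sample` is made up for illustration, and the sample covers just the digits and lowercase letters rather than the full basic execution character set.

```cpp
#include <cassert>
#include <climits>

// A byte must be able to hold any member of the basic execution
// character set, so it has at least 8 bits.
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");
static_assert(sizeof(char) == 1, "char is one byte by definition");

// Hypothetical sample: 10 digits and 26 lowercase letters.
constexpr char basic_sample[] = "0123456789abcdefghijklmnopqrstuvwxyz";

// 36 characters plus the terminating null. If any of these characters
// needed more than one byte in the ordinary literal encoding, the
// array would be longer than 37 bytes.
static_assert(sizeof(basic_sample) == 36 + 1,
              "each sampled basic character occupies one byte");
```

If any sampled character took more than one byte, the last assertion would fail to compile.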

>
>
>> Also, the above wording is currently meant to say (in part) that the
>> characters required as members of the basic execution character sets, when
>> encoded within a "narrow" possibly-multibyte string in any execution coded
>> character set supported by the implementation, are single bytes whose value
>> as read via a glvalue of type `char` is positive. The proposal seems to
>> leave the relevant sentence in a sad state.
>>
>
> I don't think my proposed change (which I meant to be more illustrative)
> does alter the current meaning significantly. If it does, I am not
> seeing it.
>
The meaning is currently possible to make sense of because the "value" can
be understood to be that of a single code unit. By instead talking about
members of the execution (coded) character set in general, the requirement
upon the "value" becomes abstract in relation to the specification (because
the method to observe said value is unclear for multibyte characters or in
encodings of the coded character set that do not map the value in a direct
manner).
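To illustrate the point about observing the "value", here is a minimal sketch. The helper names `code_units` and `all_positive` are made up for this example, and a UTF-8 literal is used for the multibyte case so that the result does not depend on the implementation's ordinary literal encoding.

```cpp
#include <cassert>
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// True if every element, read through a glvalue of type char, is positive.
constexpr bool all_positive(const char* s) {
    for (; *s != '\0'; ++s) {
        if (*s <= 0) return false;
    }
    return true;
}

// For basic characters the "value" is that of a single code unit,
// and it is directly observable.
static_assert(code_units("A") == 1, "one code unit per basic character");
static_assert(all_positive("0123456789abcxyzABCXYZ_{}"),
              "sampled basic characters read as positive char values");

// U+00E9 takes two code units in UTF-8, so there is no single code
// unit value that could be called the value of the character.
static_assert(code_units(u8"\u00e9") == 2, "two UTF-8 code units");
```

The last assertion is the problematic case: a requirement on "the value" of such a character has nothing concrete to refer to.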

> Are you saying that this could benefit from a more extensive rewrite?
>
I think what we're seeing here is that the status quo "works" due to being
uniformly fuzzy. This is somewhat analogous to why we needed Davis's name
lookup rewrite.

> Because I think I'd agree with that.
> Maybe listing all the requirements more explicitly?
>
> ------
>
> The execution character set
> <http://eel.is/c++draft/lex.charset#def:character_set,basic_execution> and
> wide execution character set
> <http://eel.is/c++draft/lex.charset#def:character_set,basic_execution> are
> implementation-defined character encodings such that:
>
I really think we need to rename the above terms. In the definition here,
we're beyond the coded character set level and at the level of an encoding
form...

>
> - Each code unit is represented by a single char or wchar_t
> respectively
> - Each codepoint is represented by one or more code units.
>
Has consensus been found that UTF-16 is a valid wide execution character
set (encoding)? Are there general library facilities to handle conversion
of strings from wide execution character set (encoding)s with characters
that are encoded in more than one wchar_t code unit to other encodings?
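A small sketch of what "more than one wchar_t code unit" looks like in practice. The names `unit_count`, `utf16_units`, `utf32_units` and `wide_units` are made up for illustration, and the comment about the wide literal is an assumption about typical implementations rather than something the standard guarantees.

```cpp
#include <cassert>
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
template <class CharT, std::size_t N>
constexpr std::size_t unit_count(const CharT (&)[N]) {
    return N - 1;
}

// U+1F600 lies outside the Basic Multilingual Plane.
constexpr std::size_t utf16_units = unit_count(u"\U0001F600");
constexpr std::size_t utf32_units = unit_count(U"\U0001F600");

// Implementation-dependent. Assumption: typically 2 where wchar_t is
// 16 bits wide and 1 where it is 32 bits wide.
constexpr std::size_t wide_units = unit_count(L"\U0001F600");

static_assert(utf16_units == 2, "a surrogate pair in UTF-16");
static_assert(utf32_units == 1, "a single code unit in UTF-32");
```

Where `wide_units` is 2, any interface that converts one `wchar_t` at a time cannot see the whole character, which is the concern behind the question above.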

>
> - Each member of the basic character set is uniquely represented by a
> single byte whose value, as read via a glvalue of type `char`, is
> positive
>
I think "basic character set" above isn't just the basic source character
set. I think "basic execution character set" as a term happens to be the
right name for what we need (just that the current definition is not what
we want; we don't want a coded character set, and there shouldn't be a
"narrow" and wide version). Also: The second half should read "... single
code unit whose value is positive" now that you've defined the code units
appropriately for that to work.

>
> - The NULL character (U+0000) is represented as a single code unit
> whose value is 0
> - The code units representing each digit in the basic character set
> (U+0030 to U+0030) have consecutive values
>
Typo: U+0039
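The requirements on the null character and the digits (U+0030 to U+0039) can be written down as compile-time checks. This is only a sketch, and the helper name `digits_are_consecutive` is made up for illustration.

```cpp
#include <cassert>

// True if each of the ten digit code units is one greater than the
// previous one. The array holds ten digits plus the terminating null.
template <class CharT>
constexpr bool digits_are_consecutive(const CharT (&digits)[11]) {
    for (int i = 1; i < 10; ++i) {
        if (digits[i] != digits[i - 1] + 1) return false;
    }
    return true;
}

// The null character is a single code unit whose value is 0,
// in both the narrow and the wide encoding.
static_assert('\0' == 0 && L'\0' == 0, "null has value 0");

// The digits have consecutive code unit values in both encodings.
static_assert(digits_are_consecutive("0123456789"), "narrow digits");
static_assert(digits_are_consecutive(L"0123456789"), "wide digits");
```

Note that these checks are stated purely in terms of code unit values, which matches the suggestion to phrase the requirement on code units rather than on bytes.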

> -----
>
> I am very aware that this is extremely clunky; I'm a long way from being
> able to write good core wording, and I am sorry for that.
>
I think the wording works for what it does say. Perhaps it doesn't say
everything it needs to (see below re: locale-specific).

> Hopefully you get the idea
>
> Reasoning:
>
>
> - Unfortunately, this paragraph is trying to describe properties of
> the encoded code units rather than the code points. And because I don't
> think we care about the actual code points values anywhere (I'd have to
> double check) it might be better to describe encodings rather than
> character sets (the requirements on the value of digits, NULL and basic
> character set elements apply to the encoded form, not the codepoints - or
> maybe they need to apply to both, I'm not sure). Of course this change would
> require further modifications to the rest of the wording.
>
I agree and had similar thoughts (specifically, ones mentioned in my
in-line replies above). As for the further modification, my outlook on that
is further below.

>
> - An alternative would be to describe separately the encoding and the
> character set. I am not sure this is useful given there is only one
> encoding associated with each character set, so describing the encoding is
> enough to describe the character set - in other words, the existence of an
> execution character encoding admits the existence of an execution character
> set; the reverse is however not the case.
>
The "basic" portions of each are character sets separate from an
associated code value and encoding form. Whether or not we choose to call
them character sets is a matter of convenience for the specification.

>
> - I've replaced in the above wording "locale-specific" with
> "implementation-defined", which I think is more accurate of how the
> encoding is determined by compilers during translation, even if the
> encoding may depend on the system derived from locale during execution.
> This may be a longer discussion though :)
>
Thanks for pointing this out explicitly. I think we have to leave the
"locale-specific" around somewhere.

The additional things that the current wording is probably trying to say
are:
In the execution environment, the library operates using locale-specific
encodings for wide strings and byte strings.
The characters in the basic execution character set shall be represented in
each locale-specific encoding.
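One way to probe the second statement at run time is a round trip through the C library's single-byte conversion functions. This is a hedged sketch: `basic_chars_round_trip` is a made-up helper, the sample string is only part of the basic execution character set, and the check covers whichever locale is current when it is called (the "C" locale unless the program has called setlocale).

```cpp
#include <cassert>
#include <cwchar>   // std::btowc, std::wctob, WEOF

// True if each sampled basic character is a valid single-byte character
// in the current locale's narrow encoding and converts to a wide
// character and back unchanged.
inline bool basic_chars_round_trip() {
    const char sample[] =
        "0123456789"
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        " _{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";
    for (const char* p = sample; *p != '\0'; ++p) {
        const int byte = static_cast<unsigned char>(*p);
        const std::wint_t wide = std::btowc(byte);
        if (wide == WEOF) return false;                // not a single-byte character
        if (std::wctob(wide) != byte) return false;    // does not map back
    }
    return true;
}
```

Calling this after switching locales would exercise the "each locale-specific encoding" part, subject to which locales are installed on the system.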

>
> What do you think?
>
My current impression is that there may be a narrow-enough scope here that
a separate paper could come out of this thread without pulling in the world.

>
>>
>> That said, the idea of basic execution character sets (a "narrow" one and
>> a wide one) for which the characters have (encoding) values somewhat
>> implies but fails to really say certain things that are not true. We are,
>> therefore, indeed better off with shifting the talk of encoding values to
>> the locale-specific execution narrow/wide coded character sets. In other
>> words, I think the homework here is to be better at saying "coded character
>> set" when we want to.
>>
>>
>>>
>>> (As always, I'm interested in having a simple model with no
>>> unnecessary terminology as, as observed these past few months, it has a
>>> tendency to hinder our collective understanding)
>>>
>>> Corentin
>>

Received on 2021-01-30 10:55:00