sg16: Re: [SG16] Is the concept of basic execution character sets useful?

From: Corentin <corentin.jabot_at_[hidden]>
Date: Sat, 30 Jan 2021 18:25:13 +0100

On Sat, Jan 30, 2021 at 5:54 PM Hubert Tong <
hubert.reinterpretcast_at_[hidden]> wrote:

> On Sat, Jan 30, 2021 at 6:18 AM Corentin <corentin.jabot_at_[hidden]> wrote:
>
>>
>>
>> On Sat, Jan 30, 2021 at 5:39 AM Hubert Tong <
>> hubert.reinterpretcast_at_[hidden]> wrote:
>>
>>> On Wed, Jan 27, 2021 at 3:57 AM Corentin via SG16 <sg16_at_[hidden]>
>>> wrote:
>>>
>>>>
>>>> The main change on top of the C++20 wording would be as follows:
>>>>
>>>> The basic execution character set and the basic execution
>>>> wide-character set shall each contain all the members of the basic source
>>>> character set, plus control characters representing alert, backspace,
>>>> and carriage return, plus a null character (respectively, null wide
>>>> character), whose value is 0. For each basic execution character set,
>>>> the values of the members shall be non-negative and distinct from one
>>>> another. In both the source and execution basic character sets,
>>>>
>>> You missed a "basic" as applied to "execution character set" here.
>>>
>>>
>>>> the value of each character after 0 in the above list of decimal digits
>>>> shall be one greater than the value of the previous. The execution
>>>> character set and the execution wide-character
>>>> set are implementation-defined supersets of the basic execution character
>>>> set and the basic execution wide-character set, respectively. The
>>>> values of the members of the execution character sets and the sets of
>>>> additional members are locale-specific.
>>>>
>>>> Any reason why we should not do this?
>>>>
>>> Because the above does not update [intro.memory] and leaves a dangling
>>> reference to the meaning of "basic execution character set".
>>>
>>
>> Are you talking about 3.35 [defns.multibyte]?
>> > sequence of one or more bytes representing a member of the extended
>> character set of either the source or the execution environment
>> [Note 1: The extended character set is a superset of the basic character
>> set ([lex.charset]). — end note]
>>
>> If so, sorry I missed that, and yes that would need rewriting, good catch,
>> thanks!
>>
> That wasn't the reference I meant, although I guess there is some action
> to be taken with that note too. I doubt that we have any need to talk about
> a character being multibyte in the source code encoding. As for the actual
> term of "multibyte" itself: I think this particular term is not just a
> wording detail, so I am not inclined to rename it.
>

The term multibyte is fine. I'd change the definition to something like
"sequence of one or more code units representing a member of the character
set of either the source or the execution environment", i.e. essentially
just removing "extended".
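
To make "one or more code units" concrete, here is a minimal sketch that
assumes UTF-8 as the encoding (the standard does not require that, and the
helper name utf8_sequence_length is made up for illustration). It only looks
at the bit pattern of the first code unit and does no further validation:

```cpp
#include <cstddef>

// Number of code units in a UTF-8 sequence, judged from its first code unit.
// Returns 0 for a continuation byte or a byte matching no lead-byte pattern.
// Bit-pattern test only: overlong forms and out-of-range values are not rejected.
constexpr std::size_t utf8_sequence_length(unsigned char b) {
    if (b < 0x80) return 1;            // 0xxxxxxx: single code unit
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// 'A' (U+0041) is one code unit; U+00E9 is encoded as C3 A9;
// U+20AC is encoded as E2 82 AC.
static_assert(utf8_sequence_length(0x41) == 1, "ASCII is a single code unit");
static_assert(utf8_sequence_length(0xC3) == 2, "U+00E9 takes two code units");
static_assert(utf8_sequence_length(0xE2) == 3, "U+20AC takes three code units");
```

In all three cases the sequence represents exactly one member of the
character set, which is all the reworded definition needs to say.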

>
> The reference I had meant was [intro.memory] p1:
> A byte is at least large enough to contain any member of the basic
> execution character set [ ... ]
>

Oh, missed that, thanks.

>
>
>>
>>
>>> Also, the above wording is currently meant to say (in part) that the
>>> characters required as members of the basic execution character sets, when
>>> encoded within a "narrow" possibly-multibyte string in any execution coded
>>> character set supported by the implementation, are single bytes whose value
>>> as read via a glvalue of type `char` is positive. The proposal seems to
>>> leave the relevant sentence in a sad state.
>>>
>>
>> I don't think my proposed change (which I meant to be more illustrative)
>> does alter the current meaning significantly. If it does, I am not
>> seeing it.
>>
> The meaning is currently possible to make sense of because the "value" can
> be understood to be that of a single code unit. By instead talking about
> members of the execution (coded) character set in general, the requirement
> upon the "value" becomes abstract in relation to the specification (because
> the method to observe said value is unclear for multibyte characters or in
> encodings of the coded character set that do not map the value in a direct
> manner).
>
>
>> If you are saying that this could benefit from a more extensive rewrite?
>>
>

> I think what we're seeing here is that the status quo "works" due to being
> uniformly fuzzy. This is somewhat analogous to why we needed Davis's name
> lookup rewrite.
>

+1

>
>
>> Because I think I'd agree with that.
>> Maybe listing all the requirements more explicitly?
>>
>> ------
>>
>> The execution character set
>> <http://eel.is/c++draft/lex.charset#def:character_set,basic_execution> and
>> wide execution character set
>> <http://eel.is/c++draft/lex.charset#def:character_set,basic_execution> are
>> implementation-defined character encodings such that:
>>
>

> I really think we need to rename the above terms. In the definition here,
> we're beyond the coded character set level and at the level of an encoding
> form...
>

+1

>
>> - Each code unit is represented by a single char or wchar_t
>> respectively
>> - Each codepoint is represented by one or more code units.
>>
> Has consensus been found that UTF-16 is a valid wide execution character
> set (encoding)? Are there general library facilities to handle conversion
> of strings from wide execution character set (encoding)s with characters
> that are encoded in more than one wchar_t code unit to other encodings?
>

I don't think so, but the status quo does not match existing practice.
I think we could easily fix the core language, but we may have to modify the
library wording, because I think some functions can't deal with wide
multi-byte? Not sure; see https://github.com/sg16-unicode/sg16/issues/9
Paper needed :)
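
As an illustration of the multi-code-unit case, here is a small sketch. It
uses char16_t as a stand-in for a 16-bit wchar_t holding UTF-16 (that
substitution is my assumption, to keep the example portable), and the helper
names are hypothetical:

```cpp
#include <cstddef>

// UTF-16 surrogate ranges: a code point outside the BMP is encoded as a
// high surrogate followed by a low surrogate.
constexpr bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Count code points in a null-terminated UTF-16 string.
// A well-formed surrogate pair counts once; a lone surrogate counts as one.
constexpr std::size_t count_code_points(const char16_t* s) {
    std::size_t n = 0;
    while (*s != 0) {
        s += (is_high_surrogate(s[0]) && is_low_surrogate(s[1])) ? 2 : 1;
        ++n;
    }
    return n;
}

// U+1F600 is outside the BMP, so it occupies two code units (D83D DE00).
static_assert(sizeof(u"\U0001F600") / sizeof(char16_t) - 1 == 2,
              "one code point, two code units");
static_assert(count_code_points(u"a\U0001F600") == 2,
              "three code units, two code points");
```

Any interface that hands over one code unit at a time and expects it to be a
complete character cannot represent the second character of that string,
which is the kind of library limitation I have in mind.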

>
>> - Each member of the basic character set is uniquely represented by a
>> single byte whose value, as read via a glvalue of type `char`, is
>> positive
>>
> I think "basic character set" above isn't just the basic source character
> set. I think "basic execution character set" as a term happens to be the
> right name for what we need (just that the current definition is not what
> we want; we don't want a coded character set, and there shouldn't be a
> "narrow" and wide version). Also: The second half should read "... single
> code unit whose value is positive" now that you've defined the code units
> appropriately for that to work.
>

I meant source here.
What we want is for some characters, a subset of U+0000-U+007F, to always be
1/ representable and 2/ representable in one code unit. The indirection doesn't
serve much purpose (unless I am missing something).
That we are talking about character sets instead of encodings here makes it
very confusing.
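
For what it's worth, the two properties can be checked at compile time for a
given implementation. This is only a sketch: the array lists the 91 graphic
characters of the C++20 basic source character set (the five whitespace
characters are left out for brevity), and the names basic_chars and
all_positive_and_distinct are mine:

```cpp
#include <cstddef>

// Graphic characters of the basic source character set (C++20 [lex.charset]).
constexpr char basic_chars[] =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";

// True if each of the n chars has a positive value when read as `char`
// and no two of them share a value.
constexpr bool all_positive_and_distinct(const char* s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (s[i] <= 0) return false;
        for (std::size_t j = i + 1; j < n; ++j)
            if (s[i] == s[j]) return false;
    }
    return true;
}

static_assert(sizeof(basic_chars) - 1 == 91, "91 graphic characters listed");
static_assert(all_positive_and_distinct(basic_chars, sizeof(basic_chars) - 1),
              "each is one positive, distinct code unit");
```

Each element of the array is a single char, so the check is about code units
in the narrow literal encoding, not about code point values.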

>
>> - The NULL character (U+0000) is represented as a single code unit
>> whose value is 0
>> - The code units representing each digit in the basic character set
>> (U+0030 to U+0030) have consecutive values
>>
> Typo: U+0039
>

indeed!
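
The last two bullets (null is a single code unit of value 0, and the digits
U+0030 to U+0039 have consecutive values) are what code like the following
relies on. A minimal sketch; digits_are_consecutive and digit_value are
illustrative names, not anything from the standard:

```cpp
// True if the ten code units in d have consecutive, increasing values.
template <class CharT>
constexpr bool digits_are_consecutive(const CharT* d) {
    for (int i = 1; i < 10; ++i)
        if (d[i] != d[i - 1] + 1) return false;
    return true;
}

// The null character is a single code unit whose value is 0,
// in both the narrow and the wide encoding.
static_assert('\0' == 0, "narrow null is 0");
static_assert(L'\0' == 0, "wide null is 0");

// The code units for '0'..'9' are consecutive.
static_assert(digits_are_consecutive("0123456789"), "narrow digits");
static_assert(digits_are_consecutive(L"0123456789"), "wide digits");

// The classic idiom that depends on the guarantee above.
constexpr int digit_value(char c) { return c - '0'; }
static_assert(digit_value('7') == 7, "digit to integer");
```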

>
>
>> -----
>>
>> I am very aware that this is extremely clunky, I'm a long way from being
>> able to write good core wording, and I am sorry for that.
>>
> I think the wording works for what it does say. Perhaps it doesn't say
> everything it needs to (see below re: locale-specific).
>
>
>> Hopefully you get the idea
>>
>> Reasoning:
>>
>>
>> - Unfortunately, this paragraph is trying to describe properties of
>> the encoded code units rather than the code points. And because I don't
>> think we care about the actual code points values anywhere (I'd have to
>> double check) it might be better to describe encodings rather than
>> character sets (The requirements on the value of digits, NULL and basic
>> character sets elements apply to the encoded form, not the codepoints - or
>> maybe it needs to apply to both, I'm not sure). Of course this change would
>> require further modifications to the rest of the wording.
>>
> I agree and had similar thoughts (specifically, ones mentioned in my
> in-line replies above). As for the further modification, my outlook on that
> is further below.
>
>>
>> - An alternative would be to describe separately the encoding and the
>> character set. I am not sure this is useful given there is only one
>> encoding associated with each character set, so describing the encoding is
>> enough to describe the character set - in other words, the existence of an
>> execution character encoding admits the existence of an execution character
>> set; the reverse is however not the case.
>>
> The "basic" portions of each are character sets separate from an
> associated code value and encoding form. Whether or not we choose to call
> them character sets is a matter of convenience for the specification.
>
>>
>> - In the above wording, I've replaced "locale-specific" with
>> "implementation-defined", which I think more accurately describes how the
>> encoding is determined by compilers during translation, even if the
>> encoding during execution may be derived from the system locale.
>> This may be a longer discussion though :)
>>
> Thanks for pointing this out explicitly. I think we have to leave the
> "locale-specific" around somewhere.
>
> The additional things that the current wording is probably trying to say
> are:
> In the execution environment, the library operates using locale-specific
> encodings for wide strings and byte strings.
> The characters in the basic execution character set shall be represented
> in each locale-specific encoding.
>

I think we want to say (to match existing practice) that the execution
environment has an encoding / character set that is either the same as or a
superset of the execution character set (same values but may have extra
members).
It is unclear that "locale-specific" currently says that.

>
>
>>
>> What do you think ?
>>
> My current impression is that there may be a narrow-enough scope here that
> a separate paper could come out of this thread without pulling in the world.
>

Agreed.
We may want to leave the locale-specific part out of this paper to contain
the scope.
We may have to resolve the wchar_t part first though.

>
>
>>
>>
>>
>>
>>
>>
>>>
>>> That said, the idea of basic execution character sets (a "narrow" one
>>> and a wide one) for which the characters have (encoding) values somewhat
>>> implies but fails to really say certain things that are not true. We are,
>>> therefore, indeed better off with shifting the talk of encoding values to
>>> the locale-specific execution narrow/wide coded character sets. In other
>>> words, I think the homework here is to be better at saying "coded character
>>> set" when we want to.
>>>
>>>
>>>>
>>>> (As always, I'm interested in having a simple model with no
>>>> unnecessary terminology as, as observed these past few months, it has a
>>>> tendency to hinder our collective understanding)
>>>>
>>>> Corentin
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>

Received on 2021-01-30 11:25:28