sg16: Re: [SG16] Is the concept of basic execution character sets useful?

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Sat, 30 Jan 2021 18:02:48 +0100

On Sat, Jan 30, 2021 at 5:38 PM Jean-Marc Bourguet via SG16 <
sg16_at_[hidden]> wrote:

> Hi all,
> On 30/01/2021 at 12:18, Corentin via SG16 wrote:
>
>
>
> On Sat, Jan 30, 2021 at 5:39 AM Hubert Tong <
> hubert.reinterpretcast_at_[hidden]> wrote:
>
>> On Wed, Jan 27, 2021 at 3:57 AM Corentin via SG16 <sg16_at_[hidden]>
>> wrote:
>>
>>> Hello,
>>>
>>> Very quick reminder, using C++20 terminology
>>> We have:
>>>
>>> - basic source character set, which, while of limited use in the core
>>> language, is used quite a bit in the library as a proxy for "displayable
>>> characters available in all encodings", whose removal would then be
>>> slightly more involved.
>>>
>>> - The execution character set(s) which describe actual character sets
>>> used during evaluation and are therefore necessary.
>>>
>>> - The basic execution character set, which is a super set of the basic
>>> source character set
>>> and a subset of all execution character sets.
>>>
>>> It's strictly basic source character set + alert + backspace +
>>> carriage return + NULL
>>>
>>> Nowhere is it used in the library.
>>> It is not used in the core language either, except of course that we
>>> need to prescribe that NULL is encoded as 0 and that digits are encoded
>>> sequentially.
>>>
>>> While alert + backspace + carriage return are mentioned in escape
>>> sequences, if a theoretical encoding would miss these characters, there
>>> would be no further ill-effect on the behavior of the standard.
>>>
>>> The main change on top of the C++20 wording would be as follows:
>>>
>>> The basic execution character set and the basic execution
>>> wide-character set shall each contain all the members of the basic source
>>> character set, plus control characters representing alert, backspace,
>>> and carriage return, plus a null character (respectively, null wide
>>> character), whose value is 0. For each basic execution character set,
>>> the values of the members shall be non-negative and distinct from one
>>> another. In both the source and execution basic character sets,
>>>
>> You missed a "basic" as applied to "execution character set" here.
>>
>>
>>> the value of each character after 0 in the above list of decimal digits
>>> shall be one greater than the value of the previous. The execution
>>> character set and the execution wide-character
>>> set are implementation-defined supersets of the basic execution character
>>> set and the basic execution wide-character set, respectively. The
>>> values of the members of the execution character sets and the sets of
>>> additional members are locale-specific.
>>>
>>> Any reason why we should not do this?
>>>
>> Because the above does not update [intro.memory] and leaves a dangling
>> reference to the meaning of "basic execution character set".
>>
>
> Are you talking about 3.35 [defns.multibyte]?
> > sequence of one or more bytes representing a member of the extended
> character set of either the source or the execution environment
> [Note 1: The extended character set is a superset of the basic character
> set ([lex.charset]). — end note]
>
> If so, sorry I missed that, and yes, that would need rewriting. Good catch,
> thanks!
>
>
>> Also, the above wording is currently meant to say (in part) that the
>> characters required as members of the basic execution character sets, when
>> encoded within a "narrow" possibly-multibyte string in any execution coded
>> character set supported by the implementation, are single bytes whose value
>> as read via a glvalue of type `char` is positive. The proposal seems to
>> leave the relevant sentence in a sad state.
>>
>
> I don't think my proposed change (which I meant to be more illustrative)
> does alter the current meaning significantly. If it does, I am not
> seeing it.
> Are you saying that this could benefit from a more extensive rewrite?
> If so, I think I'd agree with that.
> Maybe listing all the requirements more explicitly?
>
> ------
>
> The execution character set
> <http://eel.is/c++draft/lex.charset#def:character_set,basic_execution> and
> wide execution character set
> <http://eel.is/c++draft/lex.charset#def:character_set,basic_execution> are
> implementation-defined character encodings such that:
>
> Years ago -- a decade or so -- I convinced myself that using narrow
> character set and wide character set was misleading, as my understanding was
> that the intent was to have one character set per locale with a narrow
> encoding (potentially multi-byte, potentially stateful) and a wide encoding
> where each code point was represented by one code unit. If we are
> reformulating that area, it may be worthwhile to state what is desired here
> in this respect (either my interpretation or the current intent at the time
> of the re-formulation).
>
> And obviously there is a third encoding which is almost never mentioned:
> the one used for IO. I think the whole specification was intended to allow
> an external representation which respects the constraints of neither a
> narrow nor a wide encoding (think about UTF-16: the byte zero may
> appear when no NUL is intended, yet it is a multi-code unit encoding so not
> a wide encoding).
>
>
> - Each code unit is represented by a single char or wchar_t
> respectively
> - Each codepoint is represented by one or more code units.
>
> One code unit for the wide character set.
>

This is a bug in the current specification, which does not match existing
practice - see https://github.com/sg16-unicode/sg16/issues/9
You are right, though, that we should be careful about the ramifications of
fixing it. Thanks.
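
To make the mismatch concrete, here is a minimal sketch. It assumes the
common case where wide literals are encoded as UTF-16 when wchar_t is 16
bits and as UTF-32 when it is 32 bits; the helper name wide_units is made
up for this illustration and comes from neither the standard nor the issue.

```cpp
#include <cstddef>

// Number of wchar_t code units in a wide string literal, excluding the
// terminating null wide character. (Hypothetical helper for illustration.)
template <std::size_t N>
constexpr std::size_t wide_units(const wchar_t (&)[N]) {
    return N - 1;
}

// U+1F600 lies outside the Basic Multilingual Plane.
constexpr std::size_t units = wide_units(L"\U0001F600");

// Assumption: UTF-16 wide literals for a 16-bit wchar_t (surrogate pair,
// two code units), UTF-32 for a 32-bit wchar_t (one code unit).
static_assert(units == (sizeof(wchar_t) == 2 ? 2 : 1),
              "one code point is not always one wchar_t");
```

With a 16-bit wchar_t the single code point takes two code units, which is
exactly the case the "one code unit per code point" wording does not cover.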

>
> - Each member of the basic character set is uniquely represented by a
> single byte whose value, as read via a glvalue of type `char`, is
> positive
>
> In the initial shift state. I don't remember there being any constraints on
> what happens in other shift states, except that a single byte of value 0
> is the null character independent of the shift state.
>
>
> - The NULL character (U+0000) is represented as a single code unit
> whose value is 0
> - The code units representing each digit in the basic character set
> (U+0030 to U+0039) have consecutive values
>
> -----
>
> I am very aware that this is extremely clunky, I'm a long way from being
> able to write good core wording, and I am sorry for that.
> Hopefully you get the idea
>
> Reasoning:
>
>
> - Unfortunately, this paragraph is trying to describe properties of
> the encoded code units rather than the code points. And because I don't
> think we care about the actual code points values anywhere (I'd have to
> double check) it might be better to describe encodings rather than
> character sets (The requirements on the value of digits, NULL and basic
> character sets elements apply to the encoded form, not the codepoints - or
> maybe it needs to apply to both, I'm not sure). Of course this change would
> require further modifications to the rest of the wording.
>
> ISO 2022 (AFAIK equivalent to ECMA-35, which is a more easily obtainable
> document) was pertinent at the time of design (I don't know if it still
> is) and doesn't really assign a code point to each character. That may have
> had an influence.
>

For most non-Unicode encodings, there is no distinction between code unit
sequences and code points, because most non-Unicode character sets are
represented by a single encoding.
The notion of code points is useful when a given character set is
represented by more than one encoding (UTF-8, UTF-16, UTF-32, etc. for
Unicode).
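
As a small sketch of that distinction, the same Unicode code points can be
counted in code units under each of the three encodings. The helper name
code_units is made up for this illustration.

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminator.
// Works for char, char8_t, char16_t and char32_t literals alike.
// (Hypothetical helper for illustration.)
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// One code point, U+20AC EURO SIGN, three different encodings.
static_assert(code_units(u8"\u20AC") == 3, "UTF-8: three code units");
static_assert(code_units(u"\u20AC") == 1, "UTF-16: one code unit");
static_assert(code_units(U"\u20AC") == 1, "UTF-32: one code unit");

// A code point outside the BMP needs a surrogate pair in UTF-16.
static_assert(code_units(u"\U0001F600") == 2, "UTF-16: two code units");
static_assert(code_units(U"\U0001F600") == 1, "UTF-32: one code unit");
```

For a character set with a single encoding there is nothing comparable to
check, since the code unit sequence and the code point coincide.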

>
> - An alternative would be to describe separately the encoding and the
> character set. I am not sure this is useful given there is only one
> encoding associated with each character set, so describing the encoding is
> enough to describe the character set - in other words, the existence of an
> execution character encoding admits the existence of an execution character
> set; the reverse is however not the case.
> - I've replaced in the above wording "locale-specific" by
> "implementation-defined", which I think more accurately reflects how the
> encoding is determined by compilers during translation, even if the
> encoding may be derived from the system locale during execution.
> This may be a longer discussion though :)
>
>
> What do you think ?
>
> Library functions may have a different result for a given value if you
> switch locale (on Linux see the result of wcrtomb for 0x20AC in
> fr_FR.ISO-8859-1 and fr_FR.ISO-8859-9_at_[hidden] for instance). Giving the
> impression that there is no such dependency on the locale seems more
> misleading than clarifying.
>

Some longer rationale for that is in wg21.link/P2020 and wg21.link/P1885.
The relation between the encoding chosen by the compiler and the one
inferred by library functions / the environment needs further clarification.
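
The runtime side of this can be observed with a sketch along these lines.
The function name narrow_in_locale is made up for this illustration, and
the result for any named locale depends on whether it is installed on the
system.

```cpp
#include <cassert>
#include <climits>
#include <clocale>
#include <cstddef>
#include <cwchar>
#include <string>

// Encode one wide character using the narrow encoding of the named locale.
// Returns an empty string if the locale is unavailable or if the character
// has no representation in it. (Hypothetical helper for illustration.)
std::string narrow_in_locale(const char* locale_name, wchar_t wc) {
    assert(locale_name != nullptr);

    // Copy the current locale name before changing it, since the pointer
    // returned by setlocale may be invalidated by later calls.
    const char* current = std::setlocale(LC_ALL, nullptr);
    const std::string saved = current ? current : "C";

    std::string out;
    if (std::setlocale(LC_ALL, locale_name) != nullptr) {
        char buf[MB_LEN_MAX];
        std::mbstate_t state{};  // initial shift state
        const std::size_t n = std::wcrtomb(buf, wc, &state);
        if (n != static_cast<std::size_t>(-1)) {
            out.assign(buf, n);
        }
    }

    std::setlocale(LC_ALL, saved.c_str());  // restore the previous locale
    return out;
}
```

Calling narrow_in_locale with static_cast<wchar_t>(0x20AC) under a UTF-8
locale and under an ISO-8859-1 locale (for example "en_US.UTF-8" and
"fr_FR.ISO-8859-1", both assumed to be installed) should give different
results on Linux: three bytes in the first case, and a conversion failure
in the second, since ISO-8859-1 has no euro sign. The encoding picked by
the compiler for literals plays no part in that result.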

>
> Obviously the fact that Linux is using a Unicode encoding as the wide
> encoding for all character sets, and that Windows and IBM are using a
> 16-bit wchar_t, puts some stress on the intended model as I understand it.
>
> Yours,
>
> -- Jean-Marc

Received on 2021-01-30 11:03:02