Re: [SG16] Towards a better description of the execution encoding

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Mon, 1 Mar 2021 17:16:47 -0500
On Mon, Mar 1, 2021 at 11:27 AM Steve Downey via SG16 <sg16_at_[hidden]> wrote:

> Could we perhaps make use of the encoding used by the "C" locale to talk
> about how the "encoding of the execution character set" is meant to be
> interpreted? http://eel.is/c++draft/lex.ccon#2
> Execution encoding isn't currently used in the standard as that exact
> phrase, although lex.ccon does come close, as does
> http://eel.is/c++draft/tab:lex.string.literal
The sentence above is using "encoding of the execution character set" for
its position in the status quo of the working draft, right? That is, we
should read it as saying that the "literal encoding" can be taken as the
locale-specific encoding used in the C locale. In practice, that's not true
(e.g., literals encoded as UTF-8 on systems with a C locale using
US-ASCII). What is probably true is that the encoding difference is not
observable if only characters from the basic execution character set are
used. I think it is safe to say that a great many scripts/programs (ab)use
text processing facilities by relying on the property that "C" locales
basically treat bytes as characters.
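
To make the byte-versus-character point concrete, here is a minimal sketch
built on the 0x5C example quoted further down in this thread. All names
(Variant, decode_unit, same_text) are hypothetical, the table is deliberately
partial (only 0x5C, plus 0x7E for Shift-JIS, are modeled), and input is
assumed to be single-byte code units below 0x80.

```cpp
#include <string>
#include <string_view>

// Hypothetical, partial model of three encodings that agree on most 7-bit
// code units but not all of them.
enum class Variant { Ascii, Iso646It, ShiftJis };

// Decode one code unit to UTF-8 text under the given variant.
// Assumption: only 0x5C (and 0x7E for Shift-JIS) differ in this sketch;
// every other byte is passed through unchanged.
std::string decode_unit(unsigned char byte, Variant v) {
    if (byte == 0x5C && v == Variant::Iso646It) return "\xC3\xA7";     // U+00E7
    if (byte == 0x5C && v == Variant::ShiftJis) return "\xC2\xA5";     // U+00A5
    if (byte == 0x7E && v == Variant::ShiftJis) return "\xE2\x80\xBE"; // U+203E
    return std::string(1, static_cast<char>(byte));
}

// True when the byte sequence denotes the same characters under both
// variants, i.e. the difference in encoding is not observable for it.
bool same_text(std::string_view bytes, Variant a, Variant b) {
    for (unsigned char c : bytes) {
        if (decode_unit(c, a) != decode_unit(c, b)) return false;
    }
    return true;
}
```

Under these assumptions, same_text("ABC", ...) holds for every pair of
variants, while same_text("C:\\", Variant::Ascii, Variant::Iso646It) does
not, because the byte 0x5C names a different character in each.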

> Incidentally, http://eel.is/c++draft/fs.req.general#4
> [Note 1: Use of an encoded character type implies an associated character
> set and encoding. Since signed char and unsigned char have no implied
> character set and encoding, they are not included as permitted types.
> — end note]
> is contradicted by lex.ccon.
> On Mon, Mar 1, 2021 at 10:24 AM Corentin via SG16 <sg16_at_[hidden]>
> wrote:
>> Hey folks!
>> Last meeting we talked about the relation between the literal & execution
>> encoding.
>> I think there is pressure to solve this issue (encoding names,
>> std::print, other features).
>> In P2297, I suggested that we say the execution character set is a
>> superset of the literal character set, such that any character in the
>> literal character set results in the same code unit sequence
>> whether it is encoded in the literal encoding or execution encoding.
>> Hubert was concerned this was too restrictive because some EBCDIC and
>> ISO 646 variants have code points reserved for "national symbols".
>> Even Shift-JIS is not 100% ASCII-compatible (Yen instead of backslash,
>> overline instead of tilde).
>> I've been thinking about that over the past few days, I think the
>> solution is to not have requirements on the literal character set but
>> rather on the literals themselves.
>> If the execution encoding is UTF-8, "ABC" is interpreted identically
>> whether its encoding is ASCII, ISO 646-IT, or Shift-JIS.
>> However, "C:\\" would be interpreted as "C:\\", "C:ç" and "C:¥"
>> respectively.
>> So we need to only put requirements on the content of individual literals
>> rather than on the entire literal set (which, P1885 notwithstanding, is
>> not observable during execution anyhow).
>> *A way to word that:*
>> The execution encoding is the locale-specific encoding used to interpret
>> character and NTMBS parameters in character functions, multibyte character
>> functions and other locale-specific functions.
>> If character literals and string literals used as arguments to character
>> functions and locale-specific functions do not represent the same sequence
>> of abstract characters whether they are interpreted with the literal
>> encoding or the execution encoding, the behavior is undefined.
>> I hope that this resolves Hubert's concerns and that we can refine the
>> general idea and put that in a paper :)
>> Have a great week,
>> Corentin
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2021-03-01 16:17:17