sg16: [SG16] Reading the tea leaves: What is the execution encoding in C?

From: Corentin <corentin.jabot_at_[hidden]>
Date: Tue, 23 Feb 2021 10:42:43 +0100

Hello,
I have a question for C experts as to the intended meaning of "execution
encoding".

During translation,

Each source character set member and escape sequence in character constants
and string literals is converted to the corresponding member of the
execution character set; if there is no corresponding member, it is
converted to an implementation-defined member other than the null (wide)
character

5.2.1
Two sets of characters and their associated collating sequences shall be
defined: the set in which source files are written (the
source character set), and the set interpreted in the execution environment
(the execution character set). Each set is further divided into a basic
character set, whose contents are given by this subclause, and a set of
zero or more locale-specific members (which are not members of the basic
character set) called extended characters. The combined set is also called
the extended character set. The values of the members of the execution
character set are implementation-defined.

During execution,

5.2.2 Alphabetic escape sequences representing nongraphic characters in the
execution character set are intended to produce actions on display devices
as follows

The wording of the ctype.h functions uses the term "character" without
specifying what the associated encoding is presumed to be.
------

C++ has the same lack of clarity.
As such, C++ will hopefully shift to "literal character set"/"literal
character encoding" to describe the encoding of string & character
literals.

The question then is what the intended behavior of, for example,
"isalpha('a')" is if the literal and execution encodings differ (say one is
ASCII, the other EBCDIC).

Is the intent that:

   - C assumes 'a' is a character in the environment execution encoding,
   and presumably it's UB if it isn't
   - C is perfectly happy saying that isalpha('a') is false
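
To make the second reading concrete, here is a minimal sketch, assuming a
compiler whose literal encoding is ASCII and an execution environment that
classifies bytes as EBCDIC code page 037. The names is_alpha_ebcdic037 and
ASCII_LITERAL_A are hypothetical and only for illustration; they are not
part of the C library.

```c
#include <stdbool.h>

/* Hypothetical helper (not a standard function): classifies a byte as a
   letter using the letter ranges of EBCDIC code page 037. */
bool is_alpha_ebcdic037(unsigned char c)
{
    return (c >= 0x81 && c <= 0x89) || /* a-i */
           (c >= 0x91 && c <= 0x99) || /* j-r */
           (c >= 0xA2 && c <= 0xA9) || /* s-z */
           (c >= 0xC1 && c <= 0xC9) || /* A-I */
           (c >= 0xD1 && c <= 0xD9) || /* J-R */
           (c >= 0xE2 && c <= 0xE9);   /* S-Z */
}

/* Assumption: the value an ASCII literal encoding stores for 'a'. */
enum { ASCII_LITERAL_A = 0x61 };
```

Under these assumptions is_alpha_ebcdic037(ASCII_LITERAL_A) is false,
because 0x61 is '/' in code page 037, while 0x81 (the EBCDIC 'a') is
classified as a letter.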

I would have different questions for characters outside of the basic
character sets,
say isalpha('é') assuming ISO 8859-1 literal encoding (Latin Small Letter E
with Acute, in case the mailing list butchers the text, the irony of which
is delightful).
What about putc('\\') if one encoding is ASCII and the other Shift-JIS?
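
The extended-character case can be sketched as follows, assuming an
ASCII-compatible platform and an ISO 8859-1 literal encoding, where the
literal for Latin Small Letter E with Acute is the single byte 0xE9. The
helper name latin1_e_acute_is_alpha is hypothetical. Locale names other
than "C" are implementation-defined, so the helper reports -1 when the
requested locale is not available.

```c
#include <ctype.h>
#include <locale.h>

/* Hypothetical helper: classifies the byte that ISO 8859-1 uses for
   e-acute (0xE9) under the named locale, then switches LC_CTYPE back to
   the "C" locale. Returns 1 or 0, or -1 if the locale is unavailable. */
int latin1_e_acute_is_alpha(const char *locale_name)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    char literal = '\xE9'; /* negative if plain char is signed */

    /* The cast matters: passing a negative value other than EOF to
       isalpha is undefined behavior. */
    int result = isalpha((unsigned char)literal) != 0;

    setlocale(LC_CTYPE, "C");
    return result;
}
```

In the "C" locale the result is 0, since only the basic letters are
alphabetic there. The same byte, fixed at translation time, may classify
as a letter once a Latin-1 locale is selected at run time.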

In other words, is the intent that there exists a relation between the
literal and execution encodings (the latter of which may be affected by
locale)?
Is the "execution encoding" the encoding assumed by locale.h/stdlib.h
functions?
I don't think that is explicitly stated either; the wording mentions these
functions accepting "character"s without stating the presumed encoding of
these characters.

There are the following definitions

multibyte character
sequence of one or more bytes representing a member of the extended
character set of either the source or the execution environment

wide character
value representable by an object of type wchar_t, capable of representing
any character in the current locale

But it is unclear whether they apply to the language or the library. And of
course, ctype.h functions do not accept multibyte characters!
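
For contrast, here is a minimal sketch of how multibyte text is usually
classified: each sequence is first decoded to a wide character in the
encoding of the current locale, and only then classified with the
wctype.h functions. The helper name count_alpha_multibyte is hypothetical.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: counts the alphabetic characters in a multibyte
   string, decoding it with the encoding of the current LC_CTYPE locale.
   Returns (size_t)-1 on an invalid or truncated sequence. */
size_t count_alpha_multibyte(const char *s)
{
    mbstate_t state;
    memset(&state, 0, sizeof state); /* initial conversion state */

    size_t remaining = strlen(s);
    size_t count = 0;

    while (remaining > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, remaining, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;       /* invalid or incomplete sequence */
        if (n == 0)
            break;                   /* decoded a null character */

        if (iswalpha((wint_t)wc))
            ++count;

        s += n;
        remaining -= n;
    }
    return count;
}
```

In the default "C" locale, count_alpha_multibyte("ab1") yields 2. With
extended characters the result depends on the locale in effect at run
time, not on the encoding the compiler used for the literal.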

As a user, I would expect a precondition that the environment
execution encoding is a superset of the literal encoding, but it is
unclear to me whether that's stated or intended.

I really hope you can shed light on the original intent and history! Thanks
Corentin

Received on 2021-02-23 03:42:56