Re: What does "Execution Character Set" refer to these days?

From: Steve Downey <sdowney_at_[hidden]>
Date: Thu, 28 Apr 2022 21:01:13 -0400
On Thu, Apr 28, 2022 at 4:57 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 28/04/2022 21.04, Steve Downey via SG16 wrote:
> > There are 5 remaining mentions of Execution Character Set in the draft:
>
> Did you check the C standard and the indirect impact it has
> on C++, given that we inherit quite a few library functions
> from C?
>
> > [lex.charset] 6
> > A literal encoding or a locale-specific encoding of one of the
> > execution character sets ([character.seq]) encodes each element of the
> > basic literal character set as a single code unit with non-negative
> > value, distinct from the code unit for any other such element.
> >
> > Is it the set of all possible characters in any encoding that is
> > supported?
>
> No, it's "a[n] encoding", not several.
>
> > While it's probably worthwhile not to break someone's existing C++
> > reference material, I'm not sure we have a crisp and clean definition
> > here, nor am I sure that multibyte character should be tied to it?
> >
> > I don't have any concrete suggestions here, but I was trying to help
> > someone else understand the new model, and they had questions that were
> > harder to answer than I expected.
>
> "Execution character set" is essentially what comes from your
> current POSIX locale setting at runtime.
>

The execution character set being the character set encoded according to
the currently selected locale's LC_CTYPE category/facet is, I think, a
useful definition. It's a runtime property, which explains why changing
the locale can be a problem for encoded literals: encoding one way and
decoding another is at best unspecified. The rules for NTMBSs suggest
that they might become invalid under some interpretations if they don't
unshift out of a shift state. (That's terrible for all sorts of reasons,
but I'd rather focus on Unicode than fix Shift-JIS.)
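
To make the "runtime property" point concrete, here is a minimal sketch
that counts the multibyte characters in a byte string according to
whatever LC_CTYPE is in effect when it is called. The helper name
count_mb_characters is made up for illustration; std::setlocale and
std::mbrlen are the standard library calls it relies on.

```cpp
#include <clocale>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical helper: counts the multibyte characters in `bytes` as
// interpreted by the LC_CTYPE category of the current C locale.
// Returns -1 if the bytes are not a complete, valid sequence in that
// locale's encoding.
long count_mb_characters(const char* bytes) {
    std::mbstate_t state = std::mbstate_t();  // initial conversion state
    std::size_t remaining = std::strlen(bytes);
    long count = 0;
    while (remaining > 0) {
        const std::size_t n = std::mbrlen(bytes, remaining, &state);
        if (n == static_cast<std::size_t>(-1) ||
            n == static_cast<std::size_t>(-2)) {
            return -1;  // invalid or truncated in the current encoding
        }
        if (n == 0) {
            break;  // null character; not expected inside strlen's range
        }
        bytes += n;
        remaining -= n;
        ++count;
    }
    return count;
}
```

The same bytes can give different answers depending on the locale
selected at the time of the call. For example, after
std::setlocale(LC_CTYPE, "C") the string "abc" counts as three
characters, while the two bytes "\xC3\xA9" (a UTF-8 encoded literal)
would be expected to count as one character only if a UTF-8 locale is
selected. Which locale names are available (for instance "en_US.UTF-8")
is system-dependent and is an assumption here, not something the
standard guarantees.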

Should "multibyte character" be coupled to that definition? I suspect we
deal with multibyte characters outside the default locale, although I think
the only mb functions C has are locale-aware. This isn't critical.

I'm also fine with the locale having to encode all of the basic literal
character set, because the alternative is broken: if your system somehow
lets you select a locale that doesn't, it's not C++'s fault that things
break. As we discussed with the portable character set, POSIX requires
this already.
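
As a rough illustration of the [lex.charset] requirement quoted above,
the sketch below takes the ordinary-literal code unit of each basic
character and checks that it is non-negative, distinct from the others,
and accepted by the current locale as a complete one-byte character.
The function name is hypothetical, and the character list is my
transcription of the basic character set plus the alert, backspace and
carriage return controls; the null character is skipped because
std::mbrlen reports it as 0. This checks only "single code unit", not
that the locale maps each unit to the same abstract character.

```cpp
#include <clocale>
#include <cstddef>
#include <cwchar>

// Hypothetical check: does the current LC_CTYPE treat every basic
// character (as encoded in ordinary literals) as a single, distinct,
// non-negative code unit?
bool locale_encodes_basic_characters() {
    static const char basic[] =
        "\t\v\f\n "
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'"
        "\a\b\r";
    bool seen[256] = {};
    // sizeof includes the terminating null, which is deliberately skipped.
    for (std::size_t i = 0; i + 1 < sizeof basic; ++i) {
        const char c = basic[i];
        if (c < 0) {
            return false;  // code unit must have a non-negative value
        }
        const unsigned char u = static_cast<unsigned char>(c);
        if (seen[u]) {
            return false;  // code units must be distinct
        }
        seen[u] = true;
        std::mbstate_t state = std::mbstate_t();
        if (std::mbrlen(&c, 1, &state) != 1) {
            return false;  // not a complete one-byte character here
        }
    }
    return true;
}
```

Called right after std::setlocale(LC_ALL, "C"), this should return
true; the interesting case is calling it after switching to whatever
other locales a given system provides.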

Received on 2022-04-29 01:01:26