sg16: Re: [SG16] Is the concept of basic execution character sets useful?

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Wed, 3 Feb 2021 23:32:30 +0100

On 03/02/2021 22.55, Corentin wrote:
>
>
> On Wed, Feb 3, 2021 at 10:03 PM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> > Same code point value.
> > Say your literal encoding is ASCII, the code point value for 'A' is 65, then the execution encoding is such that the code point value of A is also 65.
>
> And that means std::isalpha, for example, will return true?
> Are any other functions affected by that constraint?
> Where did we have that constraint previously?
> Where is the C++20 normative statement for the edited
> footnote in [multibyte.string]?

> I think the footnote only says that NTBS are NTMBS

The footnote previously said that an NTBS only containing
basic execution characters is also an NTMBS.

The footnote as modified by your paper says that any
NTBS only containing single-byte characters is also an NTMBS.

The C++20 wording admits that a character outside of the
basic character set could be a single byte in an NTBS,
but multiple bytes in an NTMBS.
Think of NTBS = ISO 8859-1 and NTMBS = UTF-8, and the character
"Ä" (or any of the French accented characters).
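
To make the "Ä" case concrete, here is a minimal sketch. The function
names are made up for illustration, and the bytes are written as explicit
hex escapes so the result does not depend on the compiler's literal
encoding. It assumes "Ä" is U+00C4, encoded as 0xC4 in ISO 8859-1 and as
0xC3 0x84 in UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// "Ä" (U+00C4) spelled as explicit bytes in two encodings.
const char latin1_a_diaeresis[] = "\xC4";      // ISO 8859-1: one byte
const char utf8_a_diaeresis[]   = "\xC3\x84";  // UTF-8: two bytes

// Hypothetical helpers: byte length of each representation.
inline std::size_t latin1_len() { return std::strlen(latin1_a_diaeresis); }
inline std::size_t utf8_len()   { return std::strlen(utf8_a_diaeresis); }
```

latin1_len() yields 1 and utf8_len() yields 2, so the same character is a
single byte in one string and a multibyte sequence in the other.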

> And does that mean I can't compile a program with an EBCDIC
> compiler (producing EBCDIC literal encoding) and then
> running it in an ASCII environment? Or does that just
> mean certain functions won't work on literals as
> expected, e.g. std::isalpha('a') might not return true?
>
>
> Certain functions will be UB. They already are, in that in your scenario isalpha('a') violates the precondition that 'a' is a character in the encoding of the current locale.

Agreed.

> std::string(runtime_string).find('a') will also return nonsense

Agreed.
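
A minimal sketch of the find() case, under stated assumptions: the literal
encoding is EBCDIC code page 037, where 'a' is 0x81, and the runtime data
arrives in ASCII, where 'a' is 0x61. The helper names are hypothetical, and
all bytes are written as hex escapes so the sketch behaves the same on any
compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption: what an EBCDIC (code page 037) compiler would emit for 'a'.
constexpr char ebcdic_a = '\x81';
// What an ASCII environment uses for 'a'.
constexpr char ascii_a = '\x61';

// Runtime data as an ASCII environment would deliver it: "banana".
inline std::string ascii_runtime_string() {
    return "\x62\x61\x6E\x61\x6E\x61";
}

// Searching with the EBCDIC-encoded value finds nothing.
inline std::size_t find_ebcdic_a() {
    return ascii_runtime_string().find(ebcdic_a);
}

// Searching with the matching ASCII value finds the first 'a' at index 1.
inline std::size_t find_ascii_a() {
    return ascii_runtime_string().find(ascii_a);
}
```

find_ebcdic_a() returns std::string::npos while find_ascii_a() returns 1.
The search itself is well defined; it just compares bytes that mean
different things in the two encodings.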

> That constraint is currently not specified but, during execution, the program does not distinguish literals from runtime data, or ordinary literal encoding from execution encoding.
> There are just strings assumed to be in the execution encoding, and if they aren't, they violate the preconditions of all of these functions.

Then we should state preconditions for these functions, instead of saying that
the execution encoding must fit the literal encoding, always, even when
my program never calls one of the problematic functions.
(Note that we're specifying requirements on C++ implementations, so
presumably we're now saying that an EBCDIC-compiled program must refuse
startup in an ASCII environment. That seems over-the-top.)

Don't take away my footgun.

> > I struggled a bit with the formulation.
> > I'm trying to say that both the execution character set and encoding are "super sets" of the literal ones, but "super set" of encoding does not seem like a good formulation.
>
> Where do we say that in the C++20 wording?
>
>
> We don't. We should. (Unless we are happy with isalpha('a') returning false, puts("a") not displaying a, and string("a").find('a') returning npos!)

Yes, I'm fine with that, if we specify that's going to happen
in case the execution encoding (conveyed via locale) doesn't
match the literal encoding.

> But I also don't see where the standard ever admits currently that the execution encoding as defined in [lex] can ever be different from the one used through the library.

Right, that's why [lex] shouldn't talk about execution character set at all.

> I think for the standard they are currently one and the same, and if we want to split execution encoding from literal encoding there should be a description of how they relate to one another

The point is they don't relate to one another.

Jens

Received on 2021-02-03 16:32:40