In my opinion, the primary purpose of defining the literal and execution encodings is to be able to explain that having them differ is broken unless the execution encoding is a superset of the literal encoding, the way UTF-8 is a superset of 7-bit ASCII. This clarifies that.
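To make the superset point concrete, here is a minimal sketch (my own illustration, not proposed wording). It assumes the usual ASCII-compatible encodings such as UTF-8, Latin-1, and Windows-1252, in which bytes below 0x80 mean the same thing. The helper name is_ascii_only is hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true if every byte is in the 7-bit ASCII range.
// Such a string reads the same in any encoding that is a superset of ASCII.
bool is_ascii_only(const std::string& s) {
    for (unsigned char c : s) {
        if (c >= 0x80) {
            return false;  // this byte's meaning depends on the encoding
        }
    }
    return true;
}

// Small usage check, spelled with explicit bytes so that the source file's
// own encoding does not matter.
void check_ascii_only() {
    assert(is_ascii_only("hello, world"));
    // U+00E9 written as its two UTF-8 bytes: no longer encoding-independent.
    assert(!is_ascii_only("caf\xC3\xA9"));
}
```

A literal that passes this check survives a literal/execution mismatch among those encodings; anything else depends on the two encodings agreeing.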

I also think that it's not UB to get this wrong, as it's utterly deterministic and portable broken behavior. Or possibly not even broken: I might want to express the sequence "â€™" for reasons of my own, like talking about mojibake. I don't think there are any _new_ bits of undefined behavior in the parts of the standard library that process characters; this is no different from taking data from /dev/random.

mod final wordsmithing, SA.

On Thu, Oct 24, 2024 at 3:58 AM Corentin via SG16 <sg16@lists.isocpp.org> wrote:

Hey folks.
Going by recent discussions, we keep spending a lot of time talking about literal and execution encoding in the standard library, which I don't think is particularly useful because, in the general case, things don't work if these encodings are, in fact, unrelated.

I think that it would be nice to clarify the expectation put on the standard library and its usage so that we can eventually move past that point of contention.

I think we should add words to [character.seq.general].

The execution character set and the execution wide-character set are supersets of the basic literal character set ([lex.charset]). The encodings of the execution character sets (termed execution encoding and wide execution encoding respectively) and the sets of additional elements (if any) are locale-specific. Each element of the execution wide-character set is encoded as a single code unit representable by a value of type wchar_t.

[Note 1: The encodings of the execution character sets can be unrelated to any literal encoding. — end note]

[Note 1: If any element of the literal character set does not have the same (or any) representation in the execution encoding as it does in the literal encoding, passing a sequence of characters encoded in the literal encoding to a standard library function expecting an argument in the execution encoding can produce unexpected effects or result in undefined behavior.
Similarly, library functions that expect their arguments in the literal encoding can produce unexpected effects or result in undefined behavior when passed character sequences in the execution encoding that are not valid in the literal encoding.]

[Note 2: Sequences of characters are never assumed to be in the execution or wide execution encodings during constant evaluation.]

I proposed something similar a couple of years ago and I think there was no appetite for it, but I think we ought to try again.
There are two things that can happen in practice:
- Both character sets have different mappings but generally the same scheme, in which case you get mojibake, which cannot be diagnosed.
- What is a valid sequence in one encoding may not be a valid sequence in another encoding, in which case you could run into UB in algorithms that do not check the validity of input sequences (of which we have none today, afaik), and you get a runtime error in other cases.

We most likely want to massage that wording. In particular, it's unclear that we want to say UB, as people are scared of that, but I think it's the right tool, as it is in effect a precondition violation of functions taking text as arguments
(and we can't diagnose anything, so erroneous behavior seems impractical).

Robin, is there a standard somewhere that defines "Mojibake" or an equivalent term that we could use?

Cheers