[isocpp-sg16] Clarifying of text encodings work in the standard library.

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 24 Oct 2024 09:57:44 +0200
Hey folks.
Going by recent discussions, we keep spending a lot of time talking about
literal and execution encoding in the standard library, which I don't think
is particularly useful as, in the general case, things don't work if these
unrelated encodings are, in fact, unrelated.

I think that it would be nice to clarify the expectations placed on the
standard library and its usage so that we can eventually move past that
point of contention.

I think we should add words to [character.seq.general].

The execution character set
<https://eel.is/c++draft/library#def:character_set,execution> and the execution
wide-character set
<https://eel.is/c++draft/library#def:wide-character_set,execution> are
supersets of the basic literal character set ([lex.charset]
<https://eel.is/c++draft/lex.charset>).
<https://eel.is/c++draft/library#character.seq.general-1.2.sentence-1> The
encodings of the execution character sets (termed execution encoding and
wide execution encoding respectively) and the sets of additional elements
(if any) are locale-specific.
<https://eel.is/c++draft/library#character.seq.general-1.2.sentence-2> Each
element of the execution wide-character set is encoded as a single code
unit representable by a value of type wchar_t.
<https://eel.is/c++draft/library#character.seq.general-1.2.sentence-3>

[Note 1 <https://eel.is/c++draft/library#character.seq.general-note-1>: The
encodings of the execution character sets can be unrelated to any literal
encoding.
<https://eel.is/c++draft/library#character.seq.general-1.2.sentence-4> — end
note]

[Note 1: If any element of the literal character set does not have the same
(or any) representation in the execution encoding as it does in the literal
encoding, passing a sequence of characters encoded in the literal encoding
to a standard library function expecting an argument in the execution
encoding can produce unexpected effects or result in undefined behavior.

Similarly, library functions that expect their arguments in the literal
encoding can produce unexpected effects or result in undefined behavior
when passed character sequences in the execution encoding that are not
valid in the literal encoding.]

[Note 2: Sequences of characters are never assumed to be in the execution
or wide execution encodings during constant evaluation.]


I proposed something similar a couple of years ago and I think there was no
appetite for it, but I think we ought to try again.
There are two things that can happen in practice:
 - Both character sets have different mappings but generally the same
scheme, in which case you get mojibake, which cannot be diagnosed.
 - What is a valid sequence in one encoding may not be a valid sequence in
another encoding, in which case you could run into UB in algorithms that do
not check the validity of input sequences (of which we have none today,
afaik) and get a runtime error in other cases.
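
To make the two cases concrete, here is a minimal sketch. It assumes, purely
for illustration, that the two mismatched encodings are UTF-8 and ISO-8859-1
(Latin-1); neither choice comes from the wording above, and the helper names
latin1_to_utf8 and is_valid_utf8 are hypothetical, not standard library
functions.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: treat every byte as an ISO-8859-1 code point and
// re-encode it as UTF-8. Every byte value is a valid Latin-1 character, so
// this never fails, which is exactly why mojibake cannot be diagnosed.
std::string latin1_to_utf8(std::string_view bytes) {
    std::string out;
    for (unsigned char c : bytes) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}

// Hypothetical helper: check that a byte sequence is well-formed UTF-8
// (no overlong forms, no surrogates, nothing above U+10FFFF).
bool is_valid_utf8(std::string_view bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) { ++i; continue; }

        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the 2nd byte
        if (lead >= 0xC2 && lead <= 0xDF) {
            len = 2;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            len = 3;
            if (lead == 0xE0) lo = 0xA0;     // reject overlong encodings
            if (lead == 0xED) hi = 0x9F;     // reject surrogates
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            len = 4;
            if (lead == 0xF0) lo = 0x90;     // reject overlong encodings
            if (lead == 0xF4) hi = 0x8F;     // reject values above U+10FFFF
        } else {
            return false;                    // stray continuation or bad lead
        }

        if (bytes.size() - i < len) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            const unsigned char min = (k == 1) ? lo : 0x80;
            const unsigned char max = (k == 1) ? hi : 0xBF;
            if (b < min || b > max) return false;
        }
        i += len;
    }
    return true;
}
```

Under those assumptions, the UTF-8 bytes of "é" ("\xC3\xA9") read as Latin-1
come out as the two characters "Ã©": a perfectly valid sequence with the wrong
meaning, so nothing can flag it. In the other direction, the Latin-1 bytes
"caf\xE9" are rejected by is_valid_utf8, and a function that skipped that
check would be operating on input that violates its precondition.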

We most likely want to massage that wording. In particular, it's unclear
that we want to say UB, as people are scared of that, but I think it's the
right tool, as it is in effect a precondition violation of functions taking
text as an argument
(and we can't diagnose anything, so erroneous behavior seems impractical).

Robin, is there a standard somewhere that defines "Mojibake" or an
equivalent term that we could use?



Cheers

Received on 2024-10-24 07:58:04