ISOCPP sg16 List: Re: [isocpp-sg16] Clarifying of text encodings work in the standard library.

From: Steve Downey <sdowney_at_[hidden]>
Date: Thu, 7 Nov 2024 10:15:42 -0500

In my opinion the primary purpose of defining the literal and execution
encodings is to be able to explain that it is broken to have those
different if the execution encoding is not a superset of the literal
encoding, like UTF-8 is a superset of 7-bit ASCII. This wording clarifies that.

I also think that it's not UB to get this wrong, as it's utterly
deterministic and portable broken behavior. Or possibly even not broken. I
might want to express the sequence "â€™" for reasons of my own, like
talking about mojibake. I don't think there are any _new_ bits of undefined
behavior in the standard library functions that process characters; this isn't
different from taking /dev/random data.
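
To make the "deterministic and portable" point concrete, here is a minimal
sketch, not anything proposed for the standard. It takes the UTF-8 bytes of
U+2019 (E2 80 99), reads each byte as if it were Windows-1252, and re-encodes
the result as UTF-8, which yields exactly the mojibake sequence quoted above.
The helper names append_utf8 and cp1252_to_utf8 are made up for this
illustration, and Windows-1252 is only an assumed example of a mismatched
encoding; bytes that Windows-1252 leaves unassigned are mapped to U+FFFD here
by choice.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper (not a standard library function): append the UTF-8
// encoding of a code point in the Basic Multilingual Plane to `out`.
void append_utf8(std::string& out, char32_t cp) {
    assert(cp <= 0xFFFF && "this sketch only handles BMP code points");
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Windows-1252 code points for bytes 0x80..0x9F. Unassigned bytes are mapped
// to U+FFFD, which is a choice made for this sketch.
constexpr char32_t cp1252_high[32] = {
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178};

// Hypothetical helper: treat every byte of `bytes` as a Windows-1252
// character and return the same text encoded as UTF-8. No validation is
// needed because every byte value has a mapping in the table above.
std::string cp1252_to_utf8(std::string_view bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        // Outside 0x80..0x9F, Windows-1252 agrees with Unicode code points.
        char32_t cp = (b >= 0x80 && b <= 0x9F) ? cp1252_high[b - 0x80] : b;
        append_utf8(out, cp);
    }
    return out;
}
```

Calling cp1252_to_utf8("\xE2\x80\x99") gives the UTF-8 encoding of U+00E2,
U+20AC, U+2122 on every run and every platform: wrong text, but nothing
undefined. Pure ASCII input comes back unchanged, which is the superset case.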

Modulo final wordsmithing, SA.

On Thu, Oct 24, 2024 at 3:58 AM Corentin via SG16 <sg16_at_[hidden]>
wrote:

> Hey folks.
> Going by recent discussions, we keep spending a lot of time talking about
> literal and execution encoding in the standard library, which I don't think
> is particularly useful as, in the general case, things don't work if these
> unrelated encodings are, in fact, unrelated.
>
> I think that it would be nice to clarify the expectation put on the
> standard library and its usage so that we can eventually move past that
> point of contention.
>
> I think we should add words to [character.seq.general].
>
> The execution character set
> <https://eel.is/c++draft/library#def:character_set,execution> and the execution
> wide-character set
> <https://eel.is/c++draft/library#def:wide-character_set,execution> are
> supersets of the basic literal character set ([lex.charset]
> <https://eel.is/c++draft/lex.charset>).
> <https://eel.is/c++draft/library#character.seq.general-1.2.sentence-1>
> The encodings of the execution character sets (termed execution encoding
> and wide execution encoding respectively) and the sets of additional
> elements (if any) are locale-specific.
> <https://eel.is/c++draft/library#character.seq.general-1.2.sentence-2>
> Each element of the execution wide-character set is encoded as a single
> code unit representable by a value of type wchar_t.
> <https://eel.is/c++draft/library#character.seq.general-1.2.sentence-3>
>
> [Note 1 <https://eel.is/c++draft/library#character.seq.general-note-1>: The
> encodings of the execution character sets can be unrelated to any literal
> encoding.
> <https://eel.is/c++draft/library#character.seq.general-1.2.sentence-4> — end
> note]
>
> [Note 1: If any element of the literal character set does not have the
> same (or any) representation in the execution encoding as it does in the
> literal encoding, passing a sequence of characters encoded in the literal
> encoding to a standard library function expecting an argument in the
> execution encoding can produce unexpected effects or result in undefined
> behavior.
>
> Similarly, library functions that expect their arguments in the literal
> encoding may produce unexpected effects or result in undefined behavior
> when passed character sequences in the execution encoding which are not
> valid in the literal encoding.]
>
> [Note 2: sequences of characters are never assumed to be in the execution
> or wide execution encodings during constant evaluation]
>
>
> I proposed something similar a couple of years ago and I think there was
> no appetite for it, but I think we ought to try again.
> There are two things that can happen in practice:
> - Both character sets have different mappings but generally the same
> scheme, in which case you get mojibake, which cannot be diagnosed.
> - What is a valid sequence in one encoding may not be a valid sequence in
> another encoding, in which case you could run into UB in algorithms that do
> not check the validity of input sequences (of which we have none today,
> afaik), and you get a runtime error in other cases.
>
> We most likely want to massage that wording; in particular, it's unclear
> that we want to say UB, as people are scared of that, but I think it's the
> right tool, as it is in effect a precondition violation of functions taking
> text as an argument
> (and we can't diagnose anything, so erroneous behavior seems impractical).
>
> Robin, is there a standard somewhere that defines "Mojibake" or an
> equivalent term that we could use?
>
>
>
> Cheers

Received on 2024-11-07 15:15:56