ISOCPP sg16 List: Re: [isocpp-sg16] Clarifying of text encodings work in the standard library.

From: Victor Zverovich <victor.zverovich_at_[hidden]>
Date: Wed, 6 Nov 2024 14:18:07 -0800

Looks like an improvement to me. Corentin, do you plan to write (or revive)
a paper?

- Victor

On Thu, Oct 24, 2024 at 12:58 AM Corentin via SG16 <sg16_at_[hidden]>
wrote:

> Hey folks.
> Going by recent discussions, we keep spending a lot of time talking about
> literal and execution encoding in the standard library, which I don't think
> is particularly useful as, in the general case, things don't work if these
> unrelated encodings are, in fact, unrelated.
>
> I think that it would be nice to clarify the expectation put on the
> standard library and its usage so that we can eventually move past that
> point of contention.
>
> I think we should add words to [character.seq.general].
>
> The execution character set
> <https://eel.is/c++draft/library#def:character_set,execution> and the execution
> wide-character set
> <https://eel.is/c++draft/library#def:wide-character_set,execution> are
> supersets of the basic literal character set ([lex.charset]
> <https://eel.is/c++draft/lex.charset>).
> <https://eel.is/c++draft/library#character.seq.general-1.2.sentence-1>
> The encodings of the execution character sets (termed execution encoding
> and wide execution encoding respectively) and the sets of additional
> elements (if any) are locale-specific.
> <https://eel.is/c++draft/library#character.seq.general-1.2.sentence-2>
> Each element of the execution wide-character set is encoded as a single
> code unit representable by a value of type wchar_t.
> <https://eel.is/c++draft/library#character.seq.general-1.2.sentence-3>
>
> [Note 1 <https://eel.is/c++draft/library#character.seq.general-note-1>: The
> encodings of the execution character sets can be unrelated to any literal
> encoding.
> <https://eel.is/c++draft/library#character.seq.general-1.2.sentence-4> — end
> note]
>
> [Note 1: If any element of the literal character set does not have the
> same (or any) representation in the execution encoding as it does in the
> literal encoding, passing a sequence of characters encoded in the literal
> encoding to a standard library function expecting an argument in the
> execution encoding can produce unexpected effects or result in undefined
> behavior.
>
> Similarly, library functions that expect their arguments in the literal
> encoding can produce unexpected effects or result in undefined behavior
> when passed character sequences in the execution encoding that are not
> valid in the literal encoding.]
>
> [Note 2: Sequences of characters are never assumed to be in the execution
> or wide execution encodings during constant evaluation.]
>
>
> I proposed something similar a couple of years ago and I think there was
> no appetite for it, but I think we ought to try again.
> There are two things that can happen in practice:
> - Both character sets have different mappings but generally the same
> scheme, in which case you get mojibake, which cannot be diagnosed.
> - What is a valid sequence in one encoding may not be a valid sequence in
> another encoding, in which case you could run into UB in algorithms that do
> not check the validity of input sequences (of which we have none today,
> afaik), and you get a runtime error in other cases.
>
> We most likely want to massage that wording. In particular, it's unclear
> that we want to say UB, as people are scared of that, but I think it's the
> right tool, as it is in effect a precondition violation of functions taking
> text as an argument
> (and we can't diagnose anything, so erroneous behavior seems impractical).
>
> Robin, is there a standard somewhere that defines "Mojibake" or an
> equivalent term that we could use?
>
>
>
> Cheers
>
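
The two outcomes listed in the quoted message (silent mojibake versus an invalid sequence) can be illustrated with a small sketch. It is not part of the proposed wording: it assumes UTF-8 and ISO/IEC 8859-1 as stand-ins for the two mismatched encodings, and the helpers is_valid_utf8 and latin1_to_utf8 are hypothetical names written only for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
// Rejects stray or missing continuation bytes, overlong forms,
// surrogates, and values above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;
        unsigned long cp = b0;
        unsigned long cp_min = 0;
        if (b0 < 0x80) {
            // single-byte (ASCII) code unit, nothing more to check
        } else if ((b0 & 0xE0) == 0xC0) {
            len = 2; cp = b0 & 0x1F; cp_min = 0x80;
        } else if ((b0 & 0xF0) == 0xE0) {
            len = 3; cp = b0 & 0x0F; cp_min = 0x800;
        } else if ((b0 & 0xF8) == 0xF0) {
            len = 4; cp = b0 & 0x07; cp_min = 0x10000;
        } else {
            return false;  // continuation byte or invalid lead byte
        }
        if (i + len > s.size()) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < cp_min) return false;                   // overlong form
        if (cp > 0x10FFFF) return false;                 // out of range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        i += len;
    }
    return true;
}

// Hypothetical helper: treats every byte of `s` as one ISO/IEC 8859-1
// character and re-encodes it as UTF-8. Any byte sequence is accepted,
// which is why a wrong assumption here cannot be diagnosed.
std::string latin1_to_utf8(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

Under these assumptions, latin1_to_utf8("\xC3\xA9") takes the UTF-8 bytes of U+00E9 and produces the UTF-8 encoding of U+00C3 U+00A9: well-formed output, but the wrong text (the first outcome). In the other direction, is_valid_utf8("\xE9") returns false for the ISO/IEC 8859-1 encoding of the same character, so a function that checks its input can report an error, while one that assumes valid UTF-8 as a precondition has nothing to rely on (the second outcome).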

Received on 2024-11-06 22:18:19