ISOCPP sg16 List: Re: [isocpp-sg16] Clarifying of text encodings work in the standard library.

From: Steve Downey <sdowney_at_[hidden]>
Date: Thu, 7 Nov 2024 10:59:03 -0500

However, if we don't say UB, I'm SF.

On Thu, Nov 7, 2024 at 10:15 AM Steve Downey <sdowney_at_[hidden]> wrote:

> In my opinion the primary purpose of defining the literal and execution
> encodings is to be able to explain that it is broken to have those
> different if the execution encoding is not a superset of the literal
> encoding, the way UTF-8 is a superset of 7-bit ASCII. This clarifies that.
>
> I also think that it's not UB to get this wrong, as it's utterly
> deterministic and portable broken behavior. Or possibly even not broken. I
> might want to express the sequence "â€™" for reasons of my own, like
> talking about mojibake. I don't think there are any _new_ bits of undefined
> behavior in the standard library functions that process characters; this
> isn't different from taking /dev/random data.
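
To make the "deterministic and portable broken behavior" point concrete, here is a minimal sketch of how the sequence "â€™" arises: the UTF-8 encoding of U+2019 (bytes E2 80 99) is read as if it were Windows-1252 and then re-encoded as UTF-8 for display. The thread names neither encoding, so the choice of Windows-1252 is an assumption, and `reinterpret_cp1252_as_utf8`, `append_utf8`, and `cp1252_high` are hypothetical names used only for illustration. Mapping unassigned Windows-1252 bytes to U+FFFD is likewise a choice made for this sketch.

```cpp
#include <array>
#include <string>
#include <string_view>

// Code points for Windows-1252 bytes 0x80..0x9F.
// 0xFFFD marks the five bytes that Windows-1252 leaves unassigned
// (an assumption of this sketch; other mappings exist).
constexpr std::array<char32_t, 32> cp1252_high = {
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178};

// Append the UTF-8 encoding of a code point in the Basic Multilingual
// Plane (sufficient here: every Windows-1252 character is below U+10000).
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Treat every input byte as one Windows-1252 character and return the
// UTF-8 encoding of the resulting text. No validation happens and none
// is possible: every byte maps to something, which is why this kind of
// mismatch cannot be diagnosed.
std::string reinterpret_cp1252_as_utf8(std::string_view bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        const char32_t cp =
            (b >= 0x80 && b <= 0x9F) ? cp1252_high[b - 0x80] : b;
        append_utf8(out, cp);
    }
    return out;
}
```

Calling `reinterpret_cp1252_as_utf8("\xE2\x80\x99")` yields the UTF-8 encoding of the three characters â, €, and ™. The result is the same on every platform, so it is wrong only relative to what the caller intended.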
>
> mod final wordsmithing, SA.
>
>
> On Thu, Oct 24, 2024 at 3:58 AM Corentin via SG16 <sg16_at_[hidden]>
> wrote:
>
>> Hey folks.
>> Going by recent discussions, we keep spending a lot of time talking about
>> literal and execution encoding in the standard library, which I don't think
>> is particularly useful as, in the general case, things don't work if these
>> unrelated encodings are, in fact, unrelated.
>>
>> I think that it would be nice to clarify the expectation put on the
>> standard library and its usage so that we can eventually move past that
>> point of contention.
>>
>> I think we should add words to [character.seq.general].
>>
>> The execution character set
>> <https://eel.is/c++draft/library#def:character_set,execution> and the execution
>> wide-character set
>> <https://eel.is/c++draft/library#def:wide-character_set,execution> are
>> supersets of the basic literal character set ([lex.charset]
>> <https://eel.is/c++draft/lex.charset>).
>> <https://eel.is/c++draft/library#character.seq.general-1.2.sentence-1>
>> The encodings of the execution character sets (termed execution encoding
>> and wide execution encoding respectively) and the sets of additional
>> elements (if any) are locale-specific.
>> <https://eel.is/c++draft/library#character.seq.general-1.2.sentence-2>
>> Each element of the execution wide-character set is encoded as a single
>> code unit representable by a value of type wchar_t.
>> <https://eel.is/c++draft/library#character.seq.general-1.2.sentence-3>
>>
>> [Note 1 <https://eel.is/c++draft/library#character.seq.general-note-1>: The
>> encodings of the execution character sets can be unrelated to any literal
>> encoding.
>> <https://eel.is/c++draft/library#character.seq.general-1.2.sentence-4> — end
>> note]
>>
>> [Note 1: If any element of the literal character set does not have the
>> same (or any) representation in the execution encoding as it does in the
>> literal encoding, passing a sequence of characters encoded in the literal
>> encoding to a standard library function expecting an argument in the
>> execution encoding can produce unexpected effects or result in undefined
>> behavior.
>>
>> Similarly, library functions that expect their arguments in the literal
>> encoding may produce unexpected effects or result in undefined behavior
>> when passed character sequences in the execution encoding that are not
>> valid in the literal encoding.]
>>
>> [Note 2: Sequences of characters are never assumed to be in the execution
>> or wide execution encodings during constant evaluation.]
>>
>>
>> I proposed something similar a couple of years ago and I think there was
>> no appetite for it, but I think we ought to try again.
>> There are two things that can happen in practice:
>> - Both character sets have different mappings but generally the same
>> scheme, in which case you get mojibake, which cannot be diagnosed.
>> - What is a valid sequence in one encoding may not be a valid
>> sequence in another encoding, in which case you could run into UB in
>> algorithms that do not check the validity of input sequences (of which we
>> have none today, afaik), and you get a runtime error in other cases.
>>
>> We most likely want to massage that wording; in particular, it's unclear
>> that we want to say UB, as people are scared of that, but I think it's the
>> right tool, as it is in effect a precondition violation of functions taking
>> text as arguments
>> (and we can't diagnose anything, so erroneous behavior seems impractical).
>>
>> Robin, is there a standard somewhere that defines "Mojibake" or an
>> equivalent term that we could use?
>>
>>
>>
>> Cheers
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>

Received on 2024-11-07 15:59:18