Re: [SG16] Towards a better description of the execution encoding

From: Corentin <corentin.jabot_at_[hidden]>
Date: Tue, 2 Mar 2021 10:35:03 +0100
On Mon, Mar 1, 2021 at 10:32 PM Hubert Tong <
hubert.reinterpretcast_at_[hidden]> wrote:

> On Mon, Mar 1, 2021 at 10:24 AM Corentin via SG16 <sg16_at_[hidden]>
> wrote:
>
>> Hey folks!
>> Last meeting we talked about the relation between the literal & execution
>> encoding.
>>
>> I think there is pressure to solve this issue (encoding names,
>> std::print, other features).
>> In P2297, I suggested that we say the execution character set is a
>> superset of the literal character set, such that any character in the
>> literal character set results in the same code unit sequence
>> whether it is encoded in the literal encoding or execution encoding.
>>
>> Hubert was concerned this was too restrictive because some EBCDIC and
>> ISO 646 variants have code points reserved for "national symbols".
>> Even Shift-JIS is not 100% ASCII compatible (yen sign instead of backslash,
>> overline instead of tilde).
>>
>> I've been thinking about that over the past few days. I think the
>> solution is to not have requirements on the literal character set but
>> rather on the literals themselves.
>>
>> If the execution encoding is UTF-8, "ABC" is interpreted identically
>> whether its encoding is ASCII, ISO 646-IT, or Shift-JIS.
>>
>> However, "C:\\" would be interpreted as "C:\\", "C:ç" and "C:¥"
>> respectively.
>>
>> So we only need to put requirements on the content of individual literals
>> rather than on the entire literal character set (which, P1885
>> notwithstanding, is not observable during execution anyhow).
>>
>>
>> *A way to word that:*
>>
>> The execution encoding is the locale-specific encoding used to interpret
>> character and NTMBS parameters in character functions, multibyte character
>> functions and other locale-specific functions.
>>
> I think we can start from something like this. I am guessing that the
> parallel treatment for wide strings is intended?
>

Indeed!
Although, do we know of platforms where the wide literal encoding and the
wide execution encoding would differ?


>
>
>>
>> If character literals and string literals used as arguments to character
>> functions and locale-specific functions do not represent the same sequence
>> of abstract characters whether they are interpreted with the literal
>> encoding or the execution encoding, the behavior is undefined.
>>
> I don't know how common it is in practice, but deliberately having
> mojibake (as seen in the source) strings is currently a possible way to
> represent strings that are meant to be interpreted at runtime using a
> specific encoding (without immediately raising UB).
>
> Presumably the scope of this wording is meant to encompass cases where the
> contents of character/string literals made their way into the arguments via
> assignment/memcpy/etc.? I think this quickly degenerates to
> "string/character arguments to locale sensitive functions are taken as
> being in the locale-specific encoding (and may be processed in a manner
> that does not match their appearance in the source)".
>

Indeed!


> Also, having '\x5c' included in the UB is presumably unintended. The user
> intent is expressed by the numeric escape. Furthermore, consideration
> should be given (and documented) about whether the UB should apply to
> "unparsed" strings (like the argument given to printf for %s) for "locale
> sensitive" functions. Again though, I think that we're really talking about
> there being "natural consequences" of defined behaviour when "bad input" is
> involved.
>

Are you saying that Undefined Behavior would be too big of a hammer?
The way I see it, mojibake is not something that should happen in a
well-behaved program; that is, it is a precondition violation of
locale-specific functions.
And I think not having that precondition stated makes it harder to specify
the behavior of std::print for example.

My intent is to say: "The standard assumes that all strings are interpreted
by locale-specific functions as being encoded in the execution encoding, and
if that's not the case, you will get mojibake or any other behavior that
may result from your input not being interpreted correctly."
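
To make the 0x5C case from the quoted example concrete, here is a rough
sketch (not proposed wording). The names Encoding and to_utf8 are
hypothetical, and the decoder is deliberately simplified: it assumes
single-byte 7-bit input and varies only the meaning of 0x5C, so it ignores
Shift-JIS multibyte sequences and the other national code points of
ISO 646-IT.

```cpp
#include <string>
#include <string_view>

// Hypothetical names, for illustration only.
enum class Encoding { Ascii, Iso646It, ShiftJis };

// Transcode 7-bit code units to UTF-8. Simplifying assumption: the three
// encodings are treated as differing only in what code unit 0x5C means.
std::string to_utf8(std::string_view bytes, Encoding enc) {
    std::string out;
    for (unsigned char c : bytes) {
        if (c == 0x5C && enc == Encoding::Iso646It) {
            out += "\xC3\xA7";  // U+00E7, c with cedilla
        } else if (c == 0x5C && enc == Encoding::ShiftJis) {
            out += "\xC2\xA5";  // U+00A5, yen sign
        } else {
            out += static_cast<char>(c);  // same abstract character in all three
        }
    }
    return out;
}
```

With this sketch, the code units of "ABC" denote the same abstract
characters under all three encodings, while the code units 0x43 0x3A 0x5C
do not. That difference is what a per-literal requirement would be about.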

>
>
>>
>> I hope that this resolves Hubert's concerns and that we can refine the
>> general idea and put that in a paper :)
>>
>> Have a great week,
>> Corentin
>>
>

Received on 2021-03-02 03:35:19