sg12: Re: [ub] C provenance semantics proposal (without pointer equality)

From: Peter Sewell <Peter.Sewell_at_[hidden]>
Date: Mon, 15 Apr 2019 22:29:34 +0100

On 15/04/2019, Richard Smith <richardsmith_at_[hidden]> wrote:
> On Mon, Apr 15, 2019 at 7:16 AM Peter Sewell <Peter.Sewell_at_[hidden]>
[...]
>> We've also heard suggestions that compilers do things here, but in C
>> the container-of idiom seems pervasive, and WG14 at the last meeting
>> expressed a large majority that it has to be permitted by the
>> standard.
>>
>
> This is likely one of the areas where C++ wants to have stricter rules than
> C, at least for some categories of types. (We already have limitations on
> the types for which offsetof can be used, and don't require offsets to be
> constant in all cases, and even permit some flavours of structs to not
> store their members within the memory associated with the struct object at
> all.) We probably want to follow C in this area at least for
> standard-layout types, though.

y

>> As a consequence of the above, the specification for std::launder says:
>> >
>> > """
>> > Expects: [...] All bytes of storage that would be reachable through the
>> result are reachable through p (see below).
>> > [...]
>> > Remarks: [...] A byte of storage is reachable through a pointer value
>> that points to an object Y if it is within the storage occupied by Y, an
>> object that is pointer-interconvertible with Y, or the
>> immediately-enclosing array object if Y is an array element.
>> > """
>>
>> (small thing: you might or might not want that to support movement
>> through nested arrays)
>>
>> > Can this be accommodated by the "recreate the provenance on
>> integer-to-pointer cast" model? I think it's not accommodated by the
>> approach you describe above -- the above description would seem to
>> suggest
>> that we need to allow access to the entirety of the largest live object
>> whose address was taken and which encloses the address represented by the
>> integer. So casting an A* to an integer could allow navigation to the
>> enclosing B, if the B object has also had its address cast to an integer.
>>
>> Yes. The current proposal allows that. We've not nailed down
>> subobject issues, but one approach that several of us favour would be
>> to enforce subobject boundaries except for void* or character-type*
>> pointer arithmetic - that would let the offsetof container-of idioms
>> still work, while permitting error detection in cases where (eg)
>> someone does int* pointer arithmetic to move between struct members.
>>
>
> I think something like that would make sense.
>
> C++ has a notion of an array of byte-like type providing storage for
> another object (with no formal subobject relationship). I think it would
> make sense to say that all mechanisms by which storage is allocated (of all
> storage durations) implicitly create an array of such a byte-like type
> covering the entire allocation, so every object either is, or is nested
> within, some array of byte-like type representing a storage allocation.
> Then pointer arithmetic on pointers to elements of that enclosing array
> would permit arbitrary navigation, and otherwise navigation would be
> constrained to subobject relationships.

y (Jens G has a proposal for C along those lines, though as I say
we've not yet actually done the subobject bits yet)

>> Personally I'm not sure how far we should go to defend this "irreversible
>> subobject navigation" property, but my implementation isn't one that
>> takes
>> advantage of it.
>> >
>> >> A storage instance is deemed exposed by a cast of a pointer to it to
>> >> an integer type, by a read (at non-pointer type) of the representation
>> >> of the pointer, or by an output of the pointer using %p.
>> >
>> >
>> > Hmm. Does that mean that evaluation of a pointer-to-integer cast has a
>> side-effect, and cannot in general be optimized away even if its result
>> is
>> unused?
>>
>> In the source language, yes. An intermediate language might perhaps
>> use a more liberal semantics that doesn't rely on that side effect.
>>
>> >Or is there an assumption being built in here that the only way to form
>> an equal integer value to cast back to a pointer will necessarily involve
>> a
>> computation that actually depends on the pointer-to-integer cast, even if
>> we don't explicitly require that?
>>
>> That will typically be true - that's what allocation-address
>> nondeterminism buys you.
>>
>> > How careful do we need to be in future to avoid breaking that
>> assumption? Consider case such as:
>> >
>> > int f(int mode, int offset = 0) {
>> > int a, b;
>> > if (mode == 1) { return (intptr_t)&a - (intptr_t)&b; }
>> > intptr_t a_int = (intptr_t)&a;
>> > intptr_t evil_a_int = (intptr_t)&b + offset;
>> > int *evil_a = (int*)evil_a_int;
>> > printf("%" PRIdPTR " %" PRIdPTR " %p %p\n", a_int, evil_a_int, &a,
>> evil_a);
>> > if (getchar() == 'x') std::terminate(); // #1, allow user to abort if
>> integer/pointer values differ
>> > a = 1;
>> > return *evil_a;
>> > }
>> > int main() { return f(2, f(1)); }
>>
>> The precise interaction of UB with user input is another can of worms
>> that we've not yet really opened :-) But, assuming that programs
>> can rely on facts about user input (perhaps based on whatever has been
>> output), which seems reasonable, and that here the user is "required"
>> to press x if the values differ, then
>>
>> > Is my implementation conforming if it prints out equal integer and
>> pointer values here, and yet (after the program is resumed by the user)
>> main doesn't return 1? Based on your description, I think the answer is
>> no,
>> which means that my weird pointer comparison machine (including a human
>> component) is effectively enough to keep the pointer exposed.
>>
>> As Martin said, the mode==1 execution of f isn't very interesting, as
>> the second allocations of a and b could be arbitrarily spaced. But
>> then in the mode!=1 execution, a and b are both exposed, so the cast
>> (int*)evil_a_int will give usable provenance iff the passed-in offset
>> is correct, so if the user is required to check that, this should be
>> well-defined to return 1.
>>
>> If line #1 is deleted, then in some executions a_int will be distinct
>> to evil_a_int (by allocation nondeterminism), so indeed it will be UB.
>>
>> >However, if line #1 is deleted, I believe an implementation that prints
>> out equal values and then does not return 1 would be correct: the
>> implementation can claim that a_int != evil_a_int, so that the 'return
>> *evil_a;' had undefined behavior, and therefore it was permitted to do
>> whatever it liked, including printing out values that appeared to be the
>> same.
>>
>> >I /think/ the upshot is that you *can* optimize away a
>> > pointer-to-integer
>> cast if its value is unused, but you cannot optimize one away in the
>> presence of potential data- or control-dependencies on it (including
>> indirect ones such as escaping the pointers through the IO system and
>> comparing them somewhere else, then reading back the result) that might
>> in
>> any way influence a later integer-to-pointer cast. Does that match your
>> intent?
>>
>> yes
>>
>> >
>> > Integer-to-pointer casts seem to have surprising evaluation semantics
>> > as
>> a result of this approach. It seems reasonable to me to give cases like
>> this defined behavior:
>> >
>> > char buffer[32];
>> > struct A { int n; };
>> > struct B : A {};
>> > struct C : A {};
>> > A *p = new (buffer) B;
>> > intptr_t x = (intptr_t)p;
>> > A *q = new (buffer) C;
>> > A *r = (A*)x; // ok, r points to the A base of the C object
>> > r->n = 123; // OK, r points to the new A object not the old one
>>
>> If we think of an analogous example in which the B and C objects are
>> heap or stack allocated (with non-overlapping lifetimes) that happen
>> to get the same address, with the code checking that, then:
>>
>> > ... but it matters when the cast from integer type to pointer type is
>> performed.
>>
>> that does indeed make a difference.
>>
>> >If we move the cast to pointer type earlier, the resulting example would
>> not be defined under the "points within a live object" / launder model:
>> >
>> > A *p = new (buffer) B;
>> > intptr_t x = (intptr_t)p;
>> > A *r = (A*)x; // ok, r points to the A base of the B object
>> > A *q = new (buffer) C;
>> > r->n = 123; // undefined: r points to the old A object that is not
>> within its lifetime
>> >
>> > Giving casts between pointer and integer types side-effects, and
>> > effects
>> that depend on when they're evaluated, makes me nervous.
>>
>> As (I think) Jens said, the pointer-to-integer cast side-effect, of
>> marking the storage instance as exposed, seems like it more-or-less
>> has to be a temporal thing.
>>
>
> If I'm understanding the model correctly, I think that depends on your
> perspective.
>
> If I understand correctly, a pointer-to-integer cast is only temporal to
> the extent that there must be something "forcing" it to happen before a
> integer-to-pointer cast that depends upon it exposing an object. (Ideally,
> I think we'd like to specify that the integer-to-pointer cast has a
> dependency on the pointer-to-integer cast, but that's a major can of worms
> comparable to the consume memory order.)

(quite. let's not :-)

> If we model a program execution as
> a set of possible executions (with defined behavior only if all possible
> executions have defined behavior), and model pointers as having
> nondeterministic corresponding integer values, then a pointer-to-integer
> cast (along with the other mechanisms that expose pointers as integers) can
> be viewed as a pure mathematical function (and in particular, it has no
> temporal dependence nor side-effects), but it's an oracle that exposes
> information that is not observable in any other way -- and we can see that
> a program that never uses the oracle cannot possibly correctly "guess" the
> integer corresponding to a pointer[1][2], so any integer-to-pointer
> conversion must necessarily introduce a possible execution with undefined
> behavior.
> [1]: The oracle is the only way to determine the relevant information
> about which execution in the set of possible executions is currently
> occurring.

All that's a pretty exactly description of the PNVI-plain variant.
PNVI-ae-udi keeps the result of the pointer-to-integer cast pure, but
adds the "make this storage instance exposed" side effect. And that
makes a difference only to the integer-to-pointer casts one can do.

(treating reads of representation bytes and suchlike as similar to p-to-i casts,
and accesses via pointers that have, for example, had some of their
bytes written via char* pointers, as similar to i-to-p casts)

> [2]: In principle, a program could keep allocating memory and casting
> pointers to integers until it's seen all integer values except one, and
> then correctly guess the address of the remaining object. But an
> implementation can prevent that by refusing to allocate all addressable
> memory.

y. This is the concern that pushed Juneyoung Lee et al. into their twin
allocation model. But instead, for a source language, we can just
limit attention to programs that never almost-exhaust memory
(ie that leave space for a copy of their biggest (and suitably
aligned) allocation).

>
>> The integer-to-pointer cast *could* be left ambiguous until it's
>> resolved - but that allows other strange things, eg it can be
>> (eventually) resolved to an object that didn't even exist earlier.
>> The temporal view seems (I hope :-) quite easy for programmers to
>> understand.
>>
>
> I think it's surprising either way -- either you find that a cast from
> integer to pointer has an execution side-effect, and it matters where you
> write it, or you find that you can cast an integer into a pointer to an
> object that doesn't yet exist at the point of the cast.

true

> I think the former is surprising for practicing programmers, whereas the
> latter is likely only surprising for language lawyers.

hmm - the latter would be annoyingly complex to specify (like the "udi"
part of PNVI-ae-udi but worse, as one gradually accumulates constraints
on what the result of such a cast might be pointing to, based on
what arithmetic is done to the pointer). It's hard to imagine that becoming
widely understood...

> Moreover, I'd expect
> the latter to be the model that implementations actually use,

Can you expand on that? Not sure I understand.

> so the
> difference between the two will become a theoretical gotcha and we'll end
> up with theoretically non-portable code relying on such casts being
> timeless.
>
> Regarding the pointer_from_integer_1pg.c example, I think we could address
> that case in a somewhat different way. Currently in C++ we have a rule that
> says:
>
> """
> When the end of the duration of a region of storage is reached, the values
> of all pointers representing the address of any part of that region of
> storage become invalid pointer values (6.7.2).
> """
>
> and similarly in C:
>
> """
> The value of a pointer becomes indeterminate when the object it points to
> (or just past) reaches the end of its lifetime.
> """

Yes, although there is currently a move for C to at least partially remove that,
as there are a bunch of longstanding concurrent algorithms that depend on
== comparison with pointers to lifetime-ended objects. Not sure what will
happen there.

> (I'm ignoring the memory model problems with the use of "become[s]" here.)
> It would seem natural to extend this so it applies at both the point of
> allocation and the point of deallocation. Then it's not the cast to pointer
> that has temporal behavior; rather, it's the allocation of the
> automatic-storage-duration variable that causes there to be no valid
> pointers into that region of storage, and changes the value of 'p' in the
> example to an invalid/indeterminate pointer value. (It still ends up
> mattering whether you perform the cast inside or outside the function, but
> for a different reason.)

That's exotic - never thought of that possibility. But I would like to have a
semantics that doesn't rely on lifetime end-zap if we can.

> This might be equivalent to integer-to-pointer casts being temporal in C
> (because objects are by definition the same as the storage regions they
> occupy), but not in C++.
>
>>I think it'd be preferable to give them a single-but-unknown provenance,
>> following the "pick whichever single value makes the rest of the program
>> work" model of P0593R3 (notionally pretty similar to C's effective type
>> rule -- you resolve to the first provenance that you use with the
>> pointer).

If an access is first, that's not too complex, but if pointer arithmetic happens
first, it gets messy to record the resulting constraints. And we get a lot more
instantaneous action-at-a-distance as these get resolved.

>> That'd mean casts to pointer type are timeless and freely reorderable,
>> and
>> both the above examples are defined, not only the first one. However,
>> N2363
>> has a scary example involving guessing the address of a function
>> parameter
>> that is apparently defanged by the time-dependence of integer-to-pointer
>> casts. To what extent is that essential? Is there a different way that
>> guessing a pointer value could be disallowed?
>>
>> Simple allocation-address nondeterminism disallows pointer value
>> guessing.
>>
>> thanks,
>> Peter
>>
>>
>> >> The user-disambiguation refinement adds some complexity but supports
>> >> roundtrip casts, from pointer to integer and back, of pointers that
>> >> are one-past a storage instance.
>> >>
>> >>
>> >>
>>
>

best,
Peter

Received on 2019-04-15 23:29:36