sg12: Re: [ub] C provenance semantics proposal (without pointer equality)

From: Peter Sewell <Peter.Sewell_at_[hidden]>
Date: Mon, 15 Apr 2019 15:15:51 +0100

Hi Richard,

On Sat, 13 Apr 2019 at 02:45, Richard Smith <richardsmith_at_[hidden]> wrote:
>
> On Fri, Apr 12, 2019 at 5:09 AM Peter Sewell <Peter.Sewell_at_[hidden]> wrote:
>>
>> Hi all,
>>
>> perhaps I can reset this discussion, which got bogged down in largely
>> independent questions about pointer equality, by summarising the
>> basic provenance proposal. Try this below, extracted from n2363.
>> Comments on this?
>
>
> Thank you for prompting this discussion, and for the huge work that's gone into N2363.
>
>>
>> best,
>> Peter
>>
>>
>>
>> C pointer values are typically represented at runtime as simple
>> concrete numeric values, but mainstream compilers routinely exploit
>> information about the "provenance" of pointers to reason that they
>> cannot alias, and hence to justify optimisations. This is
>> long-standing practice, but exactly what it means (what programmers
>> can rely on, and what provenance-based alias analysis is allowed to
>> do), has never been nailed down.
>
>
> Not in C, but C++ has at least some rules for this already (although they are incomplete).
>
>> That's what the proposal does.
>>
>>
>> The basic idea is to associate a *provenance* with every pointer
>> value, identifying the original storage instance (or allocation, in
>> other words) that the pointer is derived from. In more detail:
>>
>> - We take abstract-machine pointer values to be pairs (pi,a), adding a
>> provenance pi, either @i where i is a storage instance ID, or the
>> *empty* provenance, to their concrete address a.
>>
>> - On every storage instance creation (of objects with static, thread,
>> automatic, and allocated storage duration), the abstract machine
>> nondeterministically chooses a fresh storage instance ID i (unique
>> across the entire execution), and the resulting pointer value
>> carries that single storage instance ID as its provenance @i.
>
>
> C++'s rules are somewhat stronger than this. The "provenance" information we use is associated with an object, not merely with a region of storage. For example, given:
>
> char buffer[32];
> struct A { int n; };
> struct B : A {};
> struct C : A {};
> A *p = new (buffer) B;
> A *q = new (buffer) C;
>
> ... the "provenance" associated with p and q are different. In C++'s terminology, p and q point to two different A objects (with non-overlapping lifetime; reuse of the storage containing *p results in the end of the lifetime of the *p object) that occupy the same storage (at different times).

Y. In C, of course, we don't have these handy syntactic markers of
when an "object" is created within a malloc'd region.

>> - Provenance is preserved by pointer arithmetic that adds or subtracts
>> an integer to a pointer.
>>
>> - At any access via a pointer value, its numeric address must be
>> consistent with its provenance, with undefined behaviour
>> otherwise.
>
>
> In C++ terms, an object can only be accessed within its lifetime.
>
>>
>> In particular:
>>
>> -- access via a pointer value which has provenance a single storage
>> instance ID @i must be within the memory footprint of the
>> corresponding original storage instance, which must still be
>> live.
>>
>> -- all other accesses, including those via a pointer value with
>> empty provenance, are undefined behaviour.
>>
>> Regarding such accesses as undefined behaviour is necessary to make
>> optimisation based on provenance alias analysis sound: if the standard
>> did define behaviour for programs that make provenance-violating
>> accesses, e.g.~by adopting a concrete semantics, optimisation based on
>> provenance-aware alias analysis would not be sound. In other words,
>> the provenance lets one distinguish a one-past pointer from a pointer
>> to the start of an adjacently-allocated object, which otherwise are
>> indistinguishable.
>>
>> All this is for the C abstract machine as defined in the standard:
>> compilers might rely on provenance in their alias analysis and
>> optimisation, but one would not expect normal implementations to
>> record or manipulate provenance at runtime (though dynamic or static
>> analysis tools might).
>>
>>
>> Then, to support low-level systems programming, C provides many other
>> ways to construct and manipulate pointer values:
>
>
> I think this is this point where we're going beyond what C++ currently specifies.
>
>>
>> - casts of pointers to integer types and back, possibly with integer
>> arithmetic, e.g.~to force alignment, or to store information in
>> unused bits of pointers;
>>
>> - copying pointer values with memcpy;
>>
>> - manipulation of the representation bytes of pointers, e.g.~via user
>> code that copies them via char* or unsigned char* accesses;
>>
>> - type punning between pointer and integer values;
>>
>> - I/O, using either fprintf/fscanf and the %p format, fwrite/fread on
>> the pointer representation bytes, or pointer/integer casts and
>> integer I/O;
>>
>> - copying pointer values with realloc; and
>>
>> - constructing pointer values that embody knowledge established from
>> linking, and from constants that represent the addresses of
>> memory-mapped devices.
>>
>>
>> A satisfactory semantics has to address all these, together with the
>> implications on optimisation. We've explored several, but our main
>> proposal is "PNVI-ae-udi" (provenance not via integers,
>> address-exposed, user-disambiguation).
>>
>> This semantics does not track provenance via integers. Instead, at
>> integer-to-pointer cast points, it checks whether the given address
>> points within a live object that has previously been *exposed* and, if
>> so, recreates the corresponding provenance.
>
>
> For C++, this is almost equivalent to saying that an implicit 'launder' is performed on an integer-to-pointer cast. However, this is not a complete answer, because we first need to form a pointer value before we launder, and it matters which pointer value we form. Consider:
>
> struct A { int x; };
> struct B { int y; A a; };
> void f(A *a);
> void g() {
> B b = {1, 2};
> f(&b.a);
> return b.y;
> }
>
> In C++, it is valid to transform 'g' in the above program to:
>
> void g() {
> A a = {2};
> f(&a);
> return 1;
> }
>
> ... because we carefully ensure that there is no defined way to navigate from an A* to the B* containing it. I'm told that some compilers actually perform optimizations in this area, at least in some cases. When we were formulating the C++ rules, there was vendor push-back on adding any defined mechanism that would permit navigation from an A* to its enclosing B*.

We've also heard suggestions that compilers do things here, but in C
the container-of idiom seems pervasive, and WG14 at the last meeting
expressed a large majority that it has to be permitted by the
standard.

> As a consequence of the above, the specification for std::launder says:
>
> """
> Expects: [...] All bytes of storage that would be reachable through the result are reachable through p (see below).
> [...]
> Remarks: [...] A byte of storage is reachable through a pointer value that points to an object Y if it is within the storage occupied by Y, an object that is pointer-interconvertible with Y, or the immediately-enclosing array object if Y is an array element.
> """

(small thing: you might or might not want that to support movement
through nested arrays)

> Can this be accommodated by the "recreate the provenance on integer-to-pointer cast" model? I think it's not accommodated by the approach you describe above -- the above description would seem to suggest that we need to allow access to the entirety of the largest live object whose address was taken and which encloses the address represented by the integer. So casting an A* to an integer could allow navigation to the enclosing B, if the B object has also had its address cast to an integer.

Yes. The current proposal allows that. We've not nailed down
subobject issues, but one approach that several of us favour would be
to enforce subobject boundaries except for void* or character-type*
pointer arithmetic - that would let the offsetof container-of idioms
still work, while permitting error detection in cases where (eg)
someone does int* pointer arithmetic to move between struct members.

> Personally I'm not sure how far we should go to defend this "irreversible subobject navigation" property, but my implementation isn't one that takes advantage of it.
>
>> A storage instance is deemed exposed by a cast of a pointer to it to
>> an integer type, by a read (at non-pointer type) of the representation
>> of the pointer, or by an output of the pointer using %p.
>
>
> Hmm. Does that mean that evaluation of a pointer-to-integer cast has a side-effect, and cannot in general be optimized away even if its result is unused?

In the source language, yes. An intermediate language might perhaps
use a more liberal semantics that doesn't rely on that side effect.

>Or is there an assumption being built in here that the only way to form an equal integer value to cast back to a pointer will necessarily involve a computation that actually depends on the pointer-to-integer cast, even if we don't explicitly require that?

That will typically be true - that's what allocation-address
nondeterminism buys you.

> How careful do we need to be in future to avoid breaking that assumption? Consider case such as:
>
> int f(int mode, int offset = 0) {
> int a, b;
> if (mode == 1) { return (intptr_t)&a - (intptr_t)&b; }
> intptr_t a_int = (intptr_t)&a;
> intptr_t evil_a_int = (intptr_t)&b + offset;
> int *evil_a = (int*)evil_a_int;
> printf("%" PRIdPTR " %" PRIdPTR " %p %p\n", a_int, evil_a_int, &a, evil_a);
> if (getchar() == 'x') std::terminate(); // #1, allow user to abort if integer/pointer values differ
> a = 1;
> return *evil_a;
> }
> int main() { return f(2, f(1)); }

The precise interaction of UB with user input is another can of worms
that we've not yet really opened :-) But, assuming that programs
can rely on facts about user input (perhaps based on whatever has been
output), which seems reasonable, and that here the user is "required"
to press x if the values differ, then

> Is my implementation conforming if it prints out equal integer and pointer values here, and yet (after the program is resumed by the user) main doesn't return 1? Based on your description, I think the answer is no, which means that my weird pointer comparison machine (including a human component) is effectively enough to keep the pointer exposed.

As Martin said, the mode==1 execution of f isn't very interesting, as
the second allocations of a and b could be arbitrarily spaced. But
then in the mode!=1 execution, a and b are both exposed, so the cast
(int*)evil_a_int will give usable provenance iff the passed-in offset
is correct, so if the user is required to check that, this should be
well-defined to return 1.

If line #1 is deleted, then in some executions a_int will be distinct
to evil_a_int (by allocation nondeterminism), so indeed it will be UB.

>However, if line #1 is deleted, I believe an implementation that prints out equal values and then does not return 1 would be correct: the implementation can claim that a_int != evil_a_int, so that the 'return *evil_a;' had undefined behavior, and therefore it was permitted to do whatever it liked, including printing out values that appeared to be the same.

>I /think/ the upshot is that you *can* optimize away a pointer-to-integer cast if its value is unused, but you cannot optimize one away in the presence of potential data- or control-dependencies on it (including indirect ones such as escaping the pointers through the IO system and comparing them somewhere else, then reading back the result) that might in any way influence a later integer-to-pointer cast. Does that match your intent?

yes

>
> Integer-to-pointer casts seem to have surprising evaluation semantics as a result of this approach. It seems reasonable to me to give cases like this defined behavior:
>
> char buffer[32];
> struct A { int n; };
> struct B : A {};
> struct C : A {};
> A *p = new (buffer) B;
> intptr_t x = (intptr_t)p;
> A *q = new (buffer) C;
> A *r = (A*)x; // ok, r points to the A base of the C object
> r->n = 123; // OK, r points to the new A object not the old one

If we think of an analogous example in which the B and C objects are
heap or stack allocated (with non-overlapping lifetimes) that happen
to get the same address, with the code checking that, then:

> ... but it matters when the cast from integer type to pointer type is performed.

that does indeed make a difference.

>If we move the cast to pointer type earlier, the resulting example would not be defined under the "points within a live object" / launder model:
>
> A *p = new (buffer) B;
> intptr_t x = (intptr_t)p;
> A *r = (A*)x; // ok, r points to the A base of the B object
> A *q = new (buffer) C;
> r->n = 123; // undefined: r points to the old A object that is not within its lifetime
>
> Giving casts between pointer and integer types side-effects, and effects that depend on when they're evaluated, makes me nervous.

As (I think) Jens said, the pointer-to-integer cast side-effect, of
marking the storage instance as exposed, seems like it more-or-less
has to be a temporal thing.
The integer-to-pointer cast *could* be left ambiguous until it's
resolved - but that allows other strange things, eg it can be
(eventually) resolved to an object that didn't even exist earlier.
The temporal view seems (I hope :-) quite easy for programmers to
understand.

>I think it'd be preferable to give them a single-but-unknown provenance, following the "pick whichever single value makes the rest of the program work" model of P0593R3 (notionally pretty similar to C's effective type rule -- you resolve to the first provenance that you use with the pointer). That'd mean casts to pointer type are timeless and freely reorderable, and both the above examples are defined, not only the first one. However, N2363 has a scary example involving guessing the address of a function parameter that is apparently defanged by the time-dependence of integer-to-pointer casts. To what extent is that essential? Is there a different way that guessing a pointer value could be disallowed?

Simple allocation-address nondeterminism disallows pointer value guessing.

thanks,
Peter

>> The user-disambiguation refinement adds some complexity but supports
>> roundtrip casts, from pointer to integer and back, of pointers that
>> are one-past a storage instance.
>>
>>
>>

Received on 2019-04-15 16:16:04