sg12: Re: [ub] C provenance semantics proposal (without pointer equality)

From: Richard Smith <richardsmith_at_[hidden]>
Date: Mon, 15 Apr 2019 13:13:46 -0700

On Mon, Apr 15, 2019 at 7:16 AM Peter Sewell <Peter.Sewell_at_[hidden]>
wrote:

> Hi Richard,
>
> On Sat, 13 Apr 2019 at 02:45, Richard Smith <richardsmith_at_[hidden]>
> wrote:
> >
> > On Fri, Apr 12, 2019 at 5:09 AM Peter Sewell <Peter.Sewell_at_[hidden]>
> wrote:
> >>
> >> Hi all,
> >>
> >> perhaps I can reset this discussion, which got bogged down in largely
> >> independent questions about pointer equality, by summarising the
> >> basic provenance proposal. Try this below, extracted from n2363.
> >> Comments on this?
> >
> >
> > Thank you for prompting this discussion, and for the huge work that's
> gone into N2363.
> >
> >>
> >> best,
> >> Peter
> >>
> >>
> >>
> >> C pointer values are typically represented at runtime as simple
> >> concrete numeric values, but mainstream compilers routinely exploit
> >> information about the "provenance" of pointers to reason that they
> >> cannot alias, and hence to justify optimisations. This is
> >> long-standing practice, but exactly what it means (what programmers
> >> can rely on, and what provenance-based alias analysis is allowed to
> >> do), has never been nailed down.
> >
> >
> > Not in C, but C++ has at least some rules for this already (although
> they are incomplete).
> >
> >> That's what the proposal does.
> >>
> >>
> >> The basic idea is to associate a *provenance* with every pointer
> >> value, identifying the original storage instance (or allocation, in
> >> other words) that the pointer is derived from. In more detail:
> >>
> >> - We take abstract-machine pointer values to be pairs (pi,a), adding a
> >> provenance pi, either @i where i is a storage instance ID, or the
> >> *empty* provenance, to their concrete address a.
> >>
> >> - On every storage instance creation (of objects with static, thread,
> >> automatic, and allocated storage duration), the abstract machine
> >> nondeterministically chooses a fresh storage instance ID i (unique
> >> across the entire execution), and the resulting pointer value
> >> carries that single storage instance ID as its provenance @i.
> >
> >
> > C++'s rules are somewhat stronger than this. The "provenance"
> information we use is associated with an object, not merely with a region
> of storage. For example, given:
> >
> > char buffer[32];
> > struct A { int n; };
> > struct B : A {};
> > struct C : A {};
> > A *p = new (buffer) B;
> > A *q = new (buffer) C;
> >
> > ... the "provenance" associated with p and q are different. In C++'s
> terminology, p and q point to two different A objects (with non-overlapping
> lifetime; reuse of the storage containing *p results in the end of the
> lifetime of the *p object) that occupy the same storage (at different
> times).
>
> Y. In C, of course, we don't have these handy syntactic markers of
> when an "object" is created within a malloc'd region.
>
> >> - Provenance is preserved by pointer arithmetic that adds or subtracts
> >> an integer to a pointer.
> >>
> >> - At any access via a pointer value, its numeric address must be
> >> consistent with its provenance, with undefined behaviour
> >> otherwise.
> >
> >
> > In C++ terms, an object can only be accessed within its lifetime.
> >
> >>
> >> In particular:
> >>
> >> -- access via a pointer value which has provenance a single storage
> >> instance ID @i must be within the memory footprint of the
> >> corresponding original storage instance, which must still be
> >> live.
> >>
> >> -- all other accesses, including those via a pointer value with
> >> empty provenance, are undefined behaviour.
> >>
> >> Regarding such accesses as undefined behaviour is necessary to make
> >> optimisation based on provenance alias analysis sound: if the standard
> >> did define behaviour for programs that make provenance-violating
> >> accesses, e.g.~by adopting a concrete semantics, optimisation based on
> >> provenance-aware alias analysis would not be sound. In other words,
> >> the provenance lets one distinguish a one-past pointer from a pointer
> >> to the start of an adjacently-allocated object, which otherwise are
> >> indistinguishable.
> >>
> >> All this is for the C abstract machine as defined in the standard:
> >> compilers might rely on provenance in their alias analysis and
> >> optimisation, but one would not expect normal implementations to
> >> record or manipulate provenance at runtime (though dynamic or static
> >> analysis tools might).
> >>
> >>
> >> Then, to support low-level systems programming, C provides many other
> >> ways to construct and manipulate pointer values:
> >
> >
> > I think this is this point where we're going beyond what C++ currently
> specifies.
> >
> >>
> >> - casts of pointers to integer types and back, possibly with integer
> >> arithmetic, e.g.~to force alignment, or to store information in
> >> unused bits of pointers;
> >>
> >> - copying pointer values with memcpy;
> >>
> >> - manipulation of the representation bytes of pointers, e.g.~via user
> >> code that copies them via char* or unsigned char* accesses;
> >>
> >> - type punning between pointer and integer values;
> >>
> >> - I/O, using either fprintf/fscanf and the %p format, fwrite/fread on
> >> the pointer representation bytes, or pointer/integer casts and
> >> integer I/O;
> >>
> >> - copying pointer values with realloc; and
> >>
> >> - constructing pointer values that embody knowledge established from
> >> linking, and from constants that represent the addresses of
> >> memory-mapped devices.
> >>
> >>
> >> A satisfactory semantics has to address all these, together with the
> >> implications on optimisation. We've explored several, but our main
> >> proposal is "PNVI-ae-udi" (provenance not via integers,
> >> address-exposed, user-disambiguation).
> >>
> >> This semantics does not track provenance via integers. Instead, at
> >> integer-to-pointer cast points, it checks whether the given address
> >> points within a live object that has previously been *exposed* and, if
> >> so, recreates the corresponding provenance.
> >
> >
> > For C++, this is almost equivalent to saying that an implicit 'launder'
> is performed on an integer-to-pointer cast. However, this is not a complete
> answer, because we first need to form a pointer value before we launder,
> and it matters which pointer value we form. Consider:
> >
> > struct A { int x; };
> > struct B { int y; A a; };
> > void f(A *a);
> > void g() {
> > B b = {1, 2};
> > f(&b.a);
> > return b.y;
> > }
> >
> > In C++, it is valid to transform 'g' in the above program to:
> >
> > void g() {
> > A a = {2};
> > f(&a);
> > return 1;
> > }
> >
> > ... because we carefully ensure that there is no defined way to navigate
> from an A* to the B* containing it. I'm told that some compilers actually
> perform optimizations in this area, at least in some cases. When we were
> formulating the C++ rules, there was vendor push-back on adding any defined
> mechanism that would permit navigation from an A* to its enclosing B*.
>
> We've also heard suggestions that compilers do things here, but in C
> the container-of idiom seems pervasive, and WG14 at the last meeting
> expressed a large majority that it has to be permitted by the
> standard.
>

This is likely one of the areas where C++ wants to have stricter rules than
C, at least for some categories of types. (We already have limitations on
the types for which offsetof can be used, and don't require offsets to be
constant in all cases, and even permit some flavours of structs to not
store their members within the memory associated with the struct object at
all.) We probably want to follow C in this area at least for
standard-layout types, though.

> As a consequence of the above, the specification for std::launder says:
> >
> > """
> > Expects: [...] All bytes of storage that would be reachable through the
> result are reachable through p (see below).
> > [...]
> > Remarks: [...] A byte of storage is reachable through a pointer value
> that points to an object Y if it is within the storage occupied by Y, an
> object that is pointer-interconvertible with Y, or the
> immediately-enclosing array object if Y is an array element.
> > """
>
> (small thing: you might or might not want that to support movement
> through nested arrays)
>
> > Can this be accommodated by the "recreate the provenance on
> integer-to-pointer cast" model? I think it's not accommodated by the
> approach you describe above -- the above description would seem to suggest
> that we need to allow access to the entirety of the largest live object
> whose address was taken and which encloses the address represented by the
> integer. So casting an A* to an integer could allow navigation to the
> enclosing B, if the B object has also had its address cast to an integer.
>
> Yes. The current proposal allows that. We've not nailed down
> subobject issues, but one approach that several of us favour would be
> to enforce subobject boundaries except for void* or character-type*
> pointer arithmetic - that would let the offsetof container-of idioms
> still work, while permitting error detection in cases where (eg)
> someone does int* pointer arithmetic to move between struct members.
>

I think something like that would make sense.

C++ has a notion of an array of byte-like type providing storage for
another object (with no formal subobject relationship). I think it would
make sense to say that all mechanisms by which storage is allocated (of all
storage durations) implicitly create an array of such a byte-like type
covering the entire allocation, so every object either is, or is nested
within, some array of byte-like type representing a storage allocation.
Then pointer arithmetic on pointers to elements of that enclosing array
would permit arbitrary navigation, and otherwise navigation would be
constrained to subobject relationships.

> Personally I'm not sure how far we should go to defend this "irreversible
> subobject navigation" property, but my implementation isn't one that takes
> advantage of it.
> >
> >> A storage instance is deemed exposed by a cast of a pointer to it to
> >> an integer type, by a read (at non-pointer type) of the representation
> >> of the pointer, or by an output of the pointer using %p.
> >
> >
> > Hmm. Does that mean that evaluation of a pointer-to-integer cast has a
> side-effect, and cannot in general be optimized away even if its result is
> unused?
>
> In the source language, yes. An intermediate language might perhaps
> use a more liberal semantics that doesn't rely on that side effect.
>
> >Or is there an assumption being built in here that the only way to form
> an equal integer value to cast back to a pointer will necessarily involve a
> computation that actually depends on the pointer-to-integer cast, even if
> we don't explicitly require that?
>
> That will typically be true - that's what allocation-address
> nondeterminism buys you.
>
> > How careful do we need to be in future to avoid breaking that
> assumption? Consider case such as:
> >
> > int f(int mode, int offset = 0) {
> > int a, b;
> > if (mode == 1) { return (intptr_t)&a - (intptr_t)&b; }
> > intptr_t a_int = (intptr_t)&a;
> > intptr_t evil_a_int = (intptr_t)&b + offset;
> > int *evil_a = (int*)evil_a_int;
> > printf("%" PRIdPTR " %" PRIdPTR " %p %p\n", a_int, evil_a_int, &a,
> evil_a);
> > if (getchar() == 'x') std::terminate(); // #1, allow user to abort if
> integer/pointer values differ
> > a = 1;
> > return *evil_a;
> > }
> > int main() { return f(2, f(1)); }
>
> The precise interaction of UB with user input is another can of worms
> that we've not yet really opened :-) But, assuming that programs
> can rely on facts about user input (perhaps based on whatever has been
> output), which seems reasonable, and that here the user is "required"
> to press x if the values differ, then
>
> > Is my implementation conforming if it prints out equal integer and
> pointer values here, and yet (after the program is resumed by the user)
> main doesn't return 1? Based on your description, I think the answer is no,
> which means that my weird pointer comparison machine (including a human
> component) is effectively enough to keep the pointer exposed.
>
> As Martin said, the mode==1 execution of f isn't very interesting, as
> the second allocations of a and b could be arbitrarily spaced. But
> then in the mode!=1 execution, a and b are both exposed, so the cast
> (int*)evil_a_int will give usable provenance iff the passed-in offset
> is correct, so if the user is required to check that, this should be
> well-defined to return 1.
>
> If line #1 is deleted, then in some executions a_int will be distinct
> to evil_a_int (by allocation nondeterminism), so indeed it will be UB.
>
> >However, if line #1 is deleted, I believe an implementation that prints
> out equal values and then does not return 1 would be correct: the
> implementation can claim that a_int != evil_a_int, so that the 'return
> *evil_a;' had undefined behavior, and therefore it was permitted to do
> whatever it liked, including printing out values that appeared to be the
> same.
>
> >I /think/ the upshot is that you *can* optimize away a pointer-to-integer
> cast if its value is unused, but you cannot optimize one away in the
> presence of potential data- or control-dependencies on it (including
> indirect ones such as escaping the pointers through the IO system and
> comparing them somewhere else, then reading back the result) that might in
> any way influence a later integer-to-pointer cast. Does that match your
> intent?
>
> yes
>
> >
> > Integer-to-pointer casts seem to have surprising evaluation semantics as
> a result of this approach. It seems reasonable to me to give cases like
> this defined behavior:
> >
> > char buffer[32];
> > struct A { int n; };
> > struct B : A {};
> > struct C : A {};
> > A *p = new (buffer) B;
> > intptr_t x = (intptr_t)p;
> > A *q = new (buffer) C;
> > A *r = (A*)x; // ok, r points to the A base of the C object
> > r->n = 123; // OK, r points to the new A object not the old one
>
> If we think of an analogous example in which the B and C objects are
> heap or stack allocated (with non-overlapping lifetimes) that happen
> to get the same address, with the code checking that, then:
>
> > ... but it matters when the cast from integer type to pointer type is
> performed.
>
> that does indeed make a difference.
>
> >If we move the cast to pointer type earlier, the resulting example would
> not be defined under the "points within a live object" / launder model:
> >
> > A *p = new (buffer) B;
> > intptr_t x = (intptr_t)p;
> > A *r = (A*)x; // ok, r points to the A base of the B object
> > A *q = new (buffer) C;
> > r->n = 123; // undefined: r points to the old A object that is not
> within its lifetime
> >
> > Giving casts between pointer and integer types side-effects, and effects
> that depend on when they're evaluated, makes me nervous.
>
> As (I think) Jens said, the pointer-to-integer cast side-effect, of
> marking the storage instance as exposed, seems like it more-or-less
> has to be a temporal thing.
>

If I'm understanding the model correctly, I think that depends on your
perspective.

If I understand correctly, a pointer-to-integer cast is only temporal to
the extent that there must be something "forcing" it to happen before a
integer-to-pointer cast that depends upon it exposing an object. (Ideally,
I think we'd like to specify that the integer-to-pointer cast has a
dependency on the pointer-to-integer cast, but that's a major can of worms
comparable to the consume memory order.) If we model a program execution as
a set of possible executions (with defined behavior only if all possible
executions have defined behavior), and model pointers as having
nondeterministic corresponding integer values, then a pointer-to-integer
cast (along with the other mechanisms that expose pointers as integers) can
be viewed as a pure mathematical function (and in particular, it has no
temporal dependence nor side-effects), but it's an oracle that exposes
information that is not observable in any other way -- and we can see that
a program that never uses the oracle cannot possibly correctly "guess" the
integer corresponding to a pointer[1][2], so any integer-to-pointer
conversion must necessarily introduce a possible execution with undefined
behavior.

[1]: The oracle is the only way to determine the relevant information
about which execution in the set of possible executions is currently
occurring.
[2]: In principle, a program could keep allocating memory and casting
pointers to integers until it's seen all integer values except one, and
then correctly guess the address of the remaining object. But an
implementation can prevent that by refusing to allocate all addressable
memory.

> The integer-to-pointer cast *could* be left ambiguous until it's
> resolved - but that allows other strange things, eg it can be
> (eventually) resolved to an object that didn't even exist earlier.
> The temporal view seems (I hope :-) quite easy for programmers to
> understand.
>

I think it's surprising either way -- either you find that a cast from
integer to pointer has an execution side-effect, and it matters where you
write it, or you find that you can cast an integer into a pointer to an
object that doesn't yet exist at the point of the cast.

I think the former is surprising for practicing programmers, whereas the
latter is likely only surprising for language lawyers. Moreover, I'd expect
the latter to be the model that implementations actually use, so the
difference between the two will become a theoretical gotcha and we'll end
up with theoretically non-portable code relying on such casts being
timeless.

Regarding the pointer_from_integer_1pg.c example, I think we could address
that case in a somewhat different way. Currently in C++ we have a rule that
says:

"""
When the end of the duration of a region of storage is reached, the values
of all pointers representing the address of any part of that region of
storage become invalid pointer values (6.7.2).
"""

and similarly in C:

"""
The value of a pointer becomes indeterminate when the object it points to
(or just past) reaches the end of its lifetime.
"""

(I'm ignoring the memory model problems with the use of "become[s]" here.)
It would seem natural to extend this so it applies at both the point of
allocation and the point of deallocation. Then it's not the cast to pointer
that has temporal behavior; rather, it's the allocation of the
automatic-storage-duration variable that causes there to be no valid
pointers into that region of storage, and changes the value of 'p' in the
example to an invalid/indeterminate pointer value. (It still ends up
mattering whether you perform the cast inside or outside the function, but
for a different reason.)

This might be equivalent to integer-to-pointer casts being temporal in C
(because objects are by definition the same as the storage regions they
occupy), but not in C++.

>I think it'd be preferable to give them a single-but-unknown provenance,
> following the "pick whichever single value makes the rest of the program
> work" model of P0593R3 (notionally pretty similar to C's effective type
> rule -- you resolve to the first provenance that you use with the pointer).
> That'd mean casts to pointer type are timeless and freely reorderable, and
> both the above examples are defined, not only the first one. However, N2363
> has a scary example involving guessing the address of a function parameter
> that is apparently defanged by the time-dependence of integer-to-pointer
> casts. To what extent is that essential? Is there a different way that
> guessing a pointer value could be disallowed?
>
> Simple allocation-address nondeterminism disallows pointer value guessing.
>
> thanks,
> Peter
>
>
> >> The user-disambiguation refinement adds some complexity but supports
> >> roundtrip casts, from pointer to integer and back, of pointers that
> >> are one-past a storage instance.
> >>
> >>
> >>
>

Received on 2019-04-15 22:14:00