sg12: Re: [ub] C provenance semantics proposal (without pointer equality)

From: Richard Smith <richardsmith_at_[hidden]>
Date: Mon, 15 Apr 2019 16:23:23 -0700

On Mon, Apr 15, 2019 at 2:29 PM Peter Sewell <Peter.Sewell_at_[hidden]>
wrote:

> On 15/04/2019, Richard Smith <richardsmith_at_[hidden]> wrote:
> > On Mon, Apr 15, 2019 at 7:16 AM Peter Sewell <Peter.Sewell_at_[hidden]>
> [...]
> >> We've also heard suggestions that compilers do things here, but in C
> >> the container-of idiom seems pervasive, and WG14 at the last meeting
> >> expressed a large majority that it has to be permitted by the
> >> standard.
> >>
> >
> > This is likely one of the areas where C++ wants to have stricter rules
> than
> > C, at least for some categories of types. (We already have limitations on
> > the types for which offsetof can be used, and don't require offsets to be
> > constant in all cases, and even permit some flavours of structs to not
> > store their members within the memory associated with the struct object
> at
> > all.) We probably want to follow C in this area at least for
> > standard-layout types, though.
>
> y
>
> >> As a consequence of the above, the specification for std::launder says:
> >> >
> >> > """
> >> > Expects: [...] All bytes of storage that would be reachable through
> the
> >> result are reachable through p (see below).
> >> > [...]
> >> > Remarks: [...] A byte of storage is reachable through a pointer value
> >> that points to an object Y if it is within the storage occupied by Y, an
> >> object that is pointer-interconvertible with Y, or the
> >> immediately-enclosing array object if Y is an array element.
> >> > """
> >>
> >> (small thing: you might or might not want that to support movement
> >> through nested arrays)
> >>
> >> > Can this be accommodated by the "recreate the provenance on
> >> integer-to-pointer cast" model? I think it's not accommodated by the
> >> approach you describe above -- the above description would seem to
> >> suggest
> >> that we need to allow access to the entirety of the largest live object
> >> whose address was taken and which encloses the address represented by
> the
> >> integer. So casting an A* to an integer could allow navigation to the
> >> enclosing B, if the B object has also had its address cast to an
> integer.
> >>
> >> Yes. The current proposal allows that. We've not nailed down
> >> subobject issues, but one approach that several of us favour would be
> >> to enforce subobject boundaries except for void* or character-type*
> >> pointer arithmetic - that would let the offsetof container-of idioms
> >> still work, while permitting error detection in cases where (eg)
> >> someone does int* pointer arithmetic to move between struct members.
> >>
> >
> > I think something like that would make sense.
> >
> > C++ has a notion of an array of byte-like type providing storage for
> > another object (with no formal subobject relationship). I think it would
> > make sense to say that all mechanisms by which storage is allocated (of
> all
> > storage durations) implicitly create an array of such a byte-like type
> > covering the entire allocation, so every object either is, or is nested
> > within, some array of byte-like type representing a storage allocation.
> > Then pointer arithmetic on pointers to elements of that enclosing array
> > would permit arbitrary navigation, and otherwise navigation would be
> > constrained to subobject relationships.
>
> y (Jens G has a proposal for C along those lines, though as I say
> we've not yet actually done the subobject bits yet)
>
> >> Personally I'm not sure how far we should go to defend this
> "irreversible
> >> subobject navigation" property, but my implementation isn't one that
> >> takes
> >> advantage of it.
> >> >
> >> >> A storage instance is deemed exposed by a cast of a pointer to it to
> >> >> an integer type, by a read (at non-pointer type) of the
> representation
> >> >> of the pointer, or by an output of the pointer using %p.
> >> >
> >> >
> >> > Hmm. Does that mean that evaluation of a pointer-to-integer cast has a
> >> side-effect, and cannot in general be optimized away even if its result
> >> is
> >> unused?
> >>
> >> In the source language, yes. An intermediate language might perhaps
> >> use a more liberal semantics that doesn't rely on that side effect.
> >>
> >> >Or is there an assumption being built in here that the only way to form
> >> an equal integer value to cast back to a pointer will necessarily
> involve
> >> a
> >> computation that actually depends on the pointer-to-integer cast, even
> if
> >> we don't explicitly require that?
> >>
> >> That will typically be true - that's what allocation-address
> >> nondeterminism buys you.
> >>
> >> > How careful do we need to be in future to avoid breaking that
> >> assumption? Consider case such as:
> >> >
> >> > int f(int mode, int offset = 0) {
> >> > int a, b;
> >> > if (mode == 1) { return (intptr_t)&a - (intptr_t)&b; }
> >> > intptr_t a_int = (intptr_t)&a;
> >> > intptr_t evil_a_int = (intptr_t)&b + offset;
> >> > int *evil_a = (int*)evil_a_int;
> >> > printf("%" PRIdPTR " %" PRIdPTR " %p %p\n", a_int, evil_a_int, &a,
> >> evil_a);
> >> > if (getchar() == 'x') std::terminate(); // #1, allow user to abort
> if
> >> integer/pointer values differ
> >> > a = 1;
> >> > return *evil_a;
> >> > }
> >> > int main() { return f(2, f(1)); }
> >>
> >> The precise interaction of UB with user input is another can of worms
> >> that we've not yet really opened :-) But, assuming that programs
> >> can rely on facts about user input (perhaps based on whatever has been
> >> output), which seems reasonable, and that here the user is "required"
> >> to press x if the values differ, then
> >>
> >> > Is my implementation conforming if it prints out equal integer and
> >> pointer values here, and yet (after the program is resumed by the user)
> >> main doesn't return 1? Based on your description, I think the answer is
> >> no,
> >> which means that my weird pointer comparison machine (including a human
> >> component) is effectively enough to keep the pointer exposed.
> >>
> >> As Martin said, the mode==1 execution of f isn't very interesting, as
> >> the second allocations of a and b could be arbitrarily spaced. But
> >> then in the mode!=1 execution, a and b are both exposed, so the cast
> >> (int*)evil_a_int will give usable provenance iff the passed-in offset
> >> is correct, so if the user is required to check that, this should be
> >> well-defined to return 1.
> >>
> >> If line #1 is deleted, then in some executions a_int will be distinct
> >> to evil_a_int (by allocation nondeterminism), so indeed it will be UB.
> >>
> >> >However, if line #1 is deleted, I believe an implementation that prints
> >> out equal values and then does not return 1 would be correct: the
> >> implementation can claim that a_int != evil_a_int, so that the 'return
> >> *evil_a;' had undefined behavior, and therefore it was permitted to do
> >> whatever it liked, including printing out values that appeared to be the
> >> same.
> >>
> >> >I /think/ the upshot is that you *can* optimize away a
> >> > pointer-to-integer
> >> cast if its value is unused, but you cannot optimize one away in the
> >> presence of potential data- or control-dependencies on it (including
> >> indirect ones such as escaping the pointers through the IO system and
> >> comparing them somewhere else, then reading back the result) that might
> >> in
> >> any way influence a later integer-to-pointer cast. Does that match your
> >> intent?
> >>
> >> yes
> >>
> >> >
> >> > Integer-to-pointer casts seem to have surprising evaluation semantics
> >> > as
> >> a result of this approach. It seems reasonable to me to give cases like
> >> this defined behavior:
> >> >
> >> > char buffer[32];
> >> > struct A { int n; };
> >> > struct B : A {};
> >> > struct C : A {};
> >> > A *p = new (buffer) B;
> >> > intptr_t x = (intptr_t)p;
> >> > A *q = new (buffer) C;
> >> > A *r = (A*)x; // ok, r points to the A base of the C object
> >> > r->n = 123; // OK, r points to the new A object not the old one
> >>
> >> If we think of an analogous example in which the B and C objects are
> >> heap or stack allocated (with non-overlapping lifetimes) that happen
> >> to get the same address, with the code checking that, then:
> >>
> >> > ... but it matters when the cast from integer type to pointer type is
> >> performed.
> >>
> >> that does indeed make a difference.
> >>
> >> >If we move the cast to pointer type earlier, the resulting example
> would
> >> not be defined under the "points within a live object" / launder model:
> >> >
> >> > A *p = new (buffer) B;
> >> > intptr_t x = (intptr_t)p;
> >> > A *r = (A*)x; // ok, r points to the A base of the B object
> >> > A *q = new (buffer) C;
> >> > r->n = 123; // undefined: r points to the old A object that is not
> >> within its lifetime
> >> >
> >> > Giving casts between pointer and integer types side-effects, and
> >> > effects
> >> that depend on when they're evaluated, makes me nervous.
> >>
> >> As (I think) Jens said, the pointer-to-integer cast side-effect, of
> >> marking the storage instance as exposed, seems like it more-or-less
> >> has to be a temporal thing.
> >>
> >
> > If I'm understanding the model correctly, I think that depends on your
> > perspective.
> >
> > If I understand correctly, a pointer-to-integer cast is only temporal to
> > the extent that there must be something "forcing" it to happen before a
> > integer-to-pointer cast that depends upon it exposing an object.
> (Ideally,
> > I think we'd like to specify that the integer-to-pointer cast has a
> > dependency on the pointer-to-integer cast, but that's a major can of
> worms
> > comparable to the consume memory order.)
>
> (quite. let's not :-)
>
> > If we model a program execution as
> > a set of possible executions (with defined behavior only if all possible
> > executions have defined behavior), and model pointers as having
> > nondeterministic corresponding integer values, then a pointer-to-integer
> > cast (along with the other mechanisms that expose pointers as integers)
> can
> > be viewed as a pure mathematical function (and in particular, it has no
> > temporal dependence nor side-effects), but it's an oracle that exposes
> > information that is not observable in any other way -- and we can see
> that
> > a program that never uses the oracle cannot possibly correctly "guess"
> the
> > integer corresponding to a pointer[1][2], so any integer-to-pointer
> > conversion must necessarily introduce a possible execution with undefined
> > behavior.
> > [1]: The oracle is the only way to determine the relevant information
> > about which execution in the set of possible executions is currently
> > occurring.
>
> All that's a pretty exactly description of the PNVI-plain variant.
> PNVI-ae-udi keeps the result of the pointer-to-integer cast pure, but
> adds the "make this storage instance exposed" side effect. And that
> makes a difference only to the integer-to-pointer casts one can do.
>
> (treating reads of representation bytes and suchlike as similar to p-to-i
> casts,
> and accesses via pointers that have, for example, had some of their
> bytes written via char* pointers, as similar to i-to-p casts)
>
> > [2]: In principle, a program could keep allocating memory and casting
> > pointers to integers until it's seen all integer values except one, and
> > then correctly guess the address of the remaining object. But an
> > implementation can prevent that by refusing to allocate all addressable
> > memory.
>
> y. This is the concern that pushed Juneyoung Lee et al. into their twin
> allocation model. But instead, for a source language, we can just
> limit attention to programs that never almost-exhaust memory
> (ie that leave space for a copy of their biggest (and suitably
> aligned) allocation).
>
>
>
> >
> >> The integer-to-pointer cast *could* be left ambiguous until it's
> >> resolved - but that allows other strange things, eg it can be
> >> (eventually) resolved to an object that didn't even exist earlier.
> >> The temporal view seems (I hope :-) quite easy for programmers to
> >> understand.
> >>
> >
> > I think it's surprising either way -- either you find that a cast from
> > integer to pointer has an execution side-effect, and it matters where you
> > write it, or you find that you can cast an integer into a pointer to an
> > object that doesn't yet exist at the point of the cast.
>
> true
>
> > I think the former is surprising for practicing programmers, whereas the
> > latter is likely only surprising for language lawyers.
>
> hmm - the latter would be annoyingly complex to specify (like the "udi"
> part of PNVI-ae-udi but worse, as one gradually accumulates constraints
> on what the result of such a cast might be pointing to, based on
> what arithmetic is done to the pointer). It's hard to imagine that
> becoming
> widely understood...
>

Under p0593r3
<http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0593r3.html>
(recently approved by WG21's Evolution Working Group and on its way towards
C++20), a similar "pick the answer that gives the program defined behavior"
rule will already be in use in C++ for other cases. That's not to say that
that means it'll be understood, but we will at least have precedent.

> Moreover, I'd expect
> > the latter to be the model that implementations actually use,
>
> Can you expand on that? Not sure I understand.
>

Here's an example in C++ where implementations might care about which
object a pointer points to:

struct A { virtual void f() = 0; };
struct B : A { void f() override; };
struct C : A { void f() override; };
alignas(B, C) std::byte storage[std::max(sizeof(B), sizeof(C))];
A *p = new (storage) B;
intptr_t n = (intptr_t)p;
A *q = (A*)n; // #0
new (storage) C;
q->f(); // #1
q->f(); // #2

The C++ rules allow us to assume that the two q->f() calls call the same
function: the same pointer value is used, so the pointer must point to an
object with the same dynamic type. This permits us to remove a redundant
vptr load in line #2.

Under the temporal model, an implementation could go further and prove that
there's a B object within its lifetime at the point of the
integer-to-pointer cast in line #0, and thereby decide that #1 and #2 both
call B::f instead of C::f. I'm suggesting that implementations aren't going
to do that, and that instead they'll only make the more conservative
assumption that #1 and #2 load the same vptr value without assuming what
that value is.

More broadly, I would expect the implementation model actually used for
pointer-to-int conversions and int-to-pointer conversions to be based on (a
conservative approximation to) determining on which pointer-to-integer
conversions a given integer-to-pointer conversion is value- or
control-dependent, rather than giving either conversion side-effects in the
intermediate representation.

> so the
> > difference between the two will become a theoretical gotcha and we'll end
> > up with theoretically non-portable code relying on such casts being
> > timeless.
> >
> > Regarding the pointer_from_integer_1pg.c example, I think we could
> address
> > that case in a somewhat different way. Currently in C++ we have a rule
> that
> > says:
> >
> > """
> > When the end of the duration of a region of storage is reached, the
> values
> > of all pointers representing the address of any part of that region of
> > storage become invalid pointer values (6.7.2).
> > """
> >
> > and similarly in C:
> >
> > """
> > The value of a pointer becomes indeterminate when the object it points to
> > (or just past) reaches the end of its lifetime.
> > """
>
> Yes, although there is currently a move for C to at least partially remove
> that,
> as there are a bunch of longstanding concurrent algorithms that depend on
> == comparison with pointers to lifetime-ended objects. Not sure what will
> happen there.
>
> > (I'm ignoring the memory model problems with the use of "become[s]"
> here.)
> > It would seem natural to extend this so it applies at both the point of
> > allocation and the point of deallocation. Then it's not the cast to
> pointer
> > that has temporal behavior; rather, it's the allocation of the
> > automatic-storage-duration variable that causes there to be no valid
> > pointers into that region of storage, and changes the value of 'p' in the
> > example to an invalid/indeterminate pointer value. (It still ends up
> > mattering whether you perform the cast inside or outside the function,
> but
> > for a different reason.)
>
> That's exotic - never thought of that possibility. But I would like to
> have a
> semantics that doesn't rely on lifetime end-zap if we can.
>
> > This might be equivalent to integer-to-pointer casts being temporal in C
> > (because objects are by definition the same as the storage regions they
> > occupy), but not in C++.
> >
> >>I think it'd be preferable to give them a single-but-unknown provenance,
> >> following the "pick whichever single value makes the rest of the program
> >> work" model of P0593R3 (notionally pretty similar to C's effective type
> >> rule -- you resolve to the first provenance that you use with the
> >> pointer).
>
> If an access is first, that's not too complex, but if pointer arithmetic
> happens
> first, it gets messy to record the resulting constraints. And we get a
> lot more
> instantaneous action-at-a-distance as these get resolved.
>
> >> That'd mean casts to pointer type are timeless and freely reorderable,
> >> and
> >> both the above examples are defined, not only the first one. However,
> >> N2363
> >> has a scary example involving guessing the address of a function
> >> parameter
> >> that is apparently defanged by the time-dependence of integer-to-pointer
> >> casts. To what extent is that essential? Is there a different way that
> >> guessing a pointer value could be disallowed?
> >>
> >> Simple allocation-address nondeterminism disallows pointer value
> >> guessing.
> >>
> >> thanks,
> >> Peter
> >>
> >>
> >> >> The user-disambiguation refinement adds some complexity but supports
> >> >> roundtrip casts, from pointer to integer and back, of pointers that
> >> >> are one-past a storage instance.
> >> >>
> >> >>
> >> >>
> >>
> >
>
> best,
> Peter
>

Received on 2019-04-16 01:23:38