sg12: Re: [ub] C provenance semantics proposal (without pointer equality)

From: Richard Smith <richardsmith_at_[hidden]>
Date: Fri, 12 Apr 2019 18:45:12 -0700

On Fri, Apr 12, 2019 at 5:09 AM Peter Sewell <Peter.Sewell_at_[hidden]>
wrote:

> Hi all,
>
> perhaps I can reset this discussion, which got bogged down in largely
> independent questions about pointer equality, by summarising the
> basic provenance proposal. Try this below, extracted from n2363.
> Comments on this?
>

Thank you for prompting this discussion, and for the huge work that's gone
into N2363.

> best,
> Peter
>
>
>
> C pointer values are typically represented at runtime as simple
> concrete numeric values, but mainstream compilers routinely exploit
> information about the "provenance" of pointers to reason that they
> cannot alias, and hence to justify optimisations. This is
> long-standing practice, but exactly what it means (what programmers
> can rely on, and what provenance-based alias analysis is allowed to
> do), has never been nailed down.

Not in C, but C++ has at least some rules for this already (although they
are incomplete).

That's what the proposal does.
>
>
> The basic idea is to associate a *provenance* with every pointer
> value, identifying the original storage instance (or allocation, in
> other words) that the pointer is derived from. In more detail:
>
> - We take abstract-machine pointer values to be pairs (pi,a), adding a
> provenance pi, either @i where i is a storage instance ID, or the
> *empty* provenance, to their concrete address a.
>
> - On every storage instance creation (of objects with static, thread,
> automatic, and allocated storage duration), the abstract machine
> nondeterministically chooses a fresh storage instance ID i (unique
> across the entire execution), and the resulting pointer value
> carries that single storage instance ID as its provenance @i.
>

C++'s rules are somewhat stronger than this. The "provenance" information
we use is associated with an object, not merely with a region of storage.
For example, given:

char buffer[32];
struct A { int n; };
struct B : A {};
struct C : A {};
A *p = new (buffer) B;
A *q = new (buffer) C;

... the "provenance" associated with p and q are different. In C++'s
terminology, p and q point to two different A objects (with non-overlapping
lifetime; reuse of the storage containing *p results in the end of the
lifetime of the *p object) that occupy the same storage (at different
times).

- Provenance is preserved by pointer arithmetic that adds or subtracts
> an integer to a pointer.
>
> - At any access via a pointer value, its numeric address must be
> consistent with its provenance, with undefined behaviour
> otherwise.

In C++ terms, an object can only be accessed within its lifetime.

> In particular:
>
> -- access via a pointer value which has provenance a single storage
> instance ID @i must be within the memory footprint of the
> corresponding original storage instance, which must still be
> live.
>
> -- all other accesses, including those via a pointer value with
> empty provenance, are undefined behaviour.
>
> Regarding such accesses as undefined behaviour is necessary to make
> optimisation based on provenance alias analysis sound: if the standard
> did define behaviour for programs that make provenance-violating
> accesses, e.g.~by adopting a concrete semantics, optimisation based on
> provenance-aware alias analysis would not be sound. In other words,
> the provenance lets one distinguish a one-past pointer from a pointer
> to the start of an adjacently-allocated object, which otherwise are
> indistinguishable.
>
> All this is for the C abstract machine as defined in the standard:
> compilers might rely on provenance in their alias analysis and
> optimisation, but one would not expect normal implementations to
> record or manipulate provenance at runtime (though dynamic or static
> analysis tools might).
>
>
> Then, to support low-level systems programming, C provides many other
> ways to construct and manipulate pointer values:
>

I think this is this point where we're going beyond what C++ currently
specifies.

> - casts of pointers to integer types and back, possibly with integer
> arithmetic, e.g.~to force alignment, or to store information in
> unused bits of pointers;
>
> - copying pointer values with memcpy;
>
> - manipulation of the representation bytes of pointers, e.g.~via user
> code that copies them via char* or unsigned char* accesses;
>
> - type punning between pointer and integer values;
>
> - I/O, using either fprintf/fscanf and the %p format, fwrite/fread on
> the pointer representation bytes, or pointer/integer casts and
> integer I/O;
>
> - copying pointer values with realloc; and
>
> - constructing pointer values that embody knowledge established from
> linking, and from constants that represent the addresses of
> memory-mapped devices.
>
>
> A satisfactory semantics has to address all these, together with the
> implications on optimisation. We've explored several, but our main
> proposal is "PNVI-ae-udi" (provenance not via integers,
> address-exposed, user-disambiguation).
>
> This semantics does not track provenance via integers. Instead, at
> integer-to-pointer cast points, it checks whether the given address
> points within a live object that has previously been *exposed* and, if
> so, recreates the corresponding provenance.
>

For C++, this is almost equivalent to saying that an implicit 'launder' is
performed on an integer-to-pointer cast. However, this is not a complete
answer, because we first need to form a pointer value before we launder,
and it matters which pointer value we form. Consider:

struct A { int x; };
struct B { int y; A a; };
void f(A *a);
void g() {
  B b = {1, 2};
  f(&b.a);
  return b.y;
}

In C++, it is valid to transform 'g' in the above program to:

void g() {
  A a = {2};
  f(&a);
  return 1;
}

... because we carefully ensure that there is no defined way to navigate
from an A* to the B* containing it. I'm told that some compilers actually
perform optimizations in this area, at least in some cases. When we were
formulating the C++ rules, there was vendor push-back on adding any defined
mechanism that would permit navigation from an A* to its enclosing B*.

As a consequence of the above, the specification for std::launder says:

"""
Expects: [...] All bytes of storage that would be reachable through the
result are reachable through p (see below).
[...]
Remarks: [...] A byte of storage is reachable through a pointer value that
points to an object Y if it is within the storage occupied by Y, an object
that is pointer-interconvertible with Y, or the immediately-enclosing array
object if Y is an array element.
"""

Can this be accommodated by the "recreate the provenance on
integer-to-pointer cast" model? I think it's not accommodated by the
approach you describe above -- the above description would seem to suggest
that we need to allow access to the entirety of the largest live object
whose address was taken and which encloses the address represented by the
integer. So casting an A* to an integer could allow navigation to the
enclosing B, if the B object has also had its address cast to an integer.

Personally I'm not sure how far we should go to defend this "irreversible
subobject navigation" property, but my implementation isn't one that takes
advantage of it.

A storage instance is deemed exposed by a cast of a pointer to it to
> an integer type, by a read (at non-pointer type) of the representation
> of the pointer, or by an output of the pointer using %p.
>

Hmm. Does that mean that evaluation of a pointer-to-integer cast has a
side-effect, and cannot in general be optimized away even if its result is
unused? Or is there an assumption being built in here that the only way to
form an equal integer value to cast back to a pointer will necessarily
involve a computation that actually depends on the pointer-to-integer cast,
even if we don't explicitly require that? How careful do we need to be in
future to avoid breaking that assumption? Consider case such as:

int f(int mode, int offset = 0) {
  int a, b;
  if (mode == 1) { return (intptr_t)&a - (intptr_t)&b; }
  intptr_t a_int = (intptr_t)&a;
  intptr_t evil_a_int = (intptr_t)&b + offset;
  int *evil_a = (int*)evil_a_int;
  printf("%" PRIdPTR " %" PRIdPTR " %p %p\n", a_int, evil_a_int, &a,
evil_a);
  if (getchar() == 'x') std::terminate(); // #1, allow user to abort if
integer/pointer values differ
  a = 1;
  return *evil_a;
}
int main() { return f(2, f(1)); }

Is my implementation conforming if it prints out equal integer and pointer
values here, and yet (after the program is resumed by the user) main
doesn't return 1? Based on your description, I think the answer is no,
which means that my weird pointer comparison machine (including a human
component) is effectively enough to keep the pointer exposed. However, if
line #1 is deleted, I believe an implementation that prints out equal
values and then does not return 1 would be correct: the implementation can
claim that a_int != evil_a_int, so that the 'return *evil_a;' had undefined
behavior, and therefore it was permitted to do whatever it liked, including
printing out values that appeared to be the same. I /think/ the upshot is
that you *can* optimize away a pointer-to-integer cast if its value is
unused, but you cannot optimize one away in the presence of potential data-
or control-dependencies on it (including indirect ones such as escaping the
pointers through the IO system and comparing them somewhere else, then
reading back the result) that might in any way influence a later
integer-to-pointer cast. Does that match your intent?

Integer-to-pointer casts seem to have surprising evaluation semantics as a
result of this approach. It seems reasonable to me to give cases like this
defined behavior:

char buffer[32];
struct A { int n; };
struct B : A {};
struct C : A {};
A *p = new (buffer) B;
intptr_t x = (intptr_t)p;
A *q = new (buffer) C;
A *r = (A*)x; // ok, r points to the A base of the C object
r->n = 123; // OK, r points to the new A object not the old one

... but it matters when the cast from integer type to pointer type is
performed. If we move the cast to pointer type earlier, the resulting
example would not be defined under the "points within a live object" /
launder model:

A *p = new (buffer) B;
intptr_t x = (intptr_t)p;
A *r = (A*)x; // ok, r points to the A base of the B object
A *q = new (buffer) C;
r->n = 123; // undefined: r points to the old A object that is not within
its lifetime

Giving casts between pointer and integer types side-effects, and effects
that depend on when they're evaluated, makes me nervous. I think it'd be
preferable to give them a single-but-unknown provenance, following the
"pick whichever single value makes the rest of the program work" model of
P0593R3 (notionally pretty similar to C's effective type rule -- you
resolve to the first provenance that you use with the pointer). That'd mean
casts to pointer type are timeless and freely reorderable, and both the
above examples are defined, not only the first one. However, N2363 has a
scary example involving guessing the address of a function parameter that
is apparently defanged by the time-dependence of integer-to-pointer casts.
To what extent is that essential? Is there a different way that guessing a
pointer value could be disallowed?

The user-disambiguation refinement adds some complexity but supports
> roundtrip casts, from pointer to integer and back, of pointers that
> are one-past a storage instance.
>
>
>
>
> On 02/04/2019, Peter Sewell <Peter.Sewell_at_[hidden]> wrote:
> > Dear UB folk,
> >
> > continuing the discussion from last year at EuroLLVM, the GNU Tools
> > Cauldron,
> > and elsewhere, we (the WG14 C memory object model study group) now
> > have a detailed proposal for pointer provenance semantics, refining
> > the "provenance not via integers (PNVI)" model presented there.
> > This will be discussed at the ISO WG14 C standards committee at the
> > end of April, and comments from the community before then would
> > be very welcome. The proposal reconciles the needs of existing code
> > and the behaviour of existing compilers as well as we can, but it doesn't
> > exactly match any of the latter, so we'd especially like to know whether
> > it would be feasible to implement - our hope is that it would only
> require
> > minor changes. It's presented in three documents:
> >
> > N2362 Moving to a provenance-aware memory model for C: proposal for C2x
> > by the memory object model study group. Jens Gustedt, Peter Sewell,
> > Kayvan Memarian, Victor B. F. Gomes, Martin Uecker.
> > This introduces the proposal and gives the proposed change to the
> standard
> > text, presented as change-highlighted pages of the standard
> > (though one might want to read the N2363 examples before going into
> that).
> > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2362.pdf
> >
> > N2363 C provenance semantics: examples.
> > Peter Sewell, Kayvan Memarian, Victor B. F. Gomes, Jens Gustedt, Martin
> > Uecker.
> > This explains the proposal and its design choices with discussion of a
> > series of examples.
> > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2363.pdf
> >
> > N2364 C provenance semantics: detailed semantics.
> > Peter Sewell, Kayvan Memarian, Victor B. F. Gomes.
> > This gives a detailed mathematical semantics for the proposal
> > http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2364.pdf
> >
> > In addition, at http://cerberus.cl.cam.ac.uk/cerberus we provide an
> > executable version of the semantics, with a web interface that
> > allows one to explore and visualise the behaviour of small test
> > programs, stepping through and seeing the abstract-machine
> > memory state including provenance information. N2363 compares
> > the results of this for the example programs with gcc, clang, and icc
> > results, though the tests are really intended as tests of the semantics
> > rather than compiler tests, so one has to interpret this with care.
> >
> > best,
> > Peter (for the study group)
> >
> _______________________________________________
> ub mailing list
> ub_at_[hidden]
> http://www.open-std.org/mailman/listinfo/ub
>

Received on 2019-04-13 03:45:26