On Fri, Apr 12, 2019 at 5:09 AM Peter Sewell <Peter.Sewell@cl.cam.ac.uk> wrote:
Hi all,

perhaps I can reset this discussion, which got bogged down in largely
independent questions about pointer equality, by summarising the
basic provenance proposal.   Try this below, extracted from n2363.
Comments on this?

Thank you for prompting this discussion, and for the huge work that's gone into N2363.
 
best,
Peter



C pointer values are typically represented at runtime as simple
concrete numeric values, but mainstream compilers routinely exploit
information about the "provenance" of pointers to reason that they
cannot alias, and hence to justify optimisations.  This is
long-standing practice, but exactly what it means (what programmers
can rely on, and what provenance-based alias analysis is allowed to
do), has never been nailed down.

Not in C, but C++ has at least some rules for this already (although they are incomplete).

That's what the proposal does.


The basic idea is to associate a *provenance* with every pointer
value, identifying the original storage instance (or allocation, in
other words) that the pointer is derived from.  In more detail:

- We take abstract-machine pointer values to be pairs (pi,a), adding a
  provenance pi, either @i where i is a storage instance ID, or the
  *empty* provenance, to their concrete address a.

- On every storage instance creation (of objects with static, thread,
  automatic, and allocated storage duration), the abstract machine
  nondeterministically chooses a fresh storage instance ID i (unique
  across the entire execution), and the resulting pointer value
  carries that single storage instance ID as its provenance @i.

C++'s rules are somewhat stronger than this. The "provenance" information we use is associated with an object, not merely with a region of storage. For example, given:

char buffer[32];
struct A { int n; };
struct B : A {};
struct C : A {};
A *p = new (buffer) B;
A *q = new (buffer) C;

... the "provenance" associated with p and q are different. In C++'s terminology, p and q point to two different A objects (with non-overlapping lifetime; reuse of the storage containing *p results in the end of the lifetime of the *p object) that occupy the same storage (at different times).

- Provenance is preserved by pointer arithmetic that adds or subtracts
  an integer to a pointer.

- At any access via a pointer value, its numeric address must be
  consistent with its provenance, with undefined behaviour
  otherwise.

In C++ terms, an object can only be accessed within its lifetime.
 
In particular:

  -- access via a pointer value which has provenance a single storage
     instance ID @i must be within the memory footprint of the
     corresponding original storage instance, which must still be
     live.

  -- all other accesses, including those via a pointer value with
     empty provenance, are undefined behaviour.

Regarding such accesses as undefined behaviour is necessary to make
optimisation based on provenance alias analysis sound: if the standard
did define behaviour for programs that make provenance-violating
accesses, e.g.~by adopting a concrete semantics, optimisation based on
provenance-aware alias analysis would not be sound.  In other words,
the provenance lets one distinguish a one-past pointer from a pointer
to the start of an adjacently-allocated object, which otherwise are
indistinguishable.

All this is for the C abstract machine as defined in the standard:
compilers might rely on provenance in their alias analysis and
optimisation, but one would not expect normal implementations to
record or manipulate provenance at runtime (though dynamic or static
analysis tools might).


Then, to support low-level systems programming, C provides many other
ways to construct and manipulate pointer values:

I think this is this point where we're going beyond what C++ currently specifies.
 
- casts of pointers to integer types and back, possibly with integer
  arithmetic, e.g.~to force alignment, or to store information in
  unused bits of pointers;

- copying pointer values with memcpy;

- manipulation of the representation bytes of pointers, e.g.~via user
  code that copies them via char* or unsigned char* accesses;

- type punning between pointer and integer values;

- I/O, using either fprintf/fscanf and the %p format, fwrite/fread on
  the pointer representation bytes, or pointer/integer casts and
  integer I/O;

- copying pointer values with realloc; and

- constructing pointer values that embody knowledge established from
  linking, and from constants that represent the addresses of
  memory-mapped devices.


A satisfactory semantics has to address all these, together with the
implications on optimisation.  We've explored several, but our main
proposal is "PNVI-ae-udi" (provenance not via integers,
address-exposed, user-disambiguation).

This semantics does not track provenance via integers.  Instead, at
integer-to-pointer cast points, it checks whether the given address
points within a live object that has previously been *exposed* and, if
so, recreates the corresponding provenance.

For C++, this is almost equivalent to saying that an implicit 'launder' is performed on an integer-to-pointer cast. However, this is not a complete answer, because we first need to form a pointer value before we launder, and it matters which pointer value we form. Consider:

struct A { int x; };
struct B { int y; A a; };
void f(A *a);
void g() {
  B b = {1, 2};
  f(&b.a);
  return b.y;
}

In C++, it is valid to transform 'g' in the above program to:

void g() {
  A a = {2};
  f(&a);
  return 1;
}

... because we carefully ensure that there is no defined way to navigate from an A* to the B* containing it. I'm told that some compilers actually perform optimizations in this area, at least in some cases. When we were formulating the C++ rules, there was vendor push-back on adding any defined mechanism that would permit navigation from an A* to its enclosing B*.

As a consequence of the above, the specification for std::launder says:

"""
Expects: [...] All bytes of storage that would be reachable through the result are reachable through p (see below).
[...]
Remarks: [...] A byte of storage is reachable through a pointer value that points to an object Y if it is within the storage occupied by Y, an object that is pointer-interconvertible with Y, or the immediately-enclosing array object if Y is an array element.
"""

Can this be accommodated by the "recreate the provenance on integer-to-pointer cast" model? I think it's not accommodated by the approach you describe above -- the above description would seem to suggest that we need to allow access to the entirety of the largest live object whose address was taken and which encloses the address represented by the integer. So casting an A* to an integer could allow navigation to the enclosing B, if the B object has also had its address cast to an integer.

Personally I'm not sure how far we should go to defend this "irreversible subobject navigation" property, but my implementation isn't one that takes advantage of it.

A storage instance is deemed exposed by a cast of a pointer to it to
an integer type, by a read (at non-pointer type) of the representation
of the pointer, or by an output of the pointer using %p.

Hmm. Does that mean that evaluation of a pointer-to-integer cast has a side-effect, and cannot in general be optimized away even if its result is unused? Or is there an assumption being built in here that the only way to form an equal integer value to cast back to a pointer will necessarily involve a computation that actually depends on the pointer-to-integer cast, even if we don't explicitly require that? How careful do we need to be in future to avoid breaking that assumption? Consider case such as:

int f(int mode, int offset = 0) {
  int a, b;
  if (mode == 1) { return (intptr_t)&a - (intptr_t)&b; }
  intptr_t a_int = (intptr_t)&a;
  intptr_t evil_a_int = (intptr_t)&b + offset;
  int *evil_a = (int*)evil_a_int;
  printf("%" PRIdPTR " %" PRIdPTR " %p %p\n", a_int, evil_a_int, &a, evil_a);
  if (getchar() == 'x') std::terminate(); // #1, allow user to abort if integer/pointer values differ
  a = 1;
  return *evil_a;
}
int main() { return f(2, f(1)); }

Is my implementation conforming if it prints out equal integer and pointer values here, and yet (after the program is resumed by the user) main doesn't return 1? Based on your description, I think the answer is no, which means that my weird pointer comparison machine (including a human component) is effectively enough to keep the pointer exposed.  However, if line #1 is deleted, I believe an implementation that prints out equal values and then does not return 1 would be correct: the implementation can claim that a_int != evil_a_int, so that the 'return *evil_a;' had undefined behavior, and therefore it was permitted to do whatever it liked, including printing out values that appeared to be the same. I /think/ the upshot is that you *can* optimize away a pointer-to-integer cast if its value is unused, but you cannot optimize one away in the presence of potential data- or control-dependencies on it (including indirect ones such as escaping the pointers through the IO system and comparing them somewhere else, then reading back the result) that might in any way influence a later integer-to-pointer cast. Does that match your intent?


Integer-to-pointer casts seem to have surprising evaluation semantics as a result of this approach. It seems reasonable to me to give cases like this defined behavior:

char buffer[32];
struct A { int n; };
struct B : A {};
struct C : A {};
A *p = new (buffer) B;
intptr_t x = (intptr_t)p;
A *q = new (buffer) C;
A *r = (A*)x; // ok, r points to the A base of the C object
r->n = 123; // OK, r points to the new A object not the old one

... but it matters when the cast from integer type to pointer type is performed. If we move the cast to pointer type earlier, the resulting example would not be defined under the "points within a live object" / launder model:

A *p = new (buffer) B;
intptr_t x = (intptr_t)p;
A *r = (A*)x; // ok, r points to the A base of the B object
A *q = new (buffer) C;
r->n = 123; // undefined: r points to the old A object that is not within its lifetime

Giving casts between pointer and integer types side-effects, and effects that depend on when they're evaluated, makes me nervous. I think it'd be preferable to give them a single-but-unknown provenance, following the "pick whichever single value makes the rest of the program work" model of P0593R3 (notionally pretty similar to C's effective type rule -- you resolve to the first provenance that you use with the pointer). That'd mean casts to pointer type are timeless and freely reorderable, and both the above examples are defined, not only the first one. However, N2363 has a scary example involving guessing the address of a function parameter that is apparently defanged by the time-dependence of integer-to-pointer casts. To what extent is that essential? Is there a different way that guessing a pointer value could be disallowed?

The user-disambiguation refinement adds some complexity but supports
roundtrip casts, from pointer to integer and back, of pointers that
are one-past a storage instance.




On 02/04/2019, Peter Sewell <Peter.Sewell@cl.cam.ac.uk> wrote:
> Dear UB folk,
>
> continuing the discussion from last year at EuroLLVM, the GNU Tools
> Cauldron,
> and elsewhere, we (the WG14 C memory object model study group) now
> have a detailed proposal for pointer provenance semantics, refining
> the "provenance not via integers (PNVI)" model presented there.
> This will be discussed at the ISO WG14 C standards committee at the
> end of April, and comments from the community before then would
> be very welcome.   The proposal reconciles the needs of existing code
> and the behaviour of existing compilers as well as we can, but it doesn't
> exactly match any of the latter, so we'd especially like to know whether
> it would be feasible to implement - our hope is that it would only require
> minor changes.  It's presented in three documents:
>
> N2362  Moving to a provenance-aware memory model for C: proposal for C2x
> by the memory object model study group.  Jens Gustedt, Peter Sewell,
> Kayvan Memarian, Victor B. F. Gomes, Martin Uecker.
> This introduces the proposal and gives the proposed change to the standard
> text, presented as change-highlighted pages of the standard
> (though one might want to read the N2363 examples before going into that).
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2362.pdf
>
> N2363  C provenance semantics: examples.
> Peter Sewell, Kayvan Memarian, Victor B. F. Gomes, Jens Gustedt, Martin
> Uecker.
> This explains the proposal and its design choices with discussion of a
> series of examples.
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2363.pdf
>
> N2364  C provenance semantics: detailed semantics.
> Peter Sewell, Kayvan Memarian, Victor B. F. Gomes.
> This gives a detailed mathematical semantics for the proposal
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2364.pdf
>
> In addition, at http://cerberus.cl.cam.ac.uk/cerberus we provide an
> executable version of the semantics, with a web interface that
> allows one to explore and visualise the behaviour of small test
> programs, stepping through and seeing the abstract-machine
> memory state including provenance information.   N2363 compares
> the results of this for the example programs with gcc, clang, and icc
> results, though the tests are really intended as tests of the semantics
> rather than compiler tests, so one has to interpret this with care.
>
> best,
> Peter (for the study group)
>
_______________________________________________
ub mailing list
ub@isocpp.open-std.org
http://www.open-std.org/mailman/listinfo/ub