On Apr 12, 2019, at 8:09 AM, Peter Sewell <Peter.Sewell@cl.cam.ac.uk> wrote:

Hi all,

perhaps I can reset this discussion, which got bogged down in largely
independent questions about pointer equality, by summarising the
basic provenance proposal.   Try this below, extracted from n2363.
Comments on this?


C pointer values are typically represented at runtime as simple
concrete numeric values, but mainstream compilers routinely exploit
information about the "provenance" of pointers to reason that they
cannot alias, and hence to justify optimisations.  This is
long-standing practice, but exactly what it means (what programmers
can rely on, and what provenance-based alias analysis is allowed to
do), has never been nailed down.   That's what the proposal does.

The basic idea is to associate a *provenance* with every pointer
value, identifying the original storage instance (or allocation, in
other words) that the pointer is derived from.  In more detail:

- We take abstract-machine pointer values to be pairs (pi,a), adding a
 provenance pi, either @i where i is a storage instance ID, or the
 *empty* provenance, to their concrete address a.

- On every storage instance creation (of objects with static, thread,
 automatic, and allocated storage duration), the abstract machine
 nondeterministically chooses a fresh storage instance ID i (unique
 across the entire execution), and the resulting pointer value
 carries that single storage instance ID as its provenance @i.

- Provenance is preserved by pointer arithmetic that adds or subtracts
 an integer to a pointer.

- At any access via a pointer value, its numeric address must be
 consistent with its provenance, with undefined behaviour
 otherwise. In particular:

 -- access via a pointer value which has provenance a single storage
    instance ID @i must be within the memory footprint of the
    corresponding original storage instance, which must still be

 -- all other accesses, including those via a pointer value with
    empty provenance, are undefined behaviour.

Regarding such accesses as undefined behaviour is necessary to make
optimisation based on provenance alias analysis sound: if the standard
did define behaviour for programs that make provenance-violating
accesses, e.g.~by adopting a concrete semantics, optimisation based on
provenance-aware alias analysis would not be sound.  In other words,
the provenance lets one distinguish a one-past pointer from a pointer
to the start of an adjacently-allocated object, which otherwise are

All this is for the C abstract machine as defined in the standard:
compilers might rely on provenance in their alias analysis and
optimisation, but one would not expect normal implementations to
record or manipulate provenance at runtime (though dynamic or static
analysis tools might).

Then, to support low-level systems programming, C provides many other
ways to construct and manipulate pointer values:

- casts of pointers to integer types and back, possibly with integer
 arithmetic, e.g.~to force alignment, or to store information in
 unused bits of pointers;

- copying pointer values with memcpy;

- manipulation of the representation bytes of pointers, e.g.~via user
 code that copies them via char* or unsigned char* accesses;

- type punning between pointer and integer values;

- I/O, using either fprintf/fscanf and the %p format, fwrite/fread on
 the pointer representation bytes, or pointer/integer casts and
 integer I/O;

- copying pointer values with realloc; and

- constructing pointer values that embody knowledge established from
 linking, and from constants that represent the addresses of
 memory-mapped devices.

A satisfactory semantics has to address all these, together with the
implications on optimisation.  We've explored several, but our main
proposal is "PNVI-ae-udi" (provenance not via integers,
address-exposed, user-disambiguation).

This semantics does not track provenance via integers.  Instead, at
integer-to-pointer cast points, it checks whether the given address
points within a live object that has previously been *exposed* and, if
so, recreates the corresponding provenance.

A storage instance is deemed exposed by a cast of a pointer to it to
an integer type, by a read (at non-pointer type) of the representation
of the pointer, or by an output of the pointer using %p.

The user-disambiguation refinement adds some complexity but supports
roundtrip casts, from pointer to integer and back, of pointers that
are one-past a storage instance.

Thanks for that summary.

I _think_ that that also covers XOR linked lists, right?  (https://en.wikipedia.org/wiki/XOR_linked_list)


On 02/04/2019, Peter Sewell <Peter.Sewell@cl.cam.ac.uk> wrote:
Dear UB folk,

continuing the discussion from last year at EuroLLVM, the GNU Tools
and elsewhere, we (the WG14 C memory object model study group) now
have a detailed proposal for pointer provenance semantics, refining
the "provenance not via integers (PNVI)" model presented there.
This will be discussed at the ISO WG14 C standards committee at the
end of April, and comments from the community before then would
be very welcome.   The proposal reconciles the needs of existing code
and the behaviour of existing compilers as well as we can, but it doesn't
exactly match any of the latter, so we'd especially like to know whether
it would be feasible to implement - our hope is that it would only require
minor changes.  It's presented in three documents:

N2362  Moving to a provenance-aware memory model for C: proposal for C2x
by the memory object model study group.  Jens Gustedt, Peter Sewell,
Kayvan Memarian, Victor B. F. Gomes, Martin Uecker.
This introduces the proposal and gives the proposed change to the standard
text, presented as change-highlighted pages of the standard
(though one might want to read the N2363 examples before going into that).

N2363  C provenance semantics: examples.
Peter Sewell, Kayvan Memarian, Victor B. F. Gomes, Jens Gustedt, Martin
This explains the proposal and its design choices with discussion of a
series of examples.

N2364  C provenance semantics: detailed semantics.
Peter Sewell, Kayvan Memarian, Victor B. F. Gomes.
This gives a detailed mathematical semantics for the proposal

In addition, at http://cerberus.cl.cam.ac.uk/cerberus we provide an
executable version of the semantics, with a web interface that
allows one to explore and visualise the behaviour of small test
programs, stepping through and seeing the abstract-machine
memory state including provenance information.   N2363 compares
the results of this for the example programs with gcc, clang, and icc
results, though the tests are really intended as tests of the semantics
rather than compiler tests, so one has to interpret this with care.

Peter (for the study group)

ub mailing list