On Mon, Apr 15, 2019 at 2:29 PM Peter Sewell <Peter.Sewell@cl.cam.ac.uk> wrote:
On 15/04/2019, Richard Smith <richardsmith@googlers.com> wrote:
> On Mon, Apr 15, 2019 at 7:16 AM Peter Sewell <Peter.Sewell@cl.cam.ac.uk>
[...]
>> We've also heard suggestions that compilers do things here, but in C
>> the container-of idiom seems pervasive, and WG14 at the last meeting
>> expressed a large majority that it has to be permitted by the
>> standard.
>>
>
> This is likely one of the areas where C++ wants to have stricter rules than
> C, at least for some categories of types. (We already have limitations on
> the types for which offsetof can be used, and don't require offsets to be
> constant in all cases, and even permit some flavours of structs to not
> store their members within the memory associated with the struct object at
> all.) We probably want to follow C in this area at least for
> standard-layout types, though.

y

>> As a consequence of the above, the specification for std::launder says:
>> >
>> > """
>> > Expects: [...] All bytes of storage that would be reachable through the
>> result are reachable through p (see below).
>> > [...]
>> > Remarks: [...] A byte of storage is reachable through a pointer value
>> that points to an object Y if it is within the storage occupied by Y, an
>> object that is pointer-interconvertible with Y, or the
>> immediately-enclosing array object if Y is an array element.
>> > """
>>
>> (small thing: you might or might not want that to support movement
>> through nested arrays)
>>
>> > Can this be accommodated by the "recreate the provenance on
>> integer-to-pointer cast" model? I think it's not accommodated by the
>> approach you describe above -- the above description would seem to
>> suggest
>> that we need to allow access to the entirety of the largest live object
>> whose address was taken and which encloses the address represented by the
>> integer. So casting an A* to an integer could allow navigation to the
>> enclosing B, if the B object has also had its address cast to an integer.
>>
>> Yes.  The current proposal allows that.   We've not nailed down
>> subobject issues, but one approach that several of us favour would be
>> to enforce subobject boundaries except for void* or character-type*
>> pointer arithmetic - that would let the offsetof container-of idioms
>> still work, while permitting error detection in cases where (eg)
>> someone does int* pointer arithmetic to move between struct members.
>>
>
> I think something like that would make sense.
>
> C++ has a notion of an array of byte-like type providing storage for
> another object (with no formal subobject relationship). I think it would
> make sense to say that all mechanisms by which storage is allocated (of all
> storage durations) implicitly create an array of such a byte-like type
> covering the entire allocation, so every object either is, or is nested
> within, some array of byte-like type representing a storage allocation.
> Then pointer arithmetic on pointers to elements of that enclosing array
> would permit arbitrary navigation, and otherwise navigation would be
> constrained to subobject relationships.

y  (Jens G has a proposal for C along those lines, though as I say
we've not yet actually done the subobject bits yet)

>> Personally I'm not sure how far we should go to defend this "irreversible
>> subobject navigation" property, but my implementation isn't one that
>> takes
>> advantage of it.
>> >
>> >> A storage instance is deemed exposed by a cast of a pointer to it to
>> >> an integer type, by a read (at non-pointer type) of the representation
>> >> of the pointer, or by an output of the pointer using %p.
>> >
>> >
>> > Hmm. Does that mean that evaluation of a pointer-to-integer cast has a
>> side-effect, and cannot in general be optimized away even if its result
>> is
>> unused?
>>
>> In the source language, yes.  An intermediate language might perhaps
>> use a more liberal semantics that doesn't rely on that side effect.
>>
>> >Or is there an assumption being built in here that the only way to form
>> an equal integer value to cast back to a pointer will necessarily involve
>> a
>> computation that actually depends on the pointer-to-integer cast, even if
>> we don't explicitly require that?
>>
>> That will typically be true - that's what allocation-address
>> nondeterminism buys you.
>>
>> > How careful do we need to be in future to avoid breaking that
>> assumption? Consider case such as:
>> >
>> > int f(int mode, int offset = 0) {
>> >   int a, b;
>> >   if (mode == 1) { return (intptr_t)&a - (intptr_t)&b; }
>> >   intptr_t a_int = (intptr_t)&a;
>> >   intptr_t evil_a_int = (intptr_t)&b + offset;
>> >   int *evil_a = (int*)evil_a_int;
>> >   printf("%" PRIdPTR " %" PRIdPTR " %p %p\n", a_int, evil_a_int, &a,
>> evil_a);
>> >   if (getchar() == 'x') std::terminate(); // #1, allow user to abort if
>> integer/pointer values differ
>> >   a = 1;
>> >   return *evil_a;
>> > }
>> > int main() { return f(2, f(1)); }
>>
>> The precise interaction of UB with user input is another can of worms
>> that we've not yet really opened :-)     But, assuming that programs
>> can rely on facts about user input (perhaps based on whatever has been
>> output), which seems reasonable, and that here the user is "required"
>> to press x if the values differ, then
>>
>> > Is my implementation conforming if it prints out equal integer and
>> pointer values here, and yet (after the program is resumed by the user)
>> main doesn't return 1? Based on your description, I think the answer is
>> no,
>> which means that my weird pointer comparison machine (including a human
>> component) is effectively enough to keep the pointer exposed.
>>
>> As Martin said, the mode==1 execution of f isn't very interesting, as
>> the second allocations of a and b could be arbitrarily spaced.  But
>> then in the mode!=1 execution, a and b are both exposed, so the cast
>> (int*)evil_a_int will give usable provenance iff the passed-in offset
>> is correct, so if the user is required to check that, this should be
>> well-defined to return 1.
>>
>> If line #1 is deleted, then in some executions a_int will be distinct
>> to evil_a_int (by allocation nondeterminism), so indeed it will be UB.
>>
>> >However, if line #1 is deleted, I believe an implementation that prints
>> out equal values and then does not return 1 would be correct: the
>> implementation can claim that a_int != evil_a_int, so that the 'return
>> *evil_a;' had undefined behavior, and therefore it was permitted to do
>> whatever it liked, including printing out values that appeared to be the
>> same.
>>
>> >I /think/ the upshot is that you *can* optimize away a
>> > pointer-to-integer
>> cast if its value is unused, but you cannot optimize one away in the
>> presence of potential data- or control-dependencies on it (including
>> indirect ones such as escaping the pointers through the IO system and
>> comparing them somewhere else, then reading back the result) that might
>> in
>> any way influence a later integer-to-pointer cast. Does that match your
>> intent?
>>
>> yes
>>
>> >
>> > Integer-to-pointer casts seem to have surprising evaluation semantics
>> > as
>> a result of this approach. It seems reasonable to me to give cases like
>> this defined behavior:
>> >
>> > char buffer[32];
>> > struct A { int n; };
>> > struct B : A {};
>> > struct C : A {};
>> > A *p = new (buffer) B;
>> > intptr_t x = (intptr_t)p;
>> > A *q = new (buffer) C;
>> > A *r = (A*)x; // ok, r points to the A base of the C object
>> > r->n = 123; // OK, r points to the new A object not the old one
>>
>> If we think of an analogous example in which the B and C objects are
>> heap or stack allocated (with non-overlapping lifetimes) that happen
>> to get the same address, with the code checking that, then:
>>
>> > ... but it matters when the cast from integer type to pointer type is
>> performed.
>>
>> that does indeed make a difference.
>>
>> >If we move the cast to pointer type earlier, the resulting example would
>> not be defined under the "points within a live object" / launder model:
>> >
>> > A *p = new (buffer) B;
>> > intptr_t x = (intptr_t)p;
>> > A *r = (A*)x; // ok, r points to the A base of the B object
>> > A *q = new (buffer) C;
>> > r->n = 123; // undefined: r points to the old A object that is not
>> within its lifetime
>> >
>> > Giving casts between pointer and integer types side-effects, and
>> > effects
>> that depend on when they're evaluated, makes me nervous.
>>
>> As (I think) Jens said, the pointer-to-integer cast side-effect, of
>> marking the storage instance as exposed, seems like it more-or-less
>> has to be a temporal thing.
>>
>
> If I'm understanding the model correctly, I think that depends on your
> perspective.
>
> If I understand correctly, a pointer-to-integer cast is only temporal to
> the extent that there must be something "forcing" it to happen before a
> integer-to-pointer cast that depends upon it exposing an object. (Ideally,
> I think we'd like to specify that the integer-to-pointer cast has a
> dependency on the pointer-to-integer cast, but that's a major can of worms
> comparable to the consume memory order.)

(quite. let's not :-)

>  If we model a program execution as
> a set of possible executions (with defined behavior only if all possible
> executions have defined behavior), and model pointers as having
> nondeterministic corresponding integer values, then a pointer-to-integer
> cast (along with the other mechanisms that expose pointers as integers) can
> be viewed as a pure mathematical function (and in particular, it has no
> temporal dependence nor side-effects), but it's an oracle that exposes
> information that is not observable in any other way -- and we can see that
> a program that never uses the oracle cannot possibly correctly "guess" the
> integer corresponding to a pointer[1][2], so any integer-to-pointer
> conversion must necessarily introduce a possible execution with undefined
> behavior.
>  [1]: The oracle is the only way to determine the relevant information
> about which execution in the set of possible executions is currently
> occurring.

All that's a pretty exactly description of the PNVI-plain variant.
PNVI-ae-udi keeps the result of the pointer-to-integer cast pure, but
adds the "make this storage instance exposed" side effect.  And that
makes a difference only to the integer-to-pointer casts one can do.

(treating reads of representation bytes and suchlike as similar to p-to-i casts,
and accesses via pointers that have, for example, had some of their
bytes written via char* pointers, as similar to i-to-p casts)

>  [2]: In principle, a program could keep allocating memory and casting
> pointers to integers until it's seen all integer values except one, and
> then correctly guess the address of the remaining object. But an
> implementation can prevent that by refusing to allocate all addressable
> memory.

y.  This is the concern that pushed Juneyoung Lee et al. into their twin
allocation model.   But instead, for a source language, we can just
limit attention to programs that never almost-exhaust memory
(ie that leave space for a copy of their biggest (and suitably
aligned) allocation).



>
>> The integer-to-pointer cast *could* be left ambiguous until it's
>> resolved - but that allows other strange things, eg it can be
>> (eventually) resolved to an object that didn't even exist earlier.
>> The temporal view seems (I hope :-) quite easy for programmers to
>> understand.
>>
>
> I think it's surprising either way -- either you find that a cast from
> integer to pointer has an execution side-effect, and it matters where you
> write it, or you find that you can cast an integer into a pointer to an
> object that doesn't yet exist at the point of the cast.

true

> I think the former is surprising for practicing programmers, whereas the
> latter is likely only surprising for language lawyers.

hmm - the latter would be annoyingly complex to specify (like the "udi"
part of PNVI-ae-udi but worse, as one gradually accumulates constraints
on what the result of such a cast might be pointing to, based on
what arithmetic is done to the pointer).   It's hard to imagine that becoming
widely understood...

Under p0593r3 (recently approved by WG21's Evolution Working Group and on its way towards C++20), a similar "pick the answer that gives the program defined behavior" rule will already be in use in C++ for other cases. That's not to say that that means it'll be understood, but we will at least have precedent.

> Moreover, I'd expect
> the latter to be the model that implementations actually use,

Can you expand on that?  Not sure I understand.

Here's an example in C++ where implementations might care about which object a pointer points to:

struct A { virtual void f() = 0; };
struct B : A { void f() override; };
struct C : A { void f() override; };
alignas(B, C) std::byte storage[std::max(sizeof(B), sizeof(C))];
A *p = new (storage) B;
intptr_t n = (intptr_t)p;
A *q = (A*)n; // #0
new (storage) C;
q->f(); // #1
q->f(); // #2

The C++ rules allow us to assume that the two q->f() calls call the same function: the same pointer value is used, so the pointer must point to an object with the same dynamic type. This permits us to remove a redundant vptr load in line #2.

Under the temporal model, an implementation could go further and prove that there's a B object within its lifetime at the point of the integer-to-pointer cast in line #0, and thereby decide that #1 and #2 both call B::f instead of C::f. I'm suggesting that implementations aren't going to do that, and that instead they'll only make the more conservative assumption that #1 and #2 load the same vptr value without assuming what that value is.

More broadly, I would expect the implementation model actually used for pointer-to-int conversions and int-to-pointer conversions to be based on (a conservative approximation to) determining on which pointer-to-integer conversions a given integer-to-pointer conversion is value- or control-dependent, rather than giving either conversion side-effects in the intermediate representation.

>  so the
> difference between the two will become a theoretical gotcha and we'll end
> up with theoretically non-portable code relying on such casts being
> timeless.
>
> Regarding the pointer_from_integer_1pg.c example, I think we could address
> that case in a somewhat different way. Currently in C++ we have a rule that
> says:
>
> """
> When the end of the duration of a region of storage is reached, the values
> of all pointers representing the address of any part of that region of
> storage become invalid pointer values (6.7.2).
> """
>
> and similarly in C:
>
> """
> The value of a pointer becomes indeterminate when the object it points to
> (or just past) reaches the end of its lifetime.
> """

Yes, although there is currently a move for C to at least partially remove that,
as there are a bunch of longstanding concurrent algorithms that depend on
== comparison with pointers to lifetime-ended objects.  Not sure what will
happen there.

> (I'm ignoring the memory model problems with the use of "become[s]" here.)
> It would seem natural to extend this so it applies at both the point of
> allocation and the point of deallocation. Then it's not the cast to pointer
> that has temporal behavior; rather, it's the allocation of the
> automatic-storage-duration variable that causes there to be no valid
> pointers into that region of storage, and changes the value of 'p' in the
> example to an invalid/indeterminate pointer value. (It still ends up
> mattering whether you perform the cast inside or outside the function, but
> for a different reason.)

That's exotic - never thought of that possibility.  But I would like to have a
semantics that doesn't rely on lifetime end-zap if we can.

> This might be equivalent to integer-to-pointer casts being temporal in C
> (because objects are by definition the same as the storage regions they
> occupy), but not in C++.
>
>>I think it'd be preferable to give them a single-but-unknown provenance,
>> following the "pick whichever single value makes the rest of the program
>> work" model of P0593R3 (notionally pretty similar to C's effective type
>> rule -- you resolve to the first provenance that you use with the
>> pointer).

If an access is first, that's not too complex, but if pointer arithmetic happens
first, it gets messy to record the resulting constraints.  And we get a lot more
instantaneous action-at-a-distance as these get resolved.

>> That'd mean casts to pointer type are timeless and freely reorderable,
>> and
>> both the above examples are defined, not only the first one. However,
>> N2363
>> has a scary example involving guessing the address of a function
>> parameter
>> that is apparently defanged by the time-dependence of integer-to-pointer
>> casts. To what extent is that essential? Is there a different way that
>> guessing a pointer value could be disallowed?
>>
>> Simple allocation-address nondeterminism disallows pointer value
>> guessing.
>>
>> thanks,
>> Peter
>>
>>
>> >> The user-disambiguation refinement adds some complexity but supports
>> >> roundtrip casts, from pointer to integer and back, of pointers that
>> >> are one-past a storage instance.
>> >>
>> >>
>> >>
>>
>

best,
Peter