sg12: Re: [ub] "Provenance" and all that

From: Peter Sewell <Peter.Sewell_at_[hidden]>
Date: Fri, 12 Apr 2019 23:11:41 +0100

Hi Freek,

On 12/04/2019, Freek Wiedijk <freek_at_[hidden]> wrote:
> Hi Peter,
>
> Thanks for the friendly reply to my somewhat frustrated mail,
> I very much appreciate that. And many thanks for adding
> me to the study group list. (I don't know what happened to
> that mail back then, maybe it got caught in the spam filter,
> or lost in my sometimes rather chaotic mailbox.)
>
> If you don't mind, I'll send my questions about Cerberus
> to you privately for a while, in order not to bother the
> people on the lists. At least not until I'm up to speed.

sure

> However, a few replies to what you write (I can't restrain
> myself :-)), feel free to only answer them off-list:

might as well have answers public, so here goes:

>>This is tied up with the many questions relating to subobjects
>>and effective types. We've mostly deferred those up to now,
>>to get the basic proposal clarified, but we're now turning attention
>>to them.
>
> Does the current PNVI-ae-udi semantics already have coherent
> answers to these questions, and is it just a matter of
> finding out what the people in the study group think about
> those choices? Or will the semantics almost certainly need
> modification for these issues anyway (for example for the
> sub-objects), regardless of how people feel?

The current PNVI-ae-udi doesn't include any subobject
or effective-type distinctions. We have a bunch of
questions and examples, and some are clear, but
making a semantics for effective types that's reasonable
from a compiler p.o.v. may be hard.

> And how much of a formal version of this semantics is
> already available? Just this online thing and an updated
> PDF like N2364 to describe it?

At present, just that. The Cerberus source code
(a mix of Lem and pure OCaml, for these things) is closer
to what you want, but it also has a bunch of algorithmic
detail, and isn't yet released.

> Or also something more formal
> (as in a HOL4 or maybe Ott specification)?
>
> For me "effective types", "type punning", "provenance",
> "wobbly values", "lifetime-end zap", etc. all seem closely
> related, and I would like a coherent model that *explains*
> why these things are chosen the way they are. Probably you
> feel like this too?

Sure - though the real explanation has a lot to do with
compiler optimisation choices and history, so a unified
model (which we agree is highly desirable) won't by itself
necessarily explain them.

>>The existence of several well-considered and compared alternatives
>>is surely a good thing for the final candidate, not a negative :-)
>
> I was talking about the fact that it does not yet seem
> completely stable :-)

I'd say PNVI-ae-udi is now pretty stable.

>>> So the "udi" stands for "user-disambiguation". In the
>>> section on "ambiguous provenance" (section 6 of N2362)
>>> it is stated that if there is a cast from an integer to a
>>> pointer that is also at the end of an object, then which
>>> of it "is chosen by the programmer". What does this mean?
>>
>>It is left ambiguous until the first operation on the pointer that
>>requires it to be one or the other.
>
> This is a compile time or runtime thing? And how can it be
> "required" to be a "one past" pointer? So why isn't it
> always the "internal" version of the pointer then?

runtime. Eg if one subtracts one and then does an access, it
must have been a one-past pointer.

>>> #include <stdio.h>
>>> #include <stdint.h>
>>>
>>> int main()
>>> {
>>> int i = 1, j = 2;
>>> uintptr_t pi1 = (uintptr_t) (&i + 1), pj = (uintptr_t) &j;
>>> if (pi1 == pj) {
>>> pi1 += 0*pj; pi1 -= 0*pj; /* fix provenance for PVI */
>>> *(int *) pi1 = 3;
>>> printf("%d,%d,%d\n", i, j, *(int *) pi1);
>>> }
>>> return 0;
>>> }
>>>
>>> With sufficient optimization, gcc causes this to print
>>> "1,2,3". Of all the options *only* in PNVI-ae-udi this is
>>> allowed, right?
>>
>>I don't think that's allowed in any of the PNVI-* models.
>
> What? You mean the optimization of GCC would need to be
> *changed*, restricted, if PNVI-ae-udi++ would get into
> the standard?

It doesn't seem possible to make a reasonable semantics that
exactly hits all the current compiler behaviour, and some
existing compiler behaviour is due (we're told) to long-standing
bugs. So yes, it might be that all compilers would have to be
changed - but hopefully only in rather minor ways. Asking
that question is one reason we've recently raised this here
and also on the gcc and cfe-dev lists. Our impression from discussions
at EuroLLVM and GNU Cauldron meetings is that the model
is basically reasonable from their pov.

> But I don't understand. In PNVI-ae-udi, the programmer can
> choose (?) to have the coercion (int *) pi1 give a pointer
> with the provenance of i, right?

No, the provenance is kept in a superposition state until some
operation (that requires it to be one or the other) collapses it.

> So then *(int *) pi1 = 3
> would have undefined behavior? So why is the output "1,2,3"
> not allowed in that case?
>
>>You can try the example (with i and j flipped so that the default
>>Cerberus allocation policy does what you want) in Cerberus at this
>>link.
>
> Why do i and j need to be flipped, if I want to know what
> *all* possible behaviors of this program are?

By default Cerberus is running with a specific allocator,
otherwise it's too combinatorially heavy. It does also have
a mode with symbolic allocation addresses, maintaining and
resolving SMT constraints with a Z3 backend, but that
doesn't at present support PNVI-*.

> I would
> guess that the program can: print "1,3,3" in case
> ((int *) pi1) gets the provenance of j, exhibit undefined
> behavior which of course includes the other two behaviors in
> case ((int *) pi1) gets the provenance of i, or print nothing
> at all in case pi1 and pj differ. So this seems to hold
> regardless of the order of i and j, why would the order of
> i and j make a difference for the semantics of the program?
>
> But I haven't tried the tool, the link got damaged in
> the mail. So maybe I misunderstand what the tool gives you.
>
>>Note that these are only proposed as *source language* models.
>>The issues for a compiler intermediate language are somewhat
>>different, eg as recently explored by Juneyoung Lee et al. We
>>would hope that there is a refinement relation between the two models,
>>not identity.
>
> I was talking about a relation between source language
> programs: one in which i += 0*(...) is present, and one
> in which it isn't. I would like to be allowed to map both
> programs to the same intermediate language representation
> after optimizations, one that has undefined behavior.
> But that has implications for the *source* language
> semantics, because if the first program does *not* have
> undefined behavior, this is not possible.
>
> But, as PVI seems to be out of the window (for which I'm
> grateful, the less "provenance" the better), it doesn't
> matter.
>
>>I've added you now.
>
> Again: many thanks! And I'll mail you with many more
> questions :-)

sure :-) Either on cl-c-memory-object-model, or (if off-list and
really about Cerberus) cc Kayvan and Victor too.

best,
Peter

> Freek
>

Received on 2019-04-13 00:11:43