sg12: Re: [ub] A proposal to define signed overflow submitted?

From: Hyman Rosen <hyman.rosen_at_[hidden]>
Date: Fri, 16 Mar 2018 02:31:57 -0400

On Thu, Mar 15, 2018 at 9:44 PM, Nick Lewycky <nlewycky_at_[hidden]> wrote:
>
> As-if transformations maintain the actions of the written program under
> the assumption that undefined behaviour does not occur. The two are
> fundamentally intertwined.
>

Nonsense. Compilers have been doing as-if manipulations forever without
reference
to undefined behavior. Things like strength reduction, loop unrolling,
replacement of
multiplication by shifts, and a host of other transformations work to
produce the same
results as the written code. The obvious implication of your statement, if
it were true,
is that compilers would be unable to perform as-if transformations for
programming
languages in which all behavior was well defined. That's obviously wrong.

Here's a few thought experiments:
>

The overall problem with your list of examples below is that they are
postulating
meta-knowledge outside of the program itself, while my objections to
undefined
behavior issues deal with the objects of the program. Your examples are
about
process, my desiderata are about results. So your examples are basically
straw
men.

> 1. may a compiler replace "x+1+1" with "x+2" assuming x is an int? The two
> produce the same results in all cases, even if you choose wrapping,
> overflowing, or saturating arithmetic. Suppose you are the vendor of a
> compiler and the user complains "I know the add instruction encoding on my
> CPU, I crafted a pointer to my instruction stream, I failed to find two
> add-with-constant-1 instructions". With what rules would you justify the
> transformation "x+1+1" to "x+2"? Currently the answer is that there is no
> means to form such a pointer: any attempt to do so necessarily executes UB,
> therefore we assume it has not occurred.
>

The language and the compiler do not promise the generation of any
particular
sequence of instructions. (They could if they wanted, but they don't.)
What is
promised, or what was promised by K&R, is that the arithmetic in the
language
maps to the arithmetic provided by the underlying instruction set of the
computer
on which the program runs. Today, and even then, this is overwhelmingly
2's-
complement arithmetic. The compiler can replace x+1+1 with x+2 provided
that
the underlying instruction set produces the same result when either of the
two
operations are executed.

> 2. may a compiler remove "int i = *ptr;" where variable 'i' is
> subsequently unused? What if 'ptr' is unaligned or null? Suppose you are
> the vendor of a compiler and the user complains, "the underlying hardware
> traps on misaligned/null pointers, I expected this execution to trap with
> the appropriate type of trap". With what rules would you justify deleting
> this dead code? Currently the answer is that null pointer dereference and
> misaligned access (well, access of wrong type) are unde
>

Yes. As-if transformations allow for code to be eliminated if it does not
produce
results visible within constructs of the language. Furthermore, C and C++
have
already made provision for users who want the trap to be produced. Declare
the
pointed-to object to be volatile, and the compiler will generate the access.

I believe failure to dereference a provably null pointer at one point
caused a
significant error in the Linux kernel, because within the operating system
code
it happened that there was valid memory at address zero that needed to be
read and written.

> 3. may a compiler remove "int i = 0xdeadbeef;" where 'i' is unused? Suppose
> you are the vendor of a compiler and the user complains "I put that value
> in a stack variable so that I could look through my stack for the magic
> number and know where I was in the stack". With what rules would you
> justify deleting an unused variable? Currently the answer is that there is
> no means to form such a pointer: any attempt to do so necessarily executes
> UB, therefore we assume it has not occurred.
>

"The stack" is not an object defined by the language, and as-if
transformations
allow for the removal of code that does not have visible effect on objects
defined
by the language. Once again, the variable can be declared volatile, and
then the
value will be written, and the user can find it.

> 3a) A variation on question 3 is to put two variable declarations inside
> two arms of an if/else, assigning them to different values. Today every
> compiler will allocate space on the stack for all variables once as part of
> function prologue, except for functions with VLAs or dynamic alloca calls.
> What if the user looks up the stack to determine which side of the if/else
> they were being called from?
>

Once again, volatile variables will be written.

Today, all these optimizations are justified by the fact that any attempt
> to observe the difference before and after optimization must necessarily
> execute undefined behaviour, and executing undefined behaviour is a priori
> assumed to be impossible.
>

No. All these optimizations are justifiable by reference to the constructs
of the
language.

It's possible that there's another way to show that these and other
> optimizations are valid, but I would need to see a proposal on how to do
> that. I've tried to create one myself and failed. Frankly, I'm starting to
> believe that our current situation with UB really is the best possible
> design choice.
>

The key is side-effects. Programs make the results of their computation
visible
externally in some fashion. If those results are the same as those of the
abstract
machine, then the program has been compiled correctly. Otherwise, not.

The difficulty is that the compiler would then need to have a processor
> emulator for the target platform. This is not presently how compilers work.
>

2's-complement integer arithmetic is the same everywhere, so it doesn't
need to
be "emulated", just executed. Floating-point was more difficult in the old
days
(and in fact, that's why template value parameters can't be floats), but
with IEEE
floating-point now being ubiquitous, getting exactly equivalent results for
that is
now also relatively easy.

The notion that a compiler cannot easily evaluate an expression in the same
way
that a target platform would is frankly ridiculous, especially for integer
expressions.

> It's possible that this emulator could in fact be made simple by dealing
> only with the subset of cases that the compiler itself relies on. Presently
> compilers simulate execution on top of a single abstract machine described
> in the language specification, and produce programs with equivalent
> behaviour to the abstract machine. The details of the underlying machine
> *largely* don't trickle upwards to affect evaluation in the abstract
> machine (except when it does, like in sizeof).
>

Sizeof isn't a "trickling upward" of the machine into the compiler, except
to the tiny
extent that the compiler has to pick some size and alignment for its
objects. Once
those are established, calculations of array and record sizes are just
simple arithmetic,
with no machine peculiarities.

Unless you have an example that shows otherwise?

I'm not clever enough to see what possible argument you could make. Maybe
> you can suggest something? The setup is that you've written a program and
> the compiler emits assembly you didn't consider "the obvious lowering". You
> complain to the compiler authors and they say "yes, this is the obvious
> lowering". Now how would you proceed? There's no specification to
> reference to determine whether the compiler did emit the "obvious lowering"
> or not, that's what I meant by "not having standing".
>

I guess we would have the same useless argument that we're having here,
they would leave the compiler unchanged, and I would continue to be annoyed.
What can I say? I've pretty much decided that no one can ever be convinced
that they're wrong, but that doesn't mean I'll stop presenting my side.

(For what it's worth, I have given up on my notion that reading
uninitialized
memory should produce arbitrary but constant results, because of how
MADV_DONTNEED works.)

I agree! But I arrived at the opposite conclusion. I wish compilers
> exploited undefined behaviour to the fullest extent, and I wish they had
> started doing more of it longer ago. If only the optimizer broke people's
> programs the first time they ran them, programmers would have noticed the
> failure sooner and not proceeded with their buggy code. The sooner the
> optimizers get better at breaking buggy code in mean and nasty ways, the
> sooner we'll stop writing even more code with these kinds of bugs in it. I
> would rather deal with that pain now so that we have a better world
> tomorrow.
>

The code is only buggy because the language has made something undefined
that should have been implementation-defined at worst. And the fundamental
problem with undefined behavior is exactly the same as with unspecified
order
of evaluation. The programmers write programs that they think express their
intent correctly. The translation system produces an executable that
matches
their expectations by accident, and years later, another translation system
does
not, and produces a program that no longer works correctly.

As I have said here before ad nauseum, the purpose of a program is to
control
the operation of a computer. If the programming language allows a program
to
have multiple meanings, then the intent of the programmer may not be
mirrored
by the programming system at a point in the future far removed from when the
program was written, and literal disaster may ensue.

The as-if rule only applies if the program can not tell the different
> between the before program and the after program. If you can form pointers
> this way, the program now has a way to observe the difference, revoking the
> compiler's permission to perform the optimization. With these rules, the
> compiler may no longer remove stack variables from the stack, or reorder
> them, etc. I don't think that's the language we want. Do you have a
> suggestion on how to both permit the construction of such pointers and also
> permit this optimization?
>

You can form pointers to objects in this way if those objects exist in a
form
that can be pointed to. But that's not required by the language. For
things
that are in memory, the compiler must make conservative assumptions in
the presence of pointer derefernce unless it can prove that a pointer cannot
access a particular region of storage.

> I don't think C++ users want a language where copies of all automatic
> storage variables must be present on the stack.
>

Neither do I.

>
> There might be a market for a different language which really does behave
> that way, but I should point out that it's been tried. The proposal for
> Friendly C: https://blog.regehr.org/archives/1180 and later The Problem
> with Friendly C: https://blog.regehr.org/archives/1287 .
>

"There are just too many variations, each with its own set of performance
tradeoffs, for consensus to be possible."

In other words, derailed by optimizationists. I'm not even a little
surprised.

Received on 2018-03-16 07:32:21