ISOCPP Std Discussion List: Re: UB in P2641 'Checking if a union alternative is active'

From: Matthew House <mattlloydhouse_at_[hidden]>
Date: Mon, 19 Jun 2023 22:11:24 -0400

On Mon, Jun 19, 2023 at 8:46 PM Brian Bi via Std-Discussion
<std-discussion_at_[hidden]> wrote:
>> More generally, such an interpretation would completely break
>> mechanisms enabled by [basic.lval]/11 using pointers that happen to
>> refer to union members, since inactive union members would always have
>> preference over reinterpretations allowed by the rule. For instance,
>> suppose that u.c were declared as an unsigned char instead of a char.
>> Then, std::memcpy(dest, &u.i, sizeof(int)) would be UB, since by
>> reinterpreting its argument as an array of unsigned char, memcpy would
>> produce a pointer to u.c, then read past its end. I don't think that's
>> something that can be considered reasonable.
>
> Well, `std::memcpy` can be defined by magic to do the right thing, but I guess you're talking about a user-written analogue. Still, I don't understand your argument. Under the current wording, you can't write such a thing and have it have well-defined behavior according to the letter of the law, regardless of what view you take on whether the `u.c` object exists when it's not active.
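
For concreteness, here is a minimal sketch of the scenario I was describing. It assumes a hypothetical union U with an int member i and an unsigned char member c; the names U and round_trip_through_bytes are mine, purely for illustration. It shows only the call in question and what every implementation I know of does with it, not a ruling on what the wording permits.

```cpp
#include <cstring>

// Hypothetical union for illustration: c is unsigned char, as in the
// variation described above.
union U {
    int i;
    unsigned char c;
};

// Stores `value` in u.i, copies the object representation of u.i into a
// byte buffer with std::memcpy, then copies those bytes back into an int.
inline int round_trip_through_bytes(int value) {
    U u;
    u.i = value;  // u.i is now the active member; u.c is inactive

    unsigned char dest[sizeof(int)];
    // The call under discussion: all sizeof(int) bytes are expected to
    // be copied, even though u.c (a single byte) lives at the same address.
    std::memcpy(dest, &u.i, sizeof(int));

    int out;
    std::memcpy(&out, dest, sizeof(int));
    return out;
}
```

If the inactive u.c took preference here, the copy would be limited to a single byte, which is the outcome I consider unreasonable.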

I'll admit, I don't understand the argument that P1839 seems to hinge
on, namely that even reading the first byte of the object
representation is UB:

> When a is dereferenced, the behaviour is undefined as per [expr.pre]
> p4 because the value of the resulting expression would not be the
> value of the first byte, but the value of the whole int object
> (123456), which is not a value representable by unsigned char.

This interpretation appears to defeat the entire purpose of
the first sentence in [basic.lval]/11, which I will repeat here for
reference:

> If a program attempts to access (3.1) the stored value of an object
> through a glvalue whose type is not similar (7.3.6) to one of the
> following types the behavior is undefined:
> - the dynamic type of the object,
> - a type that is the signed or unsigned type corresponding to the
> dynamic type of the object, or
> - a char, unsigned char, or std::byte type.

I have always imagined this as implying a series of steps for
performing a read where the type of the glvalue is not similar to the
dynamic type of the object:
1. Locate the object referred to by the glvalue.
2. Select the appropriate bytes in the object representation.
3. Interpret those bytes as a value of the glvalue's type.
4. Return the resulting value.
(The reverse process would occur for a modification.)
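
The four steps above can be sketched with std::memcpy, which sidesteps the question of what a dereference through a differently-typed glvalue yields. This is only a model of how I picture the read, under the assumption that both types are trivially copyable; read_as and its offset parameter are hypothetical names, not anything from the standard or from P1839.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical helper modelling a read of `object` as a value of type T.
// Precondition (not checked): offset + sizeof(T) <= sizeof(Source).
template <class T, class Source>
T read_as(const Source& object, std::size_t offset = 0) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "T must be trivially copyable");
    static_assert(std::is_trivially_copyable<Source>::value,
                  "Source must be trivially copyable");

    // Step 1: `object` is the object referred to by the glvalue.
    // Step 2: obtain its object representation and select bytes from it.
    unsigned char bytes[sizeof(Source)];
    std::memcpy(bytes, &object, sizeof(Source));

    // Step 3: interpret the selected bytes as a value of type T.
    T result;
    std::memcpy(&result, bytes + offset, sizeof(T));

    // Step 4: return the resulting value.
    return result;
}
```

With int x = 123456, read_as<unsigned char>(x) yields the first byte of the object representation of x rather than the value 123456, which is the result I would expect from the char-type bullet of [basic.lval]/11.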

Indeed, [basic.lval]/11 originates from an analogous clause in
standard C, which at another point explicitly clarifies the supremacy
of the lvalue's type: "The meaning of a value stored in an object or
returned by a function is determined by the *type* of the expression
used to access it."

But C++ isn't so clear about the result when an object is
reinterpreted as another type using [basic.lval]/11. Apart from
[basic.lval]/11 itself, the most relevant language I could find is in
[conv.lval]/3:

> The result of the conversion is determined according to the
> following rules:
> [...]
> - Otherwise, the object indicated by the glvalue is read (3.1), and
> the value contained in the object is the prvalue result.

"The object indicated by the glvalue" is surely the object that the
glvalue refers to, but "the value contained in the object" is somewhat
ambiguous, especially since the clause references no mechanism for
converting to the glvalue's type. Is "the value contained" exactly the
value of the object in its dynamic type? Or is "the value contained"
the value resulting from interpreting the object representation as a
value of the glvalue's type? P1839 briefly assumes the former, but I
don't see how that interpretation can be squared with the purpose of
[basic.lval]/11.

(If we were to make the second interpretation explicit in
[conv.lval]/3 and [expr.ass]/2, it would obviate the first problem
brought up in the paper. The problem of allowing pointer arithmetic
with an unsigned char* pointer to a general object would remain, but
the paper's proposal for it seems quite ugly to me; in my view, it
would be more cleanly solved by introducing a new kind of
pointer, a "pointer into the middle of an object", which would be
created by pointer arithmetic with a byte type, unusable for anything
but pointer arithmetic and accesses with byte types, and only able to
range through the original object's containing array. If such a
pointer is offset to line back up with the original object or another
array element, it would become an ordinary pointer-to-object again.
This would appear to avoid all the issues with pointers flip-flopping
between ordinary objects and the special "object representation"
objects proposed by the paper. Are there any glaring issues with this
sort of approach that I'm not thinking of?)
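
To make the "pointer into the middle of an object" idea a bit more tangible, here is a rough library-level model of the semantics I have in mind. Everything in it is hypothetical: mid_object_ptr, as_object, and the choice to throw on misuse (which in actual wording would simply be UB) are my own illustration, and it tracks a single object rather than a containing array. Byte reads go through std::memcpy so that the sketch does not itself depend on the arithmetic being debated.

```cpp
#include <cstddef>
#include <cstring>
#include <stdexcept>

// Hypothetical model of a "pointer into the middle of an object":
// it remembers the original object and a byte offset into it.
template <class T>
class mid_object_ptr {
public:
    explicit mid_object_ptr(const T& object)
        : origin_(&object), offset_(0) {}

    // Pointer arithmetic in units of bytes, confined to [0, sizeof(T)].
    mid_object_ptr& operator+=(std::ptrdiff_t n) {
        const std::ptrdiff_t next =
            static_cast<std::ptrdiff_t>(offset_) + n;
        if (next < 0 || next > static_cast<std::ptrdiff_t>(sizeof(T)))
            throw std::out_of_range("offset leaves the original object");
        offset_ = static_cast<std::size_t>(next);
        return *this;
    }

    // Byte-typed access: yields the byte at the current offset.
    unsigned char operator*() const {
        if (offset_ >= sizeof(T))
            throw std::out_of_range("cannot read one past the end");
        unsigned char bytes[sizeof(T)];
        std::memcpy(bytes, origin_, sizeof(T));
        return bytes[offset_];
    }

    // Once the offset lines back up with the object, an ordinary
    // pointer-to-object is available again; otherwise it is not.
    const T* as_object() const {
        return offset_ == 0 ? origin_ : nullptr;
    }

private:
    const T* origin_;
    std::size_t offset_;
};
```

The point of the model is that no separate "object representation" object ever appears: the pointer always refers to the original object, plus an offset.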

Received on 2023-06-20 02:11:36