std-proposals

Re: [std-proposals] is_trivially_copyable_in_reality

From: Frederick Virchanza Gotham <cauldwell.thomas_at_[hidden]>
Date: Thu, 29 Jan 2026 12:30:05 +0000
On Mon, Jan 26, 2026 at 3:47 PM Thiago Macieira wrote:
>
> Because the problem wasn't vectors. It was std::any and QVariant, which type-
> erase the copying in the first place. A trivially copyable type must be
> memcpyable, as I'm sure you'll agree. Note it's a sufficient condition, not a
> necessary one.
>
> Re-signing the vptr isn't a memcpy. Therefore, that type shouldn't be marked
> trivially copyable.


Okay you've totally lost me here, I'm confused, and so I'm going back
to first principles. I'm working with the GNU g++ compiler on x86_64
Linux with maximal optimisation (-O3).

First let's start off with a nice big chunky type. Let's go with 2
megabytes aligned to 2 megabytes:

    struct alignas(0x200000) S {
      void *pointers[0x200000 / sizeof(void*)];
    };

Now let's write a function that makes a copy of S:

    #include <cstddef>
    #include <cstring>
    using namespace std;

    constexpr size_t Ssz = 0x200000;

    struct alignas(Ssz) S {
      void *pointers[Ssz / sizeof(void*)];
    };

    void CopyA(void *const dst, S const *const src)
    {
      memcpy( dst, src, sizeof(S) );
    }

GodBolt: https://godbolt.org/z/ezcEnnnGz

You'll see that the assembler for "CopyA" is two instructions as follows:

      mov $0x200000,%edx
      jmp memcpy

When 'CopyA' is entered, the destination is already in RDI. The source
is already in RSI. So it puts the size, i.e. 2 megabytes, in RDX. Then
it jumps to memcpy. It's a jump rather than a call for two reasons:
(1) Nothing more happens in CopyA after memcpy returns
(2) CopyA returns void, so memcpy's return value can simply be ignored
So the compiler has optimised the 'call' into a 'jmp'.

Now let's re-write 'CopyA' as follows:

    void CopyB(void *const dst, S const *const src)
    {
        ::new(dst) S(*src);
    }

I expect this to give the exact same assembler. GodBolt:
https://godbolt.org/z/4jKsYPGq1

But here's what we see:

  sub $0x8,%rsp
  mov $0x200000,%edx
  call memcpy
  add $0x8,%rsp
  ret

Oddly, this is a tiny bit less efficient. The stack pointer is moved
by 8 bytes even though nothing is stored on the stack (the adjustment
only keeps the stack 16-byte aligned at the 'call', as the x86_64 ABI
requires), the invocation of memcpy is a less-efficient 'call' instead
of a 'jmp', and then the stack pointer is restored and the function
returns. I'm actually surprised that the GNU g++ compiler didn't do
this a little better. But anyway, the difference between CopyA and
CopyB is tiny.

I will never argue that memcpy is better than a copy-constructor when
dealing with just one object. (Even though I've just proven on one
compiler that memcpy is a tiny bit more efficient). So forget about
single objects.

Let's move on to multiple objects. I'm talking about arrays of objects
and containers full of objects, so think std::array<int,12345> and
std::vector<int>. Here's some code, note that CopyA and CopyB now have
an extra parameter for the count of objects:

    #include <cstddef>
    #include <cstring>
    #include <new>
    using namespace std;

    constexpr size_t Ssz = 0x200000;

    struct alignas(Ssz) S {
      void *pointers[Ssz / sizeof(void*)];
    };

    void CopyA(void *const dst, S const *const src, size_t const count)
    {
      memcpy(dst, src, count * sizeof(S));
    }

    void CopyB(void *const dst, S const *const src, size_t const count)
    {
      for ( size_t i = 0u; i < count; ++i ) ::new((S*)dst + i) S( src[i] );
    }

GodBolt: https://godbolt.org/z/xqr6813q7

Let's look at how the assembler comes out. Here's CopyA:

 shl $0x15,%rdx
 jmp memcpy

The first instruction shifts the count left 21 places (to convert to bytes).
Second instruction jumps to memcpy.
No surprises here.
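
The shift works because sizeof(S) is exactly 2^21 bytes, so multiplying
by it is the same as shifting left by 21. A quick compile-time check of
that arithmetic, as a sketch ('bytes_for' is a name made up purely for
illustration):

```cpp
#include <cstddef>

constexpr std::size_t Ssz = 0x200000;   // 2 megabytes, as in the struct above

struct alignas(Ssz) S {
    void *pointers[Ssz / sizeof(void*)];
};

// Hypothetical helper: the byte count for 'count' objects, computed the
// way the compiled CopyA does it (a shift instead of a multiplication).
constexpr std::size_t bytes_for(std::size_t const count) { return count << 21; }

static_assert(sizeof(S) == (std::size_t{1} << 21), "S is exactly 2^21 bytes");
static_assert(bytes_for(3) == 3 * sizeof(S), "shift equals multiplication");
```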

Now let's look at CopyB:

 test %rdx,%rdx
 je 60 <CopyB(void*, S const*, unsigned long)+0x50>
 push %r13
 shl $0x15,%rdx
 mov %rsi,%r13
 push %r12
 mov %rdi,%r12
 push %rbp
 mov %rdx,%rbp
 push %rbx
 xor %ebx,%ebx
 sub $0x8,%rsp
 xchg %ax,%ax
 lea (%r12,%rbx,1),%rdi
 lea 0x0(%r13,%rbx,1),%rsi
 mov $0x200000,%edx
 add $0x200000,%rbx
 call 4a <CopyB(void*, S const*, unsigned long)+0x3a>
    R_X86_64_PLT32 memcpy-0x4
 cmp %rbx,%rbp
 jne 30 <CopyB(void*, S const*, unsigned long)+0x20>
 add $0x8,%rsp
 pop %rbx
 pop %rbp
 pop %r12
 pop %r13
 ret

The above is more complicated, and we see that memcpy is called
several times in a loop. I tried re-writing it to use pointers instead
of an integer index:

    void CopyB(void *const dstV, S const *src, size_t const count)
    {
      S *dst = (S*)dstV;
      S const *const dst_over = dst + count;
      while ( dst != dst_over ) ::new(dst++) S(*src++);
    }

But it comes out pretty much the same.

So we can see that calling a copy-constructor in a loop isn't
efficient. Let's try moving instead, so we change:

    ::new(dst++) S( *src++ );

to:

    ::new(dst++) S( (S&&)*src++ );

I made this change but it has no effect on the assembler.
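
For comparison, the standard library already has an algorithm for this
loop: std::uninitialized_copy_n. Implementations commonly turn it into
a single memmove when the element type is trivially copyable, though
that is a quality-of-implementation matter and has not been checked
against the GodBolt links above. A minimal sketch, using a small
stand-in type 'Small' (hypothetical) instead of the 2-megabyte S so
that it is cheap to run; 'CopyD' is likewise a made-up name:

```cpp
#include <cstddef>
#include <memory>
#include <type_traits>

// Hypothetical stand-in for S: same idea, but only 4 pointers wide.
struct Small {
    void *pointers[4];
};
static_assert(std::is_trivially_copyable_v<Small>);

// Copy-construct 'count' objects into the raw storage at 'dst'.
void CopyD(void *const dst, Small const *const src, std::size_t const count)
{
    std::uninitialized_copy_n(src, count, static_cast<Small*>(dst));
}
```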

Let's see if we can trust the compiler to do a better job . . . I'm
going to use 'std::array' as follows:

    void CopyC(void *const dst, array<S,16384u> &src )
    {
      ::new(dst) remove_reference_t<decltype(src)>(src);
    }

The assembler for 'CopyC' comes out as:

    movabs $0x800000000,%rdx
    sub $0x8,%rsp
    call 83 <CopyC(void*, std::array<S, 16384ul>&)+0x13>
    R_X86_64_PLT32 memcpy-0x4
    add $0x8,%rsp
    ret

This is pretty good -- I think the implementation of 'std::array'
checks the trait 'is_trivially_copyable' (or perhaps
'is_trivially_copy_constructible') and so knows to use 'memcpy'.
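
The traits mentioned here can be queried directly. The following sketch
only inspects types and never creates an object, so nothing this large
is ever allocated; the alias 'Arr' is made up for brevity:

```cpp
#include <array>
#include <cstddef>
#include <type_traits>

constexpr std::size_t Ssz = 0x200000;

struct alignas(Ssz) S {
    void *pointers[Ssz / sizeof(void*)];
};

using Arr = std::array<S, 16384u>;

// Both the element type and the array of them are trivially copyable,
// so a compiler is allowed to copy either one with a plain memcpy.
static_assert(std::is_trivially_copyable_v<S>);
static_assert(std::is_trivially_copy_constructible_v<S>);
static_assert(std::is_trivially_copyable_v<Arr>);
static_assert(std::is_trivially_copy_constructible_v<Arr>);

// 16384 * 2 megabytes = 2^35 bytes, the 0x800000000 seen in the assembler.
// (Assumes a 64-bit size_t and that std::array adds no padding, which
// holds on the mainstream implementations.)
static_assert(sizeof(Arr) == 0x800000000ull);
```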

But let's tweak our class S as follows:

    struct alignas(Ssz) S final {
        virtual ~S(void) noexcept = default;
        void *pointers[Ssz / sizeof(void*) - 1u];
    };

I've made three changes:
(1) I've made it final -- so we know that we always have a complete
object (i.e. most-derived object)
(2) I've added a virtual destructor to make it polymorphic (adding 8
bytes to the object size)
(3) I've decreased the array size by 8 bytes to keep the total object
size at 2 megabytes

So let's see what happens to the assembler for 'CopyC' after making this change:

 movabs $0x800000000,%rdx
 push %r13
 mov %rdi,%rax
 push %r12
 add %rdi,%rdx
 mov %rsi,%r12
 push %rbp
 mov %rdi,%rbp
 push %rbx
 sub $0x8,%rsp
 movq $0x0,(%rax)
    R_X86_64_32S vtable for S+0x10
 add $0x400000,%rax
 movq $0x0,-0x200000(%rax)
    R_X86_64_32S vtable for S+0x10
 cmp %rax,%rdx
 jne e0 <CopyC(void*, std::array<S, 16384ul>&)+0x20>
 movabs $0x800000000,%r13
 xor %ebx,%ebx
 nopl 0x0(%rax)
 lea 0x8(%rbp,%rbx,1),%rdi
 lea 0x8(%r12,%rbx,1),%rsi
 mov $0x1ffff8,%edx
 add $0x200000,%rbx
 call 12b <CopyC(void*, std::array<S, 16384ul>&)+0x6b>
    R_X86_64_PLT32 memcpy-0x4
 cmp %r13,%rbx
 jne 110 <CopyC(void*, std::array<S, 16384ul>&)+0x50>
 add $0x8,%rsp
 pop %rbx
 pop %rbp
 pop %r12
 pop %r13
 ret

So we can see here that we've lost all the efficiency because we made
S polymorphic -- which has the knock-on effect of making
'is_trivially_copy_constructible' evaluate to false, and that's why
'std::array' no longer uses memcpy. This of course is totally
unnecessary as S is marked final, and so it's safe to memcpy the
vptr's and vbptr's . . . . . . . at least until Apple Silicon came
along.
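
The effect on the traits can be checked at compile time. A sketch with
the tweaked S from above, again inspecting the type only:

```cpp
#include <cstddef>
#include <type_traits>

constexpr std::size_t Ssz = 0x200000;

struct alignas(Ssz) S final {
    virtual ~S() noexcept = default;
    void *pointers[Ssz / sizeof(void*) - 1u];
};

static_assert(std::is_polymorphic_v<S>);
static_assert(std::is_final_v<S>);

// The virtual destructor makes the copy constructor non-trivial, and the
// class is no longer trivially copyable, even though it is final.
static_assert(!std::is_trivially_copy_constructible_v<S>);
static_assert(!std::is_trivially_copyable_v<S>);
```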

Oh by the way, and this is something I only realised yesterday, the
re-signing of vptr's and vbptr's doesn't just happen with polymorphic
objects -- it also happens with non-polymorphic objects which have
polymorphic sub-objects -- just something to keep in mind.
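
A small sketch of that case, with made-up types 'Poly' and 'Outer':
the outer class is not polymorphic itself, yet it still isn't
trivially copyable, because copying it has to copy the polymorphic
member.

```cpp
#include <type_traits>

struct Poly final {
    virtual ~Poly() = default;
    int value = 0;
};

// Outer has no virtual functions of its own, only a polymorphic member.
struct Outer {
    Poly inner;
    int  extra = 0;
};

static_assert( std::is_polymorphic_v<Poly>);
static_assert(!std::is_polymorphic_v<Outer>);

// Copying Outer must run Poly's copy constructor, so Outer is not
// trivially copy constructible either.
static_assert(!std::is_trivially_copy_constructible_v<Outer>);
static_assert(!std::is_trivially_copyable_v<Outer>);
```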

So if we were to change the trait "is_trivially_copy_constructible" so
that it has an extra template parameter as follows:

    template< class T, bool guaranteed_complete_object = false >
    struct is_trivially_copy_constructible;

Then it could be 'true' for:
    a) polymorphic classes that are marked final
    b) any individual object which is guaranteed to be the most-derived object
    (.......so long as we're not on Apple Silicon)

This change to "is_trivially_copy_constructible" would not break existing
code, as old code would still use 'is_trivially_copy_constructible<T>'
which is the same as 'is_trivially_copy_constructible<T,false>'. Of
course on Apple Silicon, we would get 'false' wherever there's a vptr
or vbptr -- so the Standard would make this implementation-defined.
There's no sense in 99% of C++ implementations suffering a performance
penalty because of 1% of implementations. Apple Silicon doesn't get to
ruin all the fun for the rest of us.
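
To make the idea concrete, here is a user-level approximation of such a
trait. It is only a sketch under loud assumptions: every name in it
('is_trivially_copy_constructible_ex', 'copy_is_memberwise',
'vptrs_need_resigning') is invented for illustration, the platform
switch is hard-coded, and because a library cannot inspect what a
polymorphic class's copy constructor does, the class author has to opt
in by hand. A real version would need compiler support.

```cpp
#include <type_traits>

// Hypothetical platform switch: true where vptr's are signed with the
// object's address, false elsewhere. Hard-coded for this sketch.
inline constexpr bool vptrs_need_resigning = false;

// Hypothetical opt-in: the class author vouches that the copy constructor
// does nothing beyond copying the members and setting the vptr.
template<class T>
inline constexpr bool copy_is_memberwise = false;

template<class T, bool guaranteed_complete_object = false>
struct is_trivially_copy_constructible_ex
  : std::bool_constant<
        std::is_trivially_copy_constructible_v<T>
     || ( !vptrs_need_resigning
          && copy_is_memberwise<T>
          && (std::is_final_v<T> || guaranteed_complete_object) ) > {};

struct Base       { virtual ~Base() = default; int a = 0; };
struct Leaf final { virtual ~Leaf() = default; int b = 0; };

template<> inline constexpr bool copy_is_memberwise<Base> = true;
template<> inline constexpr bool copy_is_memberwise<Leaf> = true;

static_assert( is_trivially_copy_constructible_ex<int>::value);
static_assert(!is_trivially_copy_constructible_ex<Base>::value);       // could be a base sub-object
static_assert( is_trivially_copy_constructible_ex<Base, true>::value); // caller guarantees most-derived
static_assert( is_trivially_copy_constructible_ex<Leaf>::value);       // final: always most-derived
```

Old code that names only one template argument gets the 'false' default
and therefore sees the conservative answer for 'Base'.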

Relocation is very very similar to copying or moving a container of
objects -- the only real difference being that you don't have to
destroy the source objects after moving. The trait for
trivial_relocation could be written similarly:

    template< class T, bool guaranteed_complete_object = false >
    struct is_trivially_relocatable;

You might argue that the 'guaranteed_complete_object' here is
redundant as nobody would ever try to relocate a non-complete object,
but hey I'm just keeping all doors open.

To get back to your above post, Thiago, you mentioned std::any and
QVariant. You say they type-erase copying but it's really easy to
write 'std::any' in a way that it remembers the type:

    #include <typeinfo>
    #include <type_traits>
    #include <utility>
    using namespace std;

    class any {
        // Instantiated once per stored type T, so the type is remembered
        // behind the function pointer 'fp'.
        template<typename T>
        static void *Manage(int const operation, void *const param0, void *const param1)
        {
            if ( 0 == operation ) return (void*)&typeid(T);  // get the type_info
            if ( 1 == operation ) { delete (T*)param0; return nullptr; }  // delete the object
            unreachable();  // C++23
        }

        void *p = nullptr;
        void *(*fp)(int,void*,void*) = nullptr;

    public:
        template<typename Tref>
        any(Tref &&arg)
        {
            typedef decay_t<Tref> T;  // strip reference and cv so 'new T' yields a plain T*
            p = new T( forward<Tref>(arg) );
            fp = &Manage<T>;
        }

        template<typename Tref>
        any &operator=(Tref &&arg)
        {
            typedef decay_t<Tref> T;
            fp(1, p, nullptr); // delete the current object
            p = new T( forward<Tref>(arg) );
            fp = &Manage<T>;
            return *this;
        }

        type_info const &type(void) const noexcept
        {
            return *(type_info const*)fp(0, nullptr, nullptr); // get the type_info
        }
    };

Unless I misunderstand what you mean by type-erasing the copy operation. . . ?
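
One possible reading of "type-erase the copying" is that the copy
itself goes through the stored function pointer, so the code that
copies the holder never sees T and cannot consult a trait for it at
that point. A minimal sketch of that reading, with a copy operation
added to a Manage-style function. The names 'Holder' and 'get' are
made up, and this is not a claim about how any particular std::any or
QVariant is implemented:

```cpp
#include <type_traits>
#include <utility>

class Holder {
    // op 1: destroy *self; op 2: return a heap-allocated copy of *self.
    template<typename T>
    static void *Manage(int const op, void *const self)
    {
        if ( 1 == op ) { delete static_cast<T*>(self); return nullptr; }
        if ( 2 == op ) return new T( *static_cast<T const*>(self) );
        return nullptr;
    }

    void *p = nullptr;
    void *(*fp)(int, void*) = nullptr;

public:
    // Disabled for Holder itself so that copying picks the copy constructor.
    template<typename Tref,
             typename = std::enable_if_t<!std::is_same_v<std::decay_t<Tref>, Holder>>>
    explicit Holder(Tref &&arg)
      : p( new std::decay_t<Tref>( std::forward<Tref>(arg) ) ),
        fp( &Manage<std::decay_t<Tref>> ) {}

    // The copy is dispatched through fp: T is unknown at this point.
    Holder(Holder const &other)
      : p( other.fp(2, other.p) ), fp( other.fp ) {}

    Holder &operator=(Holder const &) = delete;

    ~Holder() { fp(1, p); }

    // Unchecked access, for demonstration only.
    template<typename T> T &get() { return *static_cast<T*>(p); }
};
```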

But anyway if we look at 'std::any', and 'QVariant', there are two
things to note:
1) Neither class is polymorphic
2) Neither class contains a polymorphic sub-object

So when you copy or move an 'any' or a 'QVariant', you're only moving
around pointers, so I don't see why 'restart_lifetime' was needed
specifically for these types -- reason being that there are no vptr's
or vbptr's to re-sign.

Oh wait . . . . . just now as I write this, I've checked the Qt
manual, and it turns out that QVariant doesn't use dynamic allocation,
i.e. the object is stored internally as a sub-object, and so if that
sub-object is polymorphic, and if you move or relocate the QVariant,
then you'll have to re-sign the sub-object's vptr's and vbptr's. Same
goes for 'std::variant'. I'm wondering now Thiago if you meant to
write 'std::variant' instead of 'std::any' in your above post?
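
For 'std::variant' this shows up in the traits. A short sketch with a
made-up polymorphic type 'Poly':

```cpp
#include <type_traits>
#include <variant>

struct Poly final {
    virtual ~Poly() = default;
    int value = 0;
};

// All alternatives trivially copy constructible: so is the variant.
static_assert( std::is_trivially_copy_constructible_v<std::variant<int, double>>);

// One polymorphic alternative stored in-place: the variant has to run
// Poly's copy constructor, so it is not trivially copy constructible.
static_assert(!std::is_trivially_copy_constructible_v<std::variant<int, Poly>>);
```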

So anyway where am I going with all of this? Well in the C++
programming language we already have the copying of objects and the
moving of objects. Before we bring in the relocation of objects, I
think we should nail down copying and moving. Specifically here's what
I think we need to do with copying and moving:
    Point 1) Allow the memcpy'ing of final polymorphic classes and
also guaranteed complete objects
    Point 2) Decide what to do about Apple Silicon

To solve the Apple Silicon conundrum, there are two ways of going about it:

    Strategy 1) Have 'is_trivially_copy_constructible' be false for
all polymorphics on Apple Silicon
    Strategy 2) Have 'is_trivially_copy_constructible' be true for
some polymorphics on Apple Silicon but also provide a few 'lifetime'
functions

With regard to these 'lifetime' functions, here's how it could potentially work:
    Trivially copying is followed by copy_lifetime
    Trivially moving is followed by move_lifetime
    Trivially relocating is followed by restart_lifetime (or relocate_lifetime)

Strategy No. 2 of course would result in a bending of the English
language whereby 'trivial' wouldn't really mean 'trivial' anymore; its
meaning would change to "You can memcpy this type but on Apple Silicon
you'll have to follow it up with a lifetime function".
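
At a call site, Strategy 2 could look roughly like the sketch below.
The signature of 'copy_lifetime' is invented here, and its body is a
do-nothing placeholder, which is what it would amount to on a platform
that doesn't sign vptr's. 'trivially_copy_n' is also a made-up name.
The sketch is meant to be exercised with trivially copyable types only,
so that it stays well-defined under today's rules.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical hook. On a platform with signed vptr's, this is where the
// vptr's of the freshly copied object at 'dst' would be re-signed.
template<class T>
T *copy_lifetime(T *const dst, T const *const /*src*/) noexcept
{
    return dst;   // placeholder: nothing to fix up
}

// "Trivially copy" 'count' objects: one memcpy, then one fix-up per object.
template<class T>
void trivially_copy_n(T *const dst, T const *const src, std::size_t const count)
{
    std::memcpy(dst, src, count * sizeof(T));
    for ( std::size_t i = 0u; i < count; ++i ) copy_lifetime(dst + i, src + i);
}
```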

Received on 2026-01-29 12:30:20