Date: Sat, 24 Jan 2026 19:13:45 +0000
On Saturday, January 24, 2026, Arthur O'Dwyer wrote:
>
>
> You seem to be reinventing the idea of a "copy constructor."
>
I do see your line of reasoning here, but let me get to the optimisation
part.
Let's say we have a 3D graphics class, something like:
    struct Sprite : Entity {
        long double GetVolume(void) override;
        long long points[1024][1024];
    };
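(For the example to compile, 'Entity' would need to declare GetVolume as
virtual. Entity isn't shown anywhere in this thread, so the following is just
an assumed minimal base class, plus a size check, for illustration:)

    // Assumed for illustration only -- 'Entity' is not defined in the original post.
    struct Entity {
        virtual long double GetVolume(void) = 0;   // 'Sprite::GetVolume' overrides this
        virtual ~Entity(void) = default;
    };

    // On typical platforms where 'long long' is 8 bytes, the points array
    // alone is 8 MiB, dwarfing the single 8-byte vtable pointer:
    static_assert( sizeof(long long[1024][1024]) == 8u * 1024u * 1024u );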
You'll agree here that the data part of this class is vastly bigger than the
potentially "not memcpyable" part of the class (i.e. 8 megabytes of points
versus the 8-byte vtable pointer).
Now let's say we have a container of 6 million of these sprites, and we want
to copy the whole container. With 'p' pointing at the first source element and
'q' pointing at uninitialised destination storage (both of type 'T *', where T
is Sprite), we can copy from the old container to the new one element by
element:

    for ( unsigned long i = 0u; i != 6000000u; ++i )
        memcpy( q + i, p + i, sizeof(T) );
The above code is essentially calling a copy-constructor in a loop. But
instead of invoking the copy constructor 6 million times, we can do a "bulk
copy" as follows:
    memcpy( q, p, sizeof(T) * 6000000u );
Obviously the latter is a lot better (i.e. one call to memcpy instead of 6
million). But even on arm64e, we can compare the two code snippets. On
arm64e, the first snippet becomes:

    for ( unsigned long i = 0u; i != 6000000u; ++i )
    {
        memcpy( q + i, p + i, sizeof(T) );
        copy_lifetime( q + i, p + i );
    }

And the second snippet becomes:

    memcpy( q, p, sizeof(T) * 6000000u );
    for ( unsigned long i = 0u; i != 6000000u; ++i )
        copy_lifetime( q + i, p + i );
Now you might argue that there isn't much benefit to the second snippet as
it must iterate over each element individually -- but the thing is that
when it does so, all it does is set one measly pointer. That's a lot less
CPU trudgery than individually copying each Sprite one by one.
So the optimised copy-constructor for 'vector' would become something like:
    // Inside the definition of 'vector<T>' -- note the 'true' argument to the
    // trait, indicating that every element is a guaranteed complete object:
    vector(vector const &rhs)
        requires is_trivially_copy_constructible<T, true>::value
    {
        count = rhs.count;
        p = (T *) new char unsigned[ count * sizeof(T) ];
        memcpy( p, rhs.p, count * sizeof(T) );
        for ( unsigned long i = 0u; i != count; ++i )
            copy_lifetime( p + i, rhs.p + i );
    }
That "for loop" will be optimised away to a no-op on every machine except
for arm64e.
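To see why, here's one hedged sketch of how 'copy_lifetime' could look on a
conventional ABI -- purely illustrative, since the thread hasn't pinned down
its exact signature:

    // Illustrative assumption only: 'copy_lifetime' is the function being
    // proposed in this thread, not an existing API.
    template<typename T>
    inline void copy_lifetime(T *dst, T const *src) noexcept
    {
        // On a conventional ABI the bytes already written by memcpy form a
        // valid object representation, so there is nothing left to do, and
        // the surrounding loop inlines away to nothing.
        (void)dst;
        (void)src;
        // An arm64e-style implementation would instead re-sign the copied
        // vtable pointer against the address 'dst' -- i.e. "set one measly
        // pointer" per element.
    }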
This won't break old code, even though the old code doesn't call
copy_lifetime. This is because the old code will have:

    is_trivially_copy_constructible<T>

instead of:

    is_trivially_copy_constructible<T, true>

and so the old code won't try to use memcpy for polymorphic objects (i.e. the
old code will call the copy-constructor in a loop).
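For what it's worth, a sketch of that unchanged fallback path (my own
illustration, not something quoted from this thread) is just the ordinary
element-wise copy constructor:

    // Assumed fallback for illustration: types that don't satisfy the new
    // two-argument trait keep being copied element by element, exactly as today.
    vector(vector const &rhs)
        requires ( !is_trivially_copy_constructible<T, true>::value )
    {
        count = rhs.count;
        p = (T *) new char unsigned[ count * sizeof(T) ];
        for ( unsigned long i = 0u; i != count; ++i )
            ::new( static_cast<void *>(p + i) ) T( rhs.p[i] );   // copy-construct each element
    }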
