ISOCPP Std Discussion List: Evolution of allocator for big mem chunks (on Linux)

From: Maciej Polanski <maciejpolanski75_at_[hidden]>
Date: Tue, 2 Aug 2022 19:07:47 +0200

Hello,

I’m reaching out to you about a potential reform of the allocator, which
I certainly won't do myself. However, I decided to toss the ideas out
here to not be lost.

1. Idea

Currently `std::allocator` have single `allocate()` call, that is used
for all allocation types.

I, on the other hand, would like to point out the possible benefits of
using specialized calls depending on the expected memory usage.
The example below is for Linux and std::vector, but similar
optimizations may be possible for other containers, systems or operations.

So, ideally, in the future std::vector should make different allocator
calls depending on the operation it performs.

2. Example times

The test consisted of filling, for the first time and on Mint Linux, a
new reserved std::vector with different allocators. Size >=128KB, so the
allocation is passed to `mmap()`.
Indeed, I used an array of vectors for better precision. Thus the test
involved the allocation of ~gigabytes of RAM.

My `allocPopulate` allocator enforces full, partial and none (for
comparison) allocation of physical memory frames by `mmap()`.

+--Data size: 4882 MB, 5000 elements of size 10240 in 100 vectors--+
Default allocator: 1765 ms
allocPopulate popFull: 850 ms
allocPopulate popHalf: 1276 ms
allocPopulate popNone: 1745 ms
+---+

This speed increase is so large only since some Mint kernel update.
Before that, it was much smaller, around 20%. Something most likely
could have been incorrectly changed under the kernel's hood.

On my github you can find two result files: debian.txt and mint.txt.

3. Variation of results

One important thing to mention - performance strongly depends on the
platform, code and a million other factors that are difficult to predict.
For example, on Debian the improvement is much smaller, about 16%, as it
used to be on Mint.

Surprisingly, on Debian `popNone`, the total lack of pre-population, is
not necessarily the slowest! A look at the small number of page errors
suggests that the kernel was using, for reasons unknown to me, a bulk
allocation of about 80KB (20 frames) each.

Thus, such a large number of factors affecting performance makes it
difficult to predict or guarantee improvements in any particular case.
So I don't guarantee any improvement, I'm just trying to show the
potential direction of memory allocator development for C ++.

4. Short on libc allocation

Within the linear C++ memory model, there is a reasonable expectation
that all variables are placed in some pre-allocated continuous area of
memory, and indeed this is generally how it works. However, in the Linux
libc implementation that I am familiar with, above a certain threshold,
any allocation is performed directly by the `mmap()` system call.
Comments in glibc describe this as an attempt to avoid dangerous levels
of memory fragmentation.

So my allocator affected only those large, `mmap()` allocations.

Traditionally, the threshold was usually around 128KB. Recently, the
underlying libc handles the threshold in a more sophisticated way, so I
set it with `mallopt(M_MMAP_THRESHOLD, 128KB);` to make the default and
my tests comparable.

5. Short on Linux optimistic memory allocation

Steps to allocate memory chunk on Linux is that std::allocator uses
libc, that uses `mmap()` system call to allocate memory. But as result
it receives only address space, with no physical memory allocation
behind it!
This system is sometimes called "optimistic memory allocation", as it
assumes that memory will be available only when it is needed.

After such unexisting memory is accessed, a "minor fault" is raised and
system assigns memory to process. Such an approach allows to save some
memory and, initially at least, had a really small footprint on program
speed. It happened only once per 4KB page and only page table operations
were needed.

But since then security become a real concern, so whole memory page has
to be zeroed before assigning to a process. This takes time and can be
faster in bulk mode. So this, in my opinion, can be a reason that using
a physically pre-allocated memory can be faster than using default
optimistic allocation. Another reason may be fewer minor faults and
related system calls.

6. Implications

So if the intention is to make full use of a large vector, calling
`reserve()` won't do much, the first use will be delayed anyway. But
using the `mmap(...MAP_POPULATE...)` allocator, forcing immediate
physical allocation of all requested memory, can noticeably speed up the
vector.

But what if memory is allocated for a new vector without using
`reserve()`, when it is not certain that all of it will be consumed?
Since the allocator does not know the reason for the memory allocation,
we risk wasting RAM by forcing unnecessary physical allocations and
reducing the memory pool available to the system.

It seems to me that there should be three versions of allocation for a
vector:
- no pre-population for ordinary initialization
- half pre-population for reallocation, because then it is certain that
at least half of the memory will be used
- full pre-population, when the `reserve()` call declares the use of the
entire space

In addition, it would be nice if an allocator could be defined with only
one, default, allocation, and some intermediate layer could map it for
the other cases.

So, ideally, std::vector should make different allocator calls depending
on the operation.

7.Conclusion

In conclusion, I hope this example can suggest to someone some ideas for
potential improvements in memory management. The idea of an allocator
having multiple calls for different applications seems a long way off to
me, but, well, it has been suggested now.
I think then even more operations, such as moving data, could also be
moved to the allocator, which could be useful, for example, for
"trivially movable" classes.

8. Source code

Project: https://github.com/MaciejPolanski/v

Allocator itself: `mm_alloc.h`, example usage:

using allocPopulateFull = mm::allocPopulate<T,
mm::allocPopulate_base::popFull>;
using allocPopulateHalf = mm::allocPopulate<T,
mm::allocPopulate_base::popHalf>;
using allocPopulateNone = mm::allocPopulate<T,
mm::allocPopulate_base::popNone>;

A benchmark/test is `main.cpp`

Two example result files: debian.txt and mint.txt.

-- 
Regards,
Maciej Polanski
MaciejPolanski75_at_[hidden]

Received on 2022-08-02 17:07:51

std-discussion