std-proposals: Re: Slim mutexes and locks based on C++20 std::atomic::wait

From: Marko Mäkelä <marko.makela_at_[hidden]>
Date: Tue, 28 Sep 2021 21:58:55 +0300

Mon, Aug 30, 2021 at 10:24:02AM -0700, Thiago Macieira wrote:
>On Monday, 30 August 2021 10:07:41 PDT Ville Voutilainen wrote:
>>Sure, but there's still a difference between "it's possible that it
>>just isn't small" and "we know it's not small and can't be made
>>small". :) The former is indeed an impossible thing for the standard
>>to guarantee, but at least it allows, for some platforms, perhaps
>>many, to avoid the problem of the latter, whereas rainy wishes of
>>ABI-breaking size shrinkage in std::mutex are not realistic.
>
>I understand. And I'll defer to your opinion: you're a maintainer in
>libstdc++, so this thing is likely to come to you and your
>co-maintainers there. If you're comfortable with the complexity, who am
>I to argue?
>
>Just note I do maintain such a thing (QMutex) and I am saying it's full
>of pitfalls to get it right.

You are spot-on regarding pitfalls. Thanks to some off-list guidance
from you, I have optimized the prototype
https://github.com/dr-m/atomic_sync to some extent.

It turns out that contemporary compilers that target IA-32 or AMD64
cannot translate a single-bit operation such as fetch_or(b)&b or
fetch_and(~b)&b into anything else than a loop around LOCK CMPXCHG,
which is what compare_exchange_weak() or compare_exchange_strong()
translated to. I ended up injecting a LOCK BTS instruction using inline
assembler. On the Microsoft compiler, the next best thing is to have a
loop around both load() and compare_exchange_weak(), in such a way that
we will avoid the read-modify-write operation if the load() indicates
that the bit was already set by some other thread. And on a load/store
architecture like ARM or POWER, we really want to keep fetch_or()
instead of compare_exchange_weak().

Surprisingly, on my Intel Xeon system, CMAKE_BUILD_TYPE=RelWithDebInfo
with clang++-13 was considerably faster than the one built with g++-11.
I did not look at the generated code or try different compiler options.

When compiled with clang++-13, the test program on my system is about
35% slower when using both CPU sockets (NUMA nodes), compared to using
only half the processing power (1 CPU socket). Of course, the benchmark
sets a bad example: in real applications, one should avoid mutex
contention by splitting data structures in a reasonable way. Maybe with
some more effort spent on the spinloop implementation it could be
possible to reduce the gap between single-CPU and NUMA performance.

>And we will get feature creep that doesn't exist in QMutex. QMutex is
>part of Qt, so it isn't common in high-performance applications running
>on server products. This slim mutex, on the other hand, is a good
>candidate for that, so it may need to support transactional memory too.

I came across https://github.com/ARM-software/libTLE that defines a
mutex based on transactional memory, on 2 ISA extensions: AMD64 using
the RTM (Restricted Transactional Memory) variant of Intel TSX-NI, and
ARMv8 with TME (Transactional Memory Extensions). It basically adds an
xbegin() retry loop to lock() and an xend() to unlock(). Some state
information has to be carried, so that lock() and unlock() can
transparently fall back to acquiring the mutex if the lock elision
fails.

I think that there are use cases for both a plain atomic-based mutex and
a transactional atomic-based mutex, even in a software system that makes
use of transactional memory.

If small memory footprint is important and locking conflicts are
extremely unlikely and the worst case size critical section is very
large (say, a lock protects a number of hash table pointers, and the
linked list attached to one hash table entry is very long), we might
want to avoid lock elision capability altogether.

Instead of adding the lock elision state information to the mutex
itself, it could be wiser to use something similar to std::lock_guard,
say, transactional_lock_guard, keeping track of how the underlying mutex
is supposed to be released.

In that way, the same underlying atomic-based mutex or shared_mutex
could be used with both conventional and transactional memory. We might
only need a few more (inline) member functions for use by the
transactional_lock_guard wrapper. It might turn out that the
is_locked_or_waiting() or is_locked() predicates are sufficient.

Finally (this is getting somewhat off topic), in some cases it might be
possible to avoid accessing a mutex altogether, and directly access the
normally-mutex-protected memory within a memory transaction. Also in
this case, my proposed atomic_mutex and atomic_shared_mutex would work
as is: they would then only be acquired and released in the "fallback"
code path.

Even more off-topic: How would you envision support for transactional
memory in C++ in general? Should there be an option to map a memory
transaction abort into an exception?

To my understanding, the practical availability of transactional memory
is still somewhat limited. On the AMD64 ISA, only some (not all) Intel
processors support transactional memory, while ones manufactured by AMD
do not. So, CPU detection is mandatory, and fallback code paths must be
implemented. (But that is a given: memory transactions may be aborted
multiple times, and a fallback to full-blown locking must always be
available.)

>Anyway, can I ask that we get a spinlock before we get a mutex? Before
>C++17 it was impossible to get it right with std::atomic_flag; right,
>it's just non-obvious.

My implementation includes an optional spinloop before entering the
std::atomic::wait() sleep. It shouldn't be hard to extract that code
into a separate atomic_spin_mutex or atomic_spin_shared_mutex. In fact,
before I started looking into futex in the database server code base, I
implemented an atomic-based spinlock for a special case where the
critical section is always short: just reading, adding or removing a
reference-count protected hash table entry.

Best regards,

Marko

Received on 2021-09-28 13:59:02