std-proposals: Re: Slim mutexes and locks based on C++20 std::atomic::wait

From: Marko Mäkelä <marko.makela_at_[hidden]>
Date: Fri, 15 Oct 2021 10:51:25 +0300

Mon, Aug 23, 2021 at 02:41:47PM -0700, Thiago Macieira via Std-Proposals wrote:
>On Monday, 23 August 2021 09:49:19 PDT Marko Mäkelä via Std-Proposals wrote:
>>>As Ville said, you're going to have to do a lot of explaining here.
>>>Explicitly, you'll need to explain why this type exists in addition
>>>to std::mutex, and give a dozen examples on how to choose one or the
>>>other.
>>
>>Sure. The basic suggestion would be to always use std::mutex or
>>std::shared_mutex except when there are strong reasons to do
>>otherwise:
>>
>>(1) one would need a shared_mutex with a 3rd mode (Update) that allows
>>concurrent Shared locks, but not concurrent Update or eXclusive locks;
>>(2) the mutex has to be very small for a specific application
>>(millions or billions of instances, or embedded deep in some array)
>
>Again, you need to provide very compelling reason. I'm not sure this is
>it (I'm not in the committee, so my opinion may not be reflective of
>theirs). But think about whether this functionality should be
>standardised in the first place.

I now have practical evidence for some further compelling reasons.

A contributor noticed that adding a spinloop to a mutex acquisition
improved throughput in his concurrency tests.

Someone pointed out that such a spinloop feature already exists in the
GNU libc implementation of pthread_mutex_t. The undocumented static
initializer PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP (mutex type
PTHREAD_MUTEX_ADAPTIVE_NP) enables spinloops of adaptive duration. The
_NP suffix means non-portable.

Compelling reason (3): There is no spinloop-based std::mutex::lock().

I may be mistaken, but I do not think there is a way to specify a
pthread_mutexattr_t parameter to a std::mutex. Even the
std::mutex::native_handle would not help if the mutex attribute cannot
be specified after pthread_mutex_init().
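
For reference, here is a minimal sketch of how the adaptive type has to be
requested with plain pthreads, assuming GNU libc. The helper name
init_adaptive_mutex() is hypothetical. The point is that the type travels in
a pthread_mutexattr_t that is consumed by pthread_mutex_init(), which is
exactly the step that std::mutex does not expose.

```cpp
#include <pthread.h>

// Hypothetical helper, not part of any library: initialise *m as an
// adaptive (spin, then block) mutex where GNU libc offers that type,
// and as a default mutex elsewhere.
// Returns 0 on success, or a pthread error number.
inline int init_adaptive_mutex(pthread_mutex_t *m)
{
  pthread_mutexattr_t attr;
  int err = pthread_mutexattr_init(&attr);
  if (err)
    return err;
#ifdef PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP
  // The initializer macro doubles as a feature test for the GNU extension.
  err = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
#endif
  if (!err)
    err = pthread_mutex_init(m, &attr); // the type is fixed from here on
  pthread_mutexattr_destroy(&attr);
  return err;
}
```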

Compelling reason (4): For each individual lock() operation, we may want
to specify whether to spin.

In https://jira.mariadb.org/browse/MDEV-26779 it turned out that
enabling the spinning logic of PTHREAD_ADAPTIVE_MUTEX_INITIALIZER_NP
actually reduced performance. Best performance was achieved by replacing
some (not all!) pthread_mutex_lock() operations with a custom spinloop
around pthread_mutex_trylock().
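
A rough sketch of such a per-call spinloop follows. It is not the code from
MDEV-26779: the name spin_then_lock(), the spin_rounds parameter and the use
of std::this_thread::yield() as a portable stand-in for a CPU pause
instruction are all assumptions made for illustration. It works with any
type that has try_lock() and lock(), such as std::mutex.

```cpp
#include <mutex>
#include <thread>

// Hypothetical helper: poll try_lock() up to spin_rounds times, then fall
// back to a blocking lock().
// Returns true if the lock was acquired while spinning, false if the
// blocking fallback was needed. Either way the caller owns the lock.
template<class Mutex>
bool spin_then_lock(Mutex &m, unsigned spin_rounds)
{
  for (unsigned i = 0; i < spin_rounds; i++)
  {
    if (m.try_lock())
      return true;
    // Assumption: a real spinloop would issue a pause instruction here.
    std::this_thread::yield();
  }
  m.lock();
  return false;
}
```

A caller would use spin_then_lock(m, n) only at the call sites where
spinning helps, and plain m.lock() everywhere else.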

Based on this, I will have to revise my proposal to introduce member
functions like spin_lock(). For my proposed atomic_spin_mutex and
atomic_spin_shared_mutex variants of atomic_mutex and
atomic_shared_mutex, these spin functions would be equivalent to
lock(), lock_shared(), and lock_update().

Compelling reason (5): Efficient implementation of lock elision based on
transactional memory.

The critical sections protected by a lock may vary a lot. There is an
implementation-defined maximum size of a memory transaction. If that is
exceeded, or if a system call would be invoked, the memory transaction
will be aborted, and fall-back code (which would acquire and release the
lock in the old-fashioned way) must be executed.

Aborting a memory transaction and re-executing code hurts performance.

If we enable lock elision in GNU libc, it would be enabled for each and
every invocation of pthread_mutex_lock() or std::mutex::lock() on a
particular mutex. Because there is no std::mutex::is_locked() or
equivalent for pthread_mutex_t, it is impossible to write your own
memory transaction that would elide those locks.

Note: I am not proposing anything regarding transactional memory. This
reason (5) is merely about facilitating that. In
https://github.com/dr-m/atomic_sync you can find sample implementations
of transactional lock guards. Here is how an Intel RTM implementation of
a custom lock guard would use is_locked() to elide lock_shared():

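#ifndef NO_ELISION
# include <immintrin.h> // _xbegin(), _xabort(), _xend()
// have_transactional_memory is a run-time flag declared elsewhere (not shown)
/** @return whether a memory transaction was started */
inline bool xbegin()
{ return have_transactional_memory && _xbegin() == _XBEGIN_STARTED; }
/** Abort the current memory transaction */
inline void xabort() { _xabort(0); }
/** Commit the current memory transaction */
inline void xend() { _xend(); }
#endif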

template<class mutex> class transactional_shared_lock_guard
{
public:
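   transactional_shared_lock_guard(mutex &m) : m(m)
   {
#ifndef NO_ELISION
     if (xbegin())
     {
       // Reading the lock word inside the transaction means that a later
       // acquisition by another thread will abort the transaction.
       if (!m.is_locked())
       {
         elided = true; // run the critical section without acquiring m
         return;
       }
       xabort(); // m is held; execution resumes as if xbegin() had failed
     }
     elided = false;
#endif
     m.lock_shared(); // fall back to acquiring a real shared lock
   }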
...
};

>There's a high cost for everything going in the standard, since
>multiple implementations must support it, it must be generic enough to
>work on a great number of architectures, and it can't change easily for
>the next couple of decades.

My atomic_mutex and atomic_shared_mutex proposal is based on code that
is already being deployed in a popular database server. Some similar
code (spinlock using C++11 std::atomic, without any futex) is present in
an earlier major release, which Debian appears to carry for every
instruction set architecture that it has ever supported, except the
ill-fated AVR32.

I think that the limiting factor is the availability of C++20
std::atomic::wait() and std::atomic::notify_one(). And that
implementation burden was already placed on library maintainers several
years ago.

>Moreover, you're placing an implementation burden on the standard
>library maintainers, for things that are not at all easy to implement?
>We keep finding latent bugs in QSemaphore (which I've also ported to
>futex) and even QMutex for simple things as a memory load needing to be
>Acquire when we wrote Relaxed... (see QTBUG-88247).

Unfortunately, the Total Store Ordering of IA-32, AMD64 and SPARC is
very forgiving of such bugs. Luckily, we have continuous integration
tests running on various POWER and ARM platforms, and more extensive
tests are regularly run on ARMv8 (in addition to AMD64).
https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html nicely
demonstrates how things are different with the weaker memory model of
POWER and ARM. The RISC-V RVWMO should be similar.
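
As a generic illustration of this class of bug (this is not the Qt code
from QTBUG-88247), here is a minimal publish/poll sketch. The names
sketch_mailbox, publish() and poll() are made up. If the load in poll()
used memory_order_relaxed, the read of payload would be a data race
according to the standard, yet on a TSO machine the generated code would
often appear to work.

```cpp
#include <atomic>

// Hypothetical example of a release/acquire pairing.
struct sketch_mailbox
{
  int payload = 0;
  std::atomic<bool> ready{false};

  // Writer: store the data first, then publish it with Release so that
  // the payload write cannot be reordered after the flag write.
  void publish(int value) noexcept
  {
    payload = value;
    ready.store(true, std::memory_order_release);
  }

  // Reader: the Acquire load pairs with the Release store above.
  // Returns false if nothing has been published yet.
  bool poll(int &out) noexcept
  {
    if (!ready.load(std::memory_order_acquire))
      return false;
    out = payload; // ordered after the flag load thanks to Acquire
    return true;
  }
};
```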

>So, are there enough users for it to warrant being part of the
>standard?

If you build it, they will come. See the 5 compelling reasons above.

>>I did not implement a slim counterpart of std::condition_variable
>>because there was no need for it in my code base. If condition
>>variables are needed, a normal mutex can be used.
>
>And that's what Qt does. The moment you need a QWaitCondition, it falls
>back to a platform mutex and cond_var.

Right. Because there is no std::atomic::wait_until() (that is,
std::atomic::wait() with a timeout), it would be challenging to
implement something equivalent to std::condition_variable::wait_until().

In my repository, you can find examples/atomic_condition_variable.h,
which lacks wait_until(). I do not think that it deserves to be in any
standard.

I implemented it because I thought that I might need it in the code base
that I am maintaining. The motivation would have been to use
atomic_mutex instead of pthread_mutex_t for the associated mutex, to be
able to implement lock elision in a particular code path. It turned out
that we can remove the mutex from that code path altogether.

Best regards,

        Marko

Received on 2021-10-15 02:51:35