std-proposals: Re: Slim mutexes and locks based on C++20 std::atomic::wait

From: Thiago Macieira <thiago_at_[hidden]>
Date: Mon, 23 Aug 2021 14:41:47 -0700

On Monday, 23 August 2021 09:49:19 PDT Marko Mäkelä via Std-Proposals wrote:
> I would welcome suggestions on how to improve the CAS loops in
> atomic_mutex::wait_and_lock() and atomic_sux_lock::s_trylock(). Some
> performance tests on AMD64 and ARMv8 should be easy to arrange for the
> database server where similar code is present.

Always loop on a read operation, until you see a value that matches your
request. Then and only then perform the CAS or equivalent mutating operation
(which may fail).

In x86 speak, you want to loop without LOCK (usually on a CMP + PAUSE) so you
loop while the cacheline is retained in shared mode. See the Intel Software
Optimisation Manual section 11.4.2 "Synchronization for Short Periods".
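
As a rough illustration of that advice, here is a minimal test-and-test-and-set
sketch. The class name ttas_spinlock is hypothetical (it is not the
atomic_mutex from the proposal), and std::this_thread::yield() is only a
portable stand-in for the x86 PAUSE instruction mentioned above:

```cpp
#include <atomic>
#include <cassert>
#include <thread>  // std::this_thread::yield

// Hypothetical illustration only; not code from the proposal.
class ttas_spinlock {
    std::atomic<bool> locked_{false};

public:
    bool try_lock() noexcept {
        // Cheap read first: skip the read-modify-write if the lock looks taken.
        if (locked_.load(std::memory_order_relaxed))
            return false;
        bool expected = false;
        return locked_.compare_exchange_strong(expected, true,
                                               std::memory_order_acquire,
                                               std::memory_order_relaxed);
    }

    void lock() noexcept {
        for (;;) {
            // Read-only wait: plain loads let the cache line stay shared
            // between the waiting cores.
            while (locked_.load(std::memory_order_relaxed))
                std::this_thread::yield();  // stand-in for PAUSE
            // The value matched what we want, so only now attempt the CAS.
            // It may still fail if another thread got there first.
            bool expected = false;
            if (locked_.compare_exchange_weak(expected, true,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed))
                return;
        }
    }

    void unlock() noexcept {
        // Debug check: unlock() must only be called while the lock is held.
        assert(locked_.load(std::memory_order_relaxed));
        locked_.store(false, std::memory_order_release);
    }
};
```

The point is that the inner loop never writes; the locked operation runs only
once a plain load has already seen the lock free.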

> >As Ville said, you're going to have to do a lot of explaining here.
> >Explicitly, you'll need to explain why this type exists in addition to
> >std::mutex, and give a dozen examples on how to choose one or the
> >other.
>
> Sure. The basic suggestion would be to always use std::mutex or
> std::shared_mutex except when there are strong reasons to do otherwise:
>
> (1) one would need a shared_mutex with a 3rd mode (Update) that allows
> concurrent Shared locks, but not concurrent Update or eXclusive locks;
> (2) the mutex has to be very small for a specific application (millions
> or billions of instances, or embedded deep in some array)

Again, you need to provide a very compelling reason. I'm not sure this is it
(I'm not in the committee, so my opinion may not be reflective of theirs). But
think about whether this functionality should be standardised in the first
place. There's a high cost for everything going in the standard, since
multiple implementations must support it, it must be generic enough to work on
a great number of architectures, and it can't change easily for the next
couple of decades. Moreover, you're placing an implementation burden on the
standard library maintainers, for things that are not at all easy to
implement. We keep finding latent bugs in QSemaphore (which I've also ported
to futex) and even QMutex, for things as simple as a memory load needing to be
Acquire when we wrote Relaxed... (see QTBUG-88247). A quick search through
glibc's NPTL support finds they had to fix a pthread_rwlock stall in 2019
too[*] (bug 23844).

So, are there enough users for it to warrant being part of the standard?

Or would making it part of a third-party library be more cost-effective, for
the people who do need it and know they need it, and narrow its use-case
sufficiently to allow it to be better tested?

[*] I searched for the word "fix" and there are way too many other commits
doing other things like moving portions of libpthread into libc in the last
half year, so I may have missed more recent fixes.

> I did not implement a slim counterpart of std::condition_variable
> because there was no need for it in my code base. If condition variables
> are needed, a normal mutex can be used.

And that's what Qt does. The moment you need a QWaitCondition, it falls back
to a platform mutex and cond_var.

> >std::mutex, on the other hand, explicitly provided that compatibility.
> >So now legacy exists and we need to explain why one and not the other.
>
> That sounds great! Hypothetically speaking, would it only be a matter of
> ABI compatibility?

Yes. The standard places no burden on what the native handles are, only that
there must be such a function. An implementation could decide to use thin/
simple mutexes, without PThread compatibility, and make a simple atomic_int
its native_handle.

In fact, a mutex can be as small as a single byte. See
https://webkit.org/blog/6161/locking-in-webkit/

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel DPG Cloud Engineering

Received on 2021-08-23 16:41:54