std-proposals: Re: Slim mutexes and locks based on C++20 std::atomic::wait

From: Marko Mäkelä <marko.makela_at_[hidden]>
Date: Tue, 24 Aug 2021 10:01:53 +0300

Mon, Aug 23, 2021 at 02:41:47PM -0700, Thiago Macieira via Std-Proposals wrote:
>On Monday, 23 August 2021 09:49:19 PDT Marko Mäkelä via Std-Proposals wrote:
>> I would welcome suggestions on how to improve the CAS loops in
>> atomic_mutex::wait_and_lock() and atomic_sux_lock::s_trylock(). Some
>> performance tests on AMD64 and ARMv8 should be easy to arrange for the
>> database server where similar code is present.
>
>Always loop on a read operation, until you see a value that matches your
>request. Then and only then perform the CAS or equivalent mutating operation
>(which may fail).

Thank you. I realized that atomic_mutex::wait_and_lock() can be
simplified to the following (omitting the assertions and the optional
spinloop):

void atomic_mutex::wait_and_lock() noexcept
{
   uint32_t lk = 1 + fetch_add(1, std::memory_order_relaxed);
   for (;;)
   {
     lk = fetch_or(HOLDER, std::memory_order_relaxed);
     if (!(lk & HOLDER))
     {
       std::atomic_thread_fence(std::memory_order_acquire);
       return;
     }
     wait(lk);
   }
}

When the mutex is held by a single thread with no waiters, the lock word
value will be HOLDER+1 instead of the value HOLDER that the previous
implementation used.

For atomic_sux_lock::s_trylock(), I did not figure out any way to
replace the compare-and-swap loop. Already before my refactoring of the
locks, the database server code base that I work on used a CAS loop for
acquiring a read-lock on a page.

I updated https://github.com/dr-m/atomic_sync with this and also tested
the corresponding change in the database server code base. It did
slightly improve throughput even when binding all threads to a single
NUMA node. With NUMA the impact might have been larger.

>Again, you need to provide very compelling reason. I'm not sure this is
>it

Sure. We will see how it goes.

>But think about whether this functionality should be standardised in
>the first place. There's a high cost for everything going in the
>standard, since multiple implementations must support it, it must be
>generic enough to work on a great number of architectures, and it can't
>change easily for the next couple of decades.

Right. A header-only or header-mostly implementation could introduce new
type of ABI compatibility trouble in case an implementation used the
wrong memory ordering somewhere.

>So, are there enough users for it to warrant being part of the
>standard?
>
>Or would making it part of a third-party library be more
>cost-effective, for the people who do need it and know they need it,
>and narrow its use-case sufficiently to allow it to be better tested?

Good points. I thought that having this in the standard library would
create more pressure to operating system developers to provide some
futex-like functionality.

A third-party library such as Boost would be a good fallback solution if
this effort fails. In any case, it could provide a fallback solution for
those who are cannot adopt the latest version of the standard yet.

>In fact, a mutex can be as small as a single byte. See
>https://webkit.org/blog/6161/locking-in-webkit/

Sure, a spinlock can be trivially implemented in 1 byte. Microsoft
Windows does allow a 1-byte futex, but on Linux (and supposedly other
systems) a futex must be associated with exactly 4 bytes in user mode
address space.

Using a 1-byte futex, one could implement a 1-byte mutex for at most 128
concurrent threads. That might be fine in a special case, but certainly
insufficient in a generic case. Some form of indirection (like the
ParkingLot in the above page) would be needed to ensure that on release,
a system call will be invoked if and only if pending requests exist.

In practice, having an implementation in a third-party library could be
better for the near term. I think that a long-term solution that avoids
the need for third-party libraries is worth some effort.

A software project that is distributed in source code form might better
stick to something that compiles out-of-the-box with whatever compiler
the currently oldest supported GNU/Linux distribution is shipping. That
might restrict the code base to a subset of C++11. Something that was
introduced in C++20 or C++23 could still be years away from widespread
adoption.

Best regards,

        Marko Mäkelä

Received on 2021-08-24 02:02:03