Re: Slim mutexes and locks based on C++20 std::atomic::wait

From: Marko Mäkelä <marko.makela_at_[hidden]>
Date: Mon, 4 Oct 2021 10:25:28 +0300
On Tue, Sep 28, 2021 at 01:03:06PM -0700, Thiago Macieira wrote:
>On Tuesday, 28 September 2021 11:58:55 PDT Marko Mäkelä wrote:
>>I think that there are use cases for both a plain atomic-based mutex
>>and a transactional atomic-based mutex, even in a software system that
>>makes use of transactional memory.
>>If small memory footprint is important and locking conflicts are
>>extremely unlikely and the worst case size critical section is very
>>large (say, a lock protects a number of hash table pointers, and the
>>linked list attached to one hash table entry is very long), we might
>>want to avoid lock elision capability altogether.
>Right. I don't know why glibc maintainers decided to switch from a
>per-mutex choice to a global one. We can research that off-list and
>maybe come back to suggest different types of mutexes for the C++
>Standard Library.

I updated https://github.com/dr-m/atomic_sync with
transactional_lock_guard, transactional_shared_lock_guard,
transactional_update_lock_guard that are simple RAII wrappers around
my proposed atomic_mutex and atomic_shared_mutex, using Intel Restricted
Transactional Memory (RTM).

I am not sure whether the transactional lock guards would belong to the
standard library in the near future, before language support has been
implemented. I only wrote the RAII wrappers to demonstrate that my
proposed atomic_mutex and atomic_shared_mutex are already compatible
with transactional memory.
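
To illustrate the idea, here is a minimal sketch of such a RAII wrapper. This is not the implementation from the repository: spin_mutex is a placeholder for the proposed atomic_mutex, and the guard name is hypothetical. On RTM-capable hardware the guard would start a transaction and merely check that the mutex is free (putting the lock word into the transaction's read set); here the RTM path is compiled in only under __RTM__, so the fallback path runs everywhere.

```cpp
#include <atomic>
#if defined(__RTM__)
# include <immintrin.h>  // _xbegin(), _xend(), _xabort()
#endif

// Minimal stand-in for the proposed atomic_mutex: a one-bit spinlock.
// (The real atomic_mutex in dr-m/atomic_sync would block in wait().)
struct spin_mutex {
  std::atomic<bool> locked{false};
  void lock() noexcept {
    while (locked.exchange(true, std::memory_order_acquire))
      /* spin; a real implementation would wait() */;
  }
  void unlock() noexcept { locked.store(false, std::memory_order_release); }
  bool is_locked() const noexcept {
    return locked.load(std::memory_order_relaxed);
  }
};

// Hypothetical sketch of a transactional lock guard: elide the lock
// inside a hardware transaction when possible, otherwise acquire it.
template<class Mutex>
class transactional_lock_guard_sketch {
  Mutex &m;
  bool elided = false;
public:
  explicit transactional_lock_guard_sketch(Mutex &mutex) : m(mutex) {
#if defined(__RTM__)
    if (_xbegin() == _XBEGIN_STARTED) {
      if (!m.is_locked()) { elided = true; return; }  // lock elided
      _xabort(0);  // lock is held: abort and fall back
    }
#endif
    m.lock();  // fallback path: take the lock for real
  }
  ~transactional_lock_guard_sketch() {
#if defined(__RTM__)
    if (elided) { _xend(); return; }
#endif
    m.unlock();
  }
};
```

Note the key property of the elided path: the critical section runs without any store to the lock word, so concurrent readers of the same lock do not conflict with each other.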

Because I do not have access to ARM hardware where the Transactional
Memory Extension (TME) would be available, I did not implement support
for that. Neither did I look up how to access the instructions on the
Microsoft Visual C/C++ compiler. So, for now, it is only for GCC, clang
and compatible compilers targeting IA-32 or AMD64.

As expected, the running time of test/test_atomic_sync increases when
RTM is enabled. This is presumably due to the large number of
transaction aborts and re-executions. If you can suggest a more
realistic yet simple test program, that would be much appreciated.

I also updated some operations to be more IA-32 and AMD64 friendly. I'm
explicitly invoking the 80386 instructions LOCK BTS or LOCK BTR, and in
some cases invoking the 80486 LOCK XADD to reset (actually toggle) the
most significant bit. For std::atomic<uint32_t> and the value x=1U<<31,
the operations fetch_xor(x), fetch_add(x), and fetch_sub(x) are
equivalent. I just checked that both fetch_add(1U<<31) and
fetch_sub(1U<<31) translate nicely into the same LOCK XADD, but no
compiler that I tried was smart enough to translate fetch_xor(1U<<31)
into that.
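
The equivalence itself is easy to check in plain code: with x=1U<<31, adding and subtracting coincide modulo 2^32, and the carry out of bit 31 is discarded, so all three operations just toggle the most significant bit. The helper names below are mine, for illustration only.

```cpp
#include <atomic>
#include <cstdint>

// Three equivalent ways to toggle the most significant bit of a
// std::atomic<uint32_t>; each returns the previous value.
inline uint32_t toggle_msb_add(std::atomic<uint32_t> &a) {
  return a.fetch_add(1U << 31);  // compiles to LOCK XADD on IA-32/AMD64
}
inline uint32_t toggle_msb_sub(std::atomic<uint32_t> &a) {
  return a.fetch_sub(1U << 31);  // also LOCK XADD, with negated operand
}
inline uint32_t toggle_msb_xor(std::atomic<uint32_t> &a) {
  return a.fetch_xor(1U << 31);  // typically a LOCK CMPXCHG loop instead
}
```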

Side note: Based on a quick test on godbolt.org, only clang (since
version 3.6) seems to optimize fetch_add(0), fetch_sub(0), fetch_or(0),
fetch_xor(0), fetch_and(~0) to a MOV preceded by MFENCE. Curiously, on
every compiler that I tested, load() translates into a simple MOV
without any MFENCE. For the trivial fetch_ operations, compilers other
than clang emit LOCK XADD or a loop around LOCK CMPXCHG.
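
For reference, these are the "no-op" read-modify-write operations in question: each returns the previous value and leaves the atomic unchanged, so semantically each is a sequentially consistent read. The wrapper below (my naming) just enumerates them; what differs between compilers is only the generated machine code, not the result.

```cpp
#include <atomic>
#include <cstdint>

// Each variant is a read-modify-write that does not modify: the operand
// is the identity element of the operation, so the stored value is
// unchanged and the previous value is returned.
inline uint32_t rmw_load(std::atomic<uint32_t> &a, int variant) {
  switch (variant) {
    case 0: return a.fetch_add(0);
    case 1: return a.fetch_sub(0);
    case 2: return a.fetch_or(0);
    case 3: return a.fetch_xor(0);
    default: return a.fetch_and(~0U);
  }
}
```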

Best regards,

        Marko Mäkelä

Received on 2021-10-04 02:25:35