Date: Mon, 23 Aug 2021 19:49:19 +0300
Fri, Aug 20, 2021 at 09:21:28AM -0700, Thiago Macieira via Std-Proposals wrote:
>On Thursday, 19 August 2021 22:45:09 PDT Marko Mäkelä via Std-Proposals wrote:
>> I believe that the code works wherever std::atomic::wait() is available
>> (currently, this includes MSVC on Windows, GCC or Clang on Linux with
>> libstdc++-11). There is also a fallback implementation for C++11, using
>> futex system calls on Linux or OpenBSD.
>
>Which is your best implementation by far. Aside from looping on CAS, which you
>should never do.
Sure, the loop on CAS could possibly be improved.
The atomic-based mutex is a building block of the more generic
atomic-based rw-locks, but it could also be useful on its own.
The rw-locks are composed of two futexes (or objects that support
std::atomic::wait()), to have separate wake-up queues for:
(1) shared lock requests that wait for an exclusive lock (grant and) release
(2) exclusive lock requests waiting for all shared locks to be released
The composition of two futexes was inspired by (but not directly based on)
http://locklessinc.com/articles/sleeping_rwlocks/
I had two earlier variants of the writer-prioritizing rw-lock that
performed badly. There was only 1 bit for keeping track of pending
exclusive waiters. By design, if multiple exclusive locks can be
requested concurrently, one thread would be granted the exclusive lock,
and upon releasing the lock, it could clear the flag for other waiting
exclusive request(s). To work around the resulting hangs, there were
redundant calls to futex wakeup a.k.a. std::atomic::wake(), and as you
suggested, this was very expensive. Because my atomic_mutex counts the
waiting requests in the lock word, it can avoid such unnecessary system
call overhead. The overhead was observed to cause a significant
performance impact on both Linux and Microsoft Windows.
In an even earlier variant, I tried to be clever about shared lock
acquisition and simply used fetch_add(), followed by fetch_sub()
"back-off" if the exclusive lock was already acquired or being waited
for. This resulted in starvation, witnessed by long no-progress
execution traces from https://rr-project.org/. That was the reason why I
reverted back to a CAS loop for shared lock acquisition
(atomic_sux_lock::s_trylock()).
I would welcome suggestions on how to improve the CAS loops in
atomic_mutex::wait_and_lock() and atomic_sux_lock::s_trylock(). Some
performance tests on AMD64 and ARMv8 should be easy to arrange for the
database server where similar code is present.
>As Ville said, you're going to have to do a lot of explaining here.
>Explicitly, you'll need to explain why this type exists in addition to
>std::mutex, and give a dozen examples on how to choose one or the
>other.
Sure. The basic suggestion would be to always use std::mutex or
std::shared_mutex except when there are strong reasons to do otherwise:
(1) one would need a shared_mutex with a 3rd mode (Update) that allows
concurrent Shared locks, but not concurrent Update or eXclusive locks;
(2) the mutex has to be very small for a specific application (millions
or billions of instances, or embedded deep in some array)
I did not implement a slim counterpart of std::condition_variable
because there was no need for it in my code base. If condition variables
are needed, a normal mutex can be used.
I would say that generally, std::mutex is the preferred choice, because
various debugging helps are available. When using atomic_mutex, you are
on your own, by design, because there is no 'bloat' in the underlying
data structures. You have to implement your own debug guards. Examples
of that could be provided, if the conventions of the standard allow
that.
>From experience, being the maintainer of QMutex, I can tell you that
>your simple mutex is what *I* want, and the result is that all QMutex
>are that. There's no way to access a pthread_mutex_t from the Qt API.
>There's no intention on providing compatibility.
>
>std::mutex, on the other hand, explicitly provided that compatibility.
>So now legacy exists and we need to explain why one and not the other.
That sounds great! Hypothetically speaking, would it only be a matter of
ABI compatibility?
I appreciate Ville's offer regarding helping with the editing, and I
will follow up with him off list.
Thank you,
Marko Mäkelä
>On Thursday, 19 August 2021 22:45:09 PDT Marko Mäkelä via Std-Proposals wrote:
>> I believe that the code works wherever std::atomic::wait() is available
>> (currently, this includes MSVC on Windows, GCC or Clang on Linux with
>> libstdc++-11). There is also a fallback implementation for C++11, using
>> futex system calls on Linux or OpenBSD.
>
>Which is your best implementation by far. Aside from looping on CAS, which you
>should never do.
Sure, the loop on CAS could possibly be improved.
The atomic-based mutex is a building block of the more generic
atomic-based rw-locks, but it could also be useful on its own.
The rw-locks are composed of two futexes (or objects that support
std::atomic::wait()), to have separate wake-up queues for:
(1) shared lock requests that wait for an exclusive lock (grant and) release
(2) exclusive lock requests waiting for all shared locks to be released
The composition of two futexes was inspired by (but not directly based on)
http://locklessinc.com/articles/sleeping_rwlocks/
I had two earlier variants of the writer-prioritizing rw-lock that
performed badly. There was only 1 bit for keeping track of pending
exclusive waiters. By design, if multiple exclusive locks can be
requested concurrently, one thread would be granted the exclusive lock,
and upon releasing the lock, it could clear the flag for other waiting
exclusive request(s). To work around the resulting hangs, there were
redundant calls to futex wakeup a.k.a. std::atomic::wake(), and as you
suggested, this was very expensive. Because my atomic_mutex counts the
waiting requests in the lock word, it can avoid such unnecessary system
call overhead. The overhead was observed to cause a significant
performance impact on both Linux and Microsoft Windows.
In an even earlier variant, I tried to be clever about shared lock
acquisition and simply used fetch_add(), followed by fetch_sub()
"back-off" if the exclusive lock was already acquired or being waited
for. This resulted in starvation, witnessed by long no-progress
execution traces from https://rr-project.org/. That was the reason why I
reverted back to a CAS loop for shared lock acquisition
(atomic_sux_lock::s_trylock()).
I would welcome suggestions on how to improve the CAS loops in
atomic_mutex::wait_and_lock() and atomic_sux_lock::s_trylock(). Some
performance tests on AMD64 and ARMv8 should be easy to arrange for the
database server where similar code is present.
>As Ville said, you're going to have to do a lot of explaining here.
>Explicitly, you'll need to explain why this type exists in addition to
>std::mutex, and give a dozen examples on how to choose one or the
>other.
Sure. The basic suggestion would be to always use std::mutex or
std::shared_mutex except when there are strong reasons to do otherwise:
(1) one would need a shared_mutex with a 3rd mode (Update) that allows
concurrent Shared locks, but not concurrent Update or eXclusive locks;
(2) the mutex has to be very small for a specific application (millions
or billions of instances, or embedded deep in some array)
I did not implement a slim counterpart of std::condition_variable
because there was no need for it in my code base. If condition variables
are needed, a normal mutex can be used.
I would say that generally, std::mutex is the preferred choice, because
various debugging helps are available. When using atomic_mutex, you are
on your own, by design, because there is no 'bloat' in the underlying
data structures. You have to implement your own debug guards. Examples
of that could be provided, if the conventions of the standard allow
that.
>From experience, being the maintainer of QMutex, I can tell you that
>your simple mutex is what *I* want, and the result is that all QMutex
>are that. There's no way to access a pthread_mutex_t from the Qt API.
>There's no intention on providing compatibility.
>
>std::mutex, on the other hand, explicitly provided that compatibility.
>So now legacy exists and we need to explain why one and not the other.
That sounds great! Hypothetically speaking, would it only be a matter of
ABI compatibility?
I appreciate Ville's offer regarding helping with the editing, and I
will follow up with him off list.
Thank you,
Marko Mäkelä
Received on 2021-08-23 11:49:34