Date: Tue, 28 Sep 2021 13:03:06 -0700
On Tuesday, 28 September 2021 11:58:55 PDT Marko Mäkelä via Std-Proposals
wrote:
> When compiled with clang++-13, the test program on my system is about
> 35% slower when using both CPU sockets (NUMA nodes), compared to using
> only half the processing power (1 CPU socket). Of course, the benchmark
> sets a bad example: in real applications, one should avoid mutex
> contention by splitting data structures in a reasonable way. Maybe with
> some more effort spent on the spinloop implementation it could be
> possible to reduce the gap between single-CPU and NUMA performance.
To simplify a hard hardware problem: contention on a data structure within
one socket can be resolved by the socket itself, which has some fairness
algorithms as well as access to the L3 cache. Once the contention spans NUMA
nodes on more than one socket, those advantages are lost and the contention
is resolved by the memory controller.
You really want to avoid this scenario. This becomes a scheduling problem more
than a locking problem.
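
As a rough illustration of the scheduling side, here is a minimal Linux-only
sketch that restricts the calling thread to a chosen set of CPUs using
sched_getaffinity() and sched_setaffinity(). The helper names allowed_cpus()
and pin_current_thread() are made up for this example. The sketch does not
know which CPUs belong to which NUMA node; that mapping is assumed to come
from elsewhere (for instance libnuma or /sys/devices/system/node).

```cpp
// Linux-only sketch: keep a thread on a caller-chosen set of CPUs.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <cassert>
#include <vector>

// Hypothetical helper: CPUs the calling thread may currently run on.
// Returns an empty vector if the affinity mask cannot be read.
std::vector<int> allowed_cpus() {
    cpu_set_t set;
    CPU_ZERO(&set);
    std::vector<int> cpus;
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return cpus;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set))
            cpus.push_back(cpu);
    return cpus;
}

// Hypothetical helper: pin the calling thread to the given CPUs.
// Which CPUs share a socket is an assumption the caller must supply.
bool pin_current_thread(const std::vector<int>& cpus) {
    if (cpus.empty())
        return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu : cpus) {
        assert(cpu >= 0 && cpu < CPU_SETSIZE);
        CPU_SET(cpu, &set);
    }
    // pid 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```

Each worker thread that touches the contended structure would call
pin_current_thread() with the CPU list of a single socket before entering
its loop. Running the whole process under "numactl --cpunodebind=0" achieves
a similar effect without code changes.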
> I came across https://github.com/ARM-software/libTLE that defines a
> mutex based on transactional memory, on 2 ISA extensions: AMD64 using
> the RTM (Restricted Transactional Memory) variant of Intel TSX-NI, and
> ARMv8 with TME (Transactional Memory Extensions). It basically adds an
> xbegin() retry loop to lock() and an xend() to unlock(). Some state
> information has to be carried, so that lock() and unlock() can
> transparently fall back to acquiring the mutex if the lock elision
> fails.
glibc NPTL's pthread_mutex also has that, which means std::mutex with either
libstdc++ or libc++ can use it. It used to be a separate mutex type, which
would require you to re-init the mutex behind std::mutex's back (a "void
warranty" problem), but it looks like nowadays you can enable it by setting an
environment variable. See
https://www.gnu.org/software/libc/manual/html_node/Elision-Tunables.html
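
To make that concrete, here is a small sketch of a contended std::mutex
workload that needs no source changes to try elision. The function name
contended_count() and its parameters are invented for this example. Whether
elision actually happens is an assumption about the environment: it needs a
glibc built with elision support and a CPU with TSX/RTM still enabled.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical micro-workload: several threads increment one counter
// under a plain std::mutex, which is pthread_mutex underneath on glibc.
long contended_count(unsigned threads, long iterations_per_thread) {
    std::mutex m;
    long counter = 0;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (long i = 0; i < iterations_per_thread; ++i) {
                std::lock_guard<std::mutex> guard(m);
                ++counter;
            }
        });
    }
    for (auto& th : pool)
        th.join();
    // Sanity check: the lock must have serialised every increment,
    // with or without elision.
    assert(counter == static_cast<long>(threads) * iterations_per_thread);
    return counter;
}
```

A program built around this can be run once as-is and once with
GLIBC_TUNABLES=glibc.elision.enable=1 in the environment to compare timings.
The result must be identical in both runs; only the timing may differ.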
> I think that there are use cases for both a plain atomic-based mutex and
> a transactional atomic-based mutex, even in a software system that makes
> use of transactional memory.
>
> If small memory footprint is important and locking conflicts are
> extremely unlikely and the worst case size critical section is very
> large (say, a lock protects a number of hash table pointers, and the
> linked list attached to one hash table entry is very long), we might
> want to avoid lock elision capability altogether.
Right. I don't know why glibc maintainers decided to switch from a per-mutex
choice to a global one. We can research that off-list and maybe come back to
suggest different types of mutexes for the C++ Standard Library.
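
For discussion purposes, a plain atomic-based mutex with a tiny footprint
could look roughly like the sketch below. The class name spin_mutex is made
up here; this is neither a proposal nor what libTLE or glibc implement. It
yields instead of sleeping in the kernel, so it only suits short critical
sections.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical sketch: a mutex whose entire state is one atomic<bool>.
// It meets the Lockable requirements, so std::lock_guard works with it.
class spin_mutex {
    std::atomic<bool> locked_{false};

public:
    void lock() noexcept {
        for (;;) {
            // Try to take the lock; acquire pairs with release in unlock().
            if (!locked_.exchange(true, std::memory_order_acquire))
                return;
            // Spin on a plain load to avoid hammering the cache line
            // with writes while another thread holds the lock.
            while (locked_.load(std::memory_order_relaxed))
                std::this_thread::yield();
        }
    }

    bool try_lock() noexcept {
        return !locked_.load(std::memory_order_relaxed) &&
               !locked_.exchange(true, std::memory_order_acquire);
    }

    void unlock() noexcept {
        // Unlocking a mutex that is not locked is a caller bug.
        assert(locked_.load(std::memory_order_relaxed));
        locked_.store(false, std::memory_order_release);
    }
};
```

Because it carries no elision state, such a type stays small, which matters
when one lock is embedded per hash bucket or per object.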
> Even more off-topic: How would you envision support for transactional
> memory in C++ in general? Should there be an option to map a memory
> transaction abort into an exception?
Wasn't there a study group on this?
https://isocpp.org/std/the-committee says SG5 was Transactional Memory. Is it
still active? Maybe ping the chairs.
-- Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org Software Architect - Intel DPG Cloud Engineering
Received on 2021-09-28 15:03:13