Date: Mon, 25 Nov 2019 22:45:52 -0500
Hi Hans,
I like std::tm_exec as a simple solution that can try hardware (if
available) and fall back to software (initially with just a global local),
leaving room for future software fallback improvements. Given the
uncertainty about hardware and even the viability of transactions using
current libraries, we should avoid anything more complicated or elaborate
than std::tm_exec.
We'll see continued investment in speculative execution issues; CPU
architects feel it's their job to solve hard problems of maximizing code
throughput and all I've talked to seem undaunted. I've been advocating for
speculative execution hiding the latency of atomics and TSX by assuming
uncontended success and speculatively executing ahead, backtracking if that
assumption proves wrong. If we had that, we could even ask future CPUs to
guarantee strong memory ordering, and use speculation to hide the cost in
the uncontended case...
However...
1) Current TSX works but it's not 100% clear that it's worth continuing to
make available on future CPUs. The throughput cost (~60 clocks) makes it
expensive for simple DCAS style usage, and being bound by cache footprint
prevents it from being useful for large transactions. Thus what's left is
the middle, which is a benefit in some specialized database applications,
but we won't see OS kernels redesigned any time soon.
2) Expanding TSX to be unbounded by cache size has been investigated and is
possible, but it's not clear that actual applications will be able to
utilize it due to the rate of transaction conflicts, both real and false.
3) The problem of false conflicts is significant in real-world code. For
example, consider an ordered list data structure implemented as a tree.
Logically, if two transactions each add a distinct value to the ordered
list, that shouldn't cause read-write dependency. But physically, many
implementations would update parent nodes, causing false dependencies that
force the transactions to be serialized.
The problem of false conflicts isn't specific to TSX or even to hardware
transactions, but would be a flaw in any C++ language-level solution that
tracked read-write dependencies in memory. Ultimately, these read-write
tracking implementations say that computations A and B can run in parallel
if they don't have any physical write-read or write-write dependencies. But
what we really want to ask is whether the high-level observable state of
the program is commutative: if it's the same if we run A then B as if we
had run B then A. For a specific object, consider a game that handles
collision detection using an octree. The fact that a monster is moving in
my backyard shouldn't affect collision detection in my kitchen.
Various data structures would want different ways of determining
commutativity that are less conflict-prone than low-level read-write
dependencies, but the design space has barely been explored. For example,
in the monster case, we want to be able to run a lot of read-write
move_monster() operation in parallel with a bunch of read-only
check_collision() operations, and as they retire, ensure that the
return-value result of the check_collision() operations were unchanged by
the move_monster() operations, instead of asking whether they read and
wrote the same memory. Most games actually use a version of this idea to
hide network latency, by speculatively predicting local player movement,
and correcting and rerunning if wrong.
Because of TSX being limited to tracking read-write dependencies on cache
lines, and any broader solution being extremely complex, I think we'll only
see TSX used for small transactions. My advice to CPU architects is to
treat TSX as a fancy way of enabling DCAS supporting a bit of logic and
control flow. You'd use it only for small, low-level, library-like
operations like adding an element to a double-linked list in a thread-safe
way.
*Conjecture*
I believe that bigger transactions will rely on a software implementation,
which uses new transaction-friendly libraries replacing the std container
classes, and gives programmers the ability to implement data structures
that specify, in code, whether transactions commute. Ideally such a library
would provide:
- container data types (notionally immutable)
- future<t>
- var<t> for atomic and transaction-safe mutable data
- transactions based on isolated speculative execution of code that can be
committed or aborted
- an effects system for tracking read/write/exception/io/user dependencies,
sequencing, and commutativity
- concurrent garbage collection
I've been experimenting with this approach in my spare time for the past
few years. The constraints each aspect composes turn out to be fairly
complementary, but the overhead is daunting. If we could scale Fortnite
from 100 players in a map to 1000 players, we'd be happy to pay a 4x
overhead and run the simulation on a 40-core server. But I'm pretty far
from "just 4x".
-Tim
On Mon, Nov 25, 2019 at 8:09 PM Hans Boehm <boehm_at_[hidden]> wrote:
> Tim, Herb -
>
> As you may have seen, SG5 generated an initial proposal for a simplified
> transactional memory facility. See wg21.link/p1875. This was reviewed by
> SG1. See http://wiki.edg.com/bin/view/Wg21belfast/P1875r0, or my short
> summary below. We'd be very interested in any additional reactions you may
> have.
>
> One additional concern, brought up in later SG5 discussions, is that we're
> not terribly clear on the perception/status of various hardware
> transactional memory (HTM) facilities, in light of the fact that current
> best effort HTM facilities seem ideally suited for side-channel attacks
> that need to inspect cache state. Given that this facility is becoming more
> focused on exposing hardware support, this seems relevant. None of us have
> a good idea where hardware vendors are going with this.
>
> Thanks!
>
> Hans
>
> My summary of SG1 reaction to proposal [with some comments from subsequent
> SG5 discussion] :
> -------------------------------------------------------
>
> Generally people seemed positively inclined. People liked the idea of
> cutting back on the original TS. There weren't really any requests to put
> specific pieces back.
>
> Many people would like a mechanism that allows relatively low-level
> portable access to HTM facilities. That generally seemed to be the highest
> priority, and is consistent with what we've heard before.
>
> Along the same lines, there was apprehension about not being able to get
> feedback as to whether a transaction was actually executing in HTM. Several
> people wanted a way to explicitly observe retries. I think especially the
> HPC community needs something like that to diagnose the performance
> problems that otherwise arise from a single global lock fallback.
>
> There was concern that the RAII-based syntax allowed semi-nonsensical
> constructs that would be hard to handle in the implementation. What happens
> if the transaction_scope object is allocated somewhere other than on the
> stack? How would you implement that? Even in cases in which it is owned by
> a stack object, and thus does kind-of make sense, it seems problematic.
>
> We did end up taking a quick poll about the two other options in SG1, and
> it was 8 to 3 in favor of language syntax, with a significant number of
> people not voting. I have little confidence that this would go the same way
> in other groups. I may ask around to get some impression of how this might
> go in EWG, which actually has more of a voice here. [ SG5 is happy with
> going the syntax route, justified by varargs issues with the lambda
> approach, and SG1 concerns about the RAII approach. ]
>
> There was some interesting discussion of co_await in transactions, and
> "coroutines" in general. (C++ "coroutines" are very customizable, and may
> not behave much like actual coroutines. So this question is probably a bit
> more important and subtle than it seems at first glance.) We need to think
> about this when drafting wording. I currently don't have good intuitions
> about this. [ SG5 consensus afterwards was to initially not require support
> for anything coroutine-related in transactions. The current plan is to
> allow implementations to support anything they want in transactions anyway.
> ]
>
> The abuse of constexpr to define a minimal set of transaction-safe library
> facilities was viewed with suspicion. It was pointed out that we would
> probably need to list some exceptions in both directions. (Which I think is
> OK, since I view this as an opportunistic specification mechanism to avoid
> specifying transaction-safety for every standard library routine.) I did
> not hear an actual better alternative.
>
I like std::tm_exec as a simple solution that can try hardware (if
available) and fall back to software (initially with just a global local),
leaving room for future software fallback improvements. Given the
uncertainty about hardware and even the viability of transactions using
current libraries, we should avoid anything more complicated or elaborate
than std::tm_exec.
We'll see continued investment in speculative execution issues; CPU
architects feel it's their job to solve hard problems of maximizing code
throughput and all I've talked to seem undaunted. I've been advocating for
speculative execution hiding the latency of atomics and TSX by assuming
uncontended success and speculatively executing ahead, backtracking if that
assumption proves wrong. If we had that, we could even ask future CPUs to
guarantee strong memory ordering, and use speculation to hide the cost in
the uncontended case...
However...
1) Current TSX works but it's not 100% clear that it's worth continuing to
make available on future CPUs. The throughput cost (~60 clocks) makes it
expensive for simple DCAS style usage, and being bound by cache footprint
prevents it from being useful for large transactions. Thus what's left is
the middle, which is a benefit in some specialized database applications,
but we won't see OS kernels redesigned any time soon.
2) Expanding TSX to be unbounded by cache size has been investigated and is
possible, but it's not clear that actual applications will be able to
utilize it due to the rate of transaction conflicts, both real and false.
3) The problem of false conflicts is significant in real-world code. For
example, consider an ordered list data structure implemented as a tree.
Logically, if two transactions each add a distinct value to the ordered
list, that shouldn't cause read-write dependency. But physically, many
implementations would update parent nodes, causing false dependencies that
force the transactions to be serialized.
The problem of false conflicts isn't specific to TSX or even to hardware
transactions, but would be a flaw in any C++ language-level solution that
tracked read-write dependencies in memory. Ultimately, these read-write
tracking implementations say that computations A and B can run in parallel
if they don't have any physical write-read or write-write dependencies. But
what we really want to ask is whether the high-level observable state of
the program is commutative: if it's the same if we run A then B as if we
had run B then A. For a specific object, consider a game that handles
collision detection using an octree. The fact that a monster is moving in
my backyard shouldn't affect collision detection in my kitchen.
Various data structures would want different ways of determining
commutativity that are less conflict-prone than low-level read-write
dependencies, but the design space has barely been explored. For example,
in the monster case, we want to be able to run a lot of read-write
move_monster() operation in parallel with a bunch of read-only
check_collision() operations, and as they retire, ensure that the
return-value result of the check_collision() operations were unchanged by
the move_monster() operations, instead of asking whether they read and
wrote the same memory. Most games actually use a version of this idea to
hide network latency, by speculatively predicting local player movement,
and correcting and rerunning if wrong.
Because of TSX being limited to tracking read-write dependencies on cache
lines, and any broader solution being extremely complex, I think we'll only
see TSX used for small transactions. My advice to CPU architects is to
treat TSX as a fancy way of enabling DCAS supporting a bit of logic and
control flow. You'd use it only for small, low-level, library-like
operations like adding an element to a double-linked list in a thread-safe
way.
*Conjecture*
I believe that bigger transactions will rely on a software implementation,
which uses new transaction-friendly libraries replacing the std container
classes, and gives programmers the ability to implement data structures
that specify, in code, whether transactions commute. Ideally such a library
would provide:
- container data types (notionally immutable)
- future<t>
- var<t> for atomic and transaction-safe mutable data
- transactions based on isolated speculative execution of code that can be
committed or aborted
- an effects system for tracking read/write/exception/io/user dependencies,
sequencing, and commutativity
- concurrent garbage collection
I've been experimenting with this approach in my spare time for the past
few years. The constraints each aspect composes turn out to be fairly
complementary, but the overhead is daunting. If we could scale Fortnite
from 100 players in a map to 1000 players, we'd be happy to pay a 4x
overhead and run the simulation on a 40-core server. But I'm pretty far
from "just 4x".
-Tim
On Mon, Nov 25, 2019 at 8:09 PM Hans Boehm <boehm_at_[hidden]> wrote:
> Tim, Herb -
>
> As you may have seen, SG5 generated an initial proposal for a simplified
> transactional memory facility. See wg21.link/p1875. This was reviewed by
> SG1. See http://wiki.edg.com/bin/view/Wg21belfast/P1875r0, or my short
> summary below. We'd be very interested in any additional reactions you may
> have.
>
> One additional concern, brought up in later SG5 discussions, is that we're
> not terribly clear on the perception/status of various hardware
> transactional memory (HTM) facilities, in light of the fact that current
> best effort HTM facilities seem ideally suited for side-channel attacks
> that need to inspect cache state. Given that this facility is becoming more
> focused on exposing hardware support, this seems relevant. None of us have
> a good idea where hardware vendors are going with this.
>
> Thanks!
>
> Hans
>
> My summary of SG1 reaction to proposal [with some comments from subsequent
> SG5 discussion] :
> -------------------------------------------------------
>
> Generally people seemed positively inclined. People liked the idea of
> cutting back on the original TS. There weren't really any requests to put
> specific pieces back.
>
> Many people would like a mechanism that allows relatively low-level
> portable access to HTM facilities. That generally seemed to be the highest
> priority, and is consistent with what we've heard before.
>
> Along the same lines, there was apprehension about not being able to get
> feedback as to whether a transaction was actually executing in HTM. Several
> people wanted a way to explicitly observe retries. I think especially the
> HPC community needs something like that to diagnose the performance
> problems that otherwise arise from a single global lock fallback.
>
> There was concern that the RAII-based syntax allowed semi-nonsensical
> constructs that would be hard to handle in the implementation. What happens
> if the transaction_scope object is allocated somewhere other than on the
> stack? How would you implement that? Even in cases in which it is owned by
> a stack object, and thus does kind-of make sense, it seems problematic.
>
> We did end up taking a quick poll about the two other options in SG1, and
> it was 8 to 3 in favor of language syntax, with a significant number of
> people not voting. I have little confidence that this would go the same way
> in other groups. I may ask around to get some impression of how this might
> go in EWG, which actually has more of a voice here. [ SG5 is happy with
> going the syntax route, justified by varargs issues with the lambda
> approach, and SG1 concerns about the RAII approach. ]
>
> There was some interesting discussion of co_await in transactions, and
> "coroutines" in general. (C++ "coroutines" are very customizable, and may
> not behave much like actual coroutines. So this question is probably a bit
> more important and subtle than it seems at first glance.) We need to think
> about this when drafting wording. I currently don't have good intuitions
> about this. [ SG5 consensus afterwards was to initially not require support
> for anything coroutine-related in transactions. The current plan is to
> allow implementations to support anything they want in transactions anyway.
> ]
>
> The abuse of constexpr to define a minimal set of transaction-safe library
> facilities was viewed with suspicion. It was pointed out that we would
> probably need to list some exceptions in both directions. (Which I think is
> OK, since I view this as an opportunistic specification mechanism to avoid
> specifying transaction-safety for every standard library routine.) I did
> not hear an actual better alternative.
>
Received on 2019-11-25 21:48:31