sg5: Re: [SG5] [EXTERNAL] Fwd: TM-lite proposal

From: Herb Sutter <hsutter_at_[hidden]>
Date: Tue, 26 Nov 2019 20:53:15 +0000

Thanks for the update, Hans and everyone.

It looks like a promising direction and I’m glad to see it’s avoiding mission/feature creep.

Like Tim says, I think library-based prototypes are useful. Later/eventually if we actually standardize something in this area we will likely want atomic { } blocks, which naturally avoid some of the issues with the others (such as making it easy to require stack-based transactions). But no need to snap to that now if it’s not needed for experimentation and validation.

Tim makes excellent points about conflicts and scalability. Though if the tree examples are balanced rb/avl trees as it sounds like, IIUC the conflicts may be actually real.

Re transaction safety of libraries: I think the standard library shouldn’t be a special case, but there should be a simple general answer to whether users can use arbitrary libraries’ features (including std::, but also Boost, Qt, Poco, etc.) inside transactions. I think users are going to assume reasonable all-or-nothing behavior without knowing anything more than whether the callee touches only memory using ordinary (non-atomic<>) operations – any call tree they know does nothing but touching memory ought to Just Work, right? That would be my expectation as a user, and it’s something I can teach. If there’s something missing with that please let me know. But if so, then rather than documenting what things are transaction-safe, wouldn’t it be a more general matter of just knowing/documenting/annotating what functions in a library (including std::) are permitted to have side effects other than ordinary memory operations?

Constexpr (and, even more strongly, consteval) seems like the wrong place to draw a line between transactional/not-transactional. Constexpr is not directly related to memory operations which is the key thing. Plus we’re on a path where eventually nearly everything will be constexpr, including I/O and printf, and we already have multiple implementation examples, including:

* Andrew Sutton’s and Wyatt Childers’ implementation of P0707 and the reflection proposals has compiler.print, which is fully consteval but couldn’t be in any TM transaction.
* Sean Baxter’s Circle compiler<https://www.circle-lang.org/> already does compile-time invocation of real printf (it’s the compiler’s very first “I was successfully installed” sanity check), which again is consteval and not TM-able.

ISTM the clear line to define “function is transactional” should be “does only ordinary (non-atomic) run-time memory operations.” (We can then someday later have a separate but surely fun-filled discussion about whether/why/how consteval concurrency/std::thread/TM.)

I haven’t thought about how this interacts with coroutines.

Thanks for the note!

Herb

From: Tim Sweeney <tim.sweeney_at_[hidden]>
Sent: Monday, November 25, 2019 7:46 PM
To: boehm_at_[hidden]
Cc: sg5_at_lists.isocpp.org; Herb Sutter <hsutter_at_[hidden]>
Subject: [EXTERNAL] Fwd: TM-lite proposal

Hi Hans,

I like std::tm_exec as a simple solution that can try hardware (if available) and fall back to software (initially with just a global local), leaving room for future software fallback improvements. Given the uncertainty about hardware and even the viability of transactions using current libraries, we should avoid anything more complicated or elaborate than std::tm_exec.

We'll see continued investment in speculative execution issues; CPU architects feel it's their job to solve hard problems of maximizing code throughput and all I've talked to seem undaunted. I've been advocating for speculative execution hiding the latency of atomics and TSX by assuming uncontended success and speculatively executing ahead, backtracking if that assumption proves wrong. If we had that, we could even ask future CPUs to guarantee strong memory ordering, and use speculation to hide the cost in the uncontended case...

However...

1) Current TSX works but it's not 100% clear that it's worth continuing to make available on future CPUs. The throughput cost (~60 clocks) makes it expensive for simple DCAS style usage, and being bound by cache footprint prevents it from being useful for large transactions. Thus what's left is the middle, which is a benefit in some specialized database applications, but we won't see OS kernels redesigned any time soon.

2) Expanding TSX to be unbounded by cache size has been investigated and is possible, but it's not clear that actual applications will be able to utilize it due to the rate of transaction conflicts, both real and false.

3) The problem of false conflicts is significant in real-world code. For example, consider an ordered list data structure implemented as a tree. Logically, if two transactions each add a distinct value to the ordered list, that shouldn't cause read-write dependency. But physically, many implementations would update parent nodes, causing false dependencies that force the transactions to be serialized.

The problem of false conflicts isn't specific to TSX or even to hardware transactions, but would be a flaw in any C++ language-level solution that tracked read-write dependencies in memory. Ultimately, these read-write tracking implementations say that computations A and B can run in parallel if they don't have any physical write-read or write-write dependencies. But what we really want to ask is whether the high-level observable state of the program is commutative: if it's the same if we run A then B as if we had run B then A. For a specific object, consider a game that handles collision detection using an octree. The fact that a monster is moving in my backyard shouldn't affect collision detection in my kitchen.

Various data structures would want different ways of determining commutativity that are less conflict-prone than low-level read-write dependencies, but the design space has barely been explored. For example, in the monster case, we want to be able to run a lot of read-write move_monster() operation in parallel with a bunch of read-only check_collision() operations, and as they retire, ensure that the return-value result of the check_collision() operations were unchanged by the move_monster() operations, instead of asking whether they read and wrote the same memory. Most games actually use a version of this idea to hide network latency, by speculatively predicting local player movement, and correcting and rerunning if wrong.

Because of TSX being limited to tracking read-write dependencies on cache lines, and any broader solution being extremely complex, I think we'll only see TSX used for small transactions. My advice to CPU architects is to treat TSX as a fancy way of enabling DCAS supporting a bit of logic and control flow. You'd use it only for small, low-level, library-like operations like adding an element to a double-linked list in a thread-safe way.

Conjecture

I believe that bigger transactions will rely on a software implementation, which uses new transaction-friendly libraries replacing the std container classes, and gives programmers the ability to implement data structures that specify, in code, whether transactions commute. Ideally such a library would provide:

- container data types (notionally immutable)
- future<t>
- var<t> for atomic and transaction-safe mutable data
- transactions based on isolated speculative execution of code that can be committed or aborted
- an effects system for tracking read/write/exception/io/user dependencies, sequencing, and commutativity
- concurrent garbage collection

I've been experimenting with this approach in my spare time for the past few years. The constraints each aspect composes turn out to be fairly complementary, but the overhead is daunting. If we could scale Fortnite from 100 players in a map to 1000 players, we'd be happy to pay a 4x overhead and run the simulation on a 40-core server. But I'm pretty far from "just 4x".

-Tim

On Mon, Nov 25, 2019 at 8:09 PM Hans Boehm <boehm_at_[hidden]<mailto:boehm_at_[hidden]>> wrote:
Tim, Herb -

As you may have seen, SG5 generated an initial proposal for a simplified transactional memory facility. See wg21.link/p1875. This was reviewed by SG1. See http://wiki.edg.com/bin/view/Wg21belfast/P1875r0<https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwiki.edg.com%2Fbin%2Fview%2FWg21belfast%2FP1875r0&data=02%7C01%7Chsutter%40microsoft.com%7C1ab6053bc2c84d3f3d0d08d772232d48%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637103367726062298&sdata=DN95Zkq6NZR1sQB2v2bOqkpV6Qqajf0%2F9R0Fhteq1a0%3D&reserved=0>, or my short summary below. We'd be very interested in any additional reactions you may have.

One additional concern, brought up in later SG5 discussions, is that we're not terribly clear on the perception/status of various hardware transactional memory (HTM) facilities, in light of the fact that current best effort HTM facilities seem ideally suited for side-channel attacks that need to inspect cache state. Given that this facility is becoming more focused on exposing hardware support, this seems relevant. None of us have a good idea where hardware vendors are going with this.

Thanks!

Hans

My summary of SG1 reaction to proposal [with some comments from subsequent SG5 discussion] :
-------------------------------------------------------

Generally people seemed positively inclined. People liked the idea of cutting back on the original TS. There weren't really any requests to put specific pieces back.

Many people would like a mechanism that allows relatively low-level portable access to HTM facilities. That generally seemed to be the highest priority, and is consistent with what we've heard before.

Along the same lines, there was apprehension about not being able to get feedback as to whether a transaction was actually executing in HTM. Several people wanted a way to explicitly observe retries. I think especially the HPC community needs something like that to diagnose the performance problems that otherwise arise from a single global lock fallback.

There was concern that the RAII-based syntax allowed semi-nonsensical constructs that would be hard to handle in the implementation. What happens if the transaction_scope object is allocated somewhere other than on the stack? How would you implement that? Even in cases in which it is owned by a stack object, and thus does kind-of make sense, it seems problematic.

We did end up taking a quick poll about the two other options in SG1, and it was 8 to 3 in favor of language syntax, with a significant number of people not voting. I have little confidence that this would go the same way in other groups. I may ask around to get some impression of how this might go in EWG, which actually has more of a voice here. [ SG5 is happy with going the syntax route, justified by varargs issues with the lambda approach, and SG1 concerns about the RAII approach. ]

There was some interesting discussion of co_await in transactions, and "coroutines" in general. (C++ "coroutines" are very customizable, and may not behave much like actual coroutines. So this question is probably a bit more important and subtle than it seems at first glance.) We need to think about this when drafting wording. I currently don't have good intuitions about this. [ SG5 consensus afterwards was to initially not require support for anything coroutine-related in transactions. The current plan is to allow implementations to support anything they want in transactions anyway. ]

The abuse of constexpr to define a minimal set of transaction-safe library facilities was viewed with suspicion. It was pointed out that we would probably need to list some exceptions in both directions. (Which I think is OK, since I view this as an opportunistic specification mechanism to avoid specifying transaction-safety for every standard library routine.) I did not hear an actual better alternative.

Received on 2019-11-26 14:57:03