On Mon, Jan 6, 2020 at 7:01 PM Tim Sweeney <tim.sweeney@epicgames.com> wrote:

My two cents: Drop the constexpr or other wording about what functions are to be transaction safe, and say an implementation must ensure that any concurrent execution of an atomic block with defined behavior must be observably equivalent to taking a global lock.

Today that works for:
- a software global lock
- HTM implementations that fall back to a global lock

In the future, this would also work for more thorough compiler-driven software implementations of transactional memory that abort and take a global lock if any non-transaction-safe library function is called, like I/O, or any extern function is called that may have been compiled without transaction support, such as a OS function or C library.

We've actually discussed this quite a bit. The current intent is to allow that, but not require it, in the next TS, with the hope of getting feedback about whether this is the correct approach. The advantage is clearly that it's nice and simple. The disadvantage is that for an STM or hybrid TM implementation it turns lots of things into subtle non-portable, hard-to-address performance problems. I think there are arguments on both sides.

You're right that if we decide to go this way in the eventual standard, we can drop the constexpr stuff. Which would be nice.

Generally, I think we’ll see TSX style transactions benefit real world apps over the next few years, and I don’t think compiler-level software transactional memory implementations will ever be success, due to the overhead and inability to know what state really is shared and needs to pay the cost of transaction tracking. IMO the sweet spot for unbounded transactions will be C++ libraries with explicit transactional variables and new container classes designed for transaction-friendly, mostly-functional use cases. The current proposal is plenty to achieve these libraries layered on top of bounded hardware transactions in standard C++.

Tim

On Jan 6, 2020, at 7:57 PM, Hans Boehm <boehm@acm.org> wrote:

Apologies for the very late follow-up here.

Thanks for the responses!

I'm not sure we communicated the use of constexpr here properly. At least my view is the problems we are trying to solve here are:

1) Assuming we don't have a mechanism for explicitly informing the compiler that certain functions are callable from transactions, nontrivial software implementations will only work if the compiler can see all the code in a transaction. In a purely HTM environment, or with the trivial global lock implementation, that isn't necessary. But we don't want to preclude software implementations. I expect that precluding software implementations would also provoke opposition from some vendors. We conjecture that the vast majority of constexpr functions will have that property, since otherwise they couldn't be executed at compile time.

2) We conjecture that most constexpr functions will “do only ordinary (non-atomic) run-time memory operations.” (I don't understand the comment about compile-time IO; it seems to me that needs wildly different semantics, since the IO device may literally not exist yet at compile time. Can you clarify?) There will be exceptions.

3) We don't want to have to specify transaction-safety for every single standard library function. But in general, whether a standard library function is defined in a header file or a separate translation unit is an implementation property. We're looking for a solution to get us out of maybe 95% of that business.

4) We realize that constexpr is somewhat of a specification hack here. It's not semantically the right property. We're hoping it's close enough to provide a reasonable default. The text will say "unless otherwise specified", and we fully expect there will be cases in which we will have to explicitly specify. We're more than open to better specification hacks. But I think we need to deal with (1).

I agree that in many cases, it would be really nice if we could detect conflicts at a higher level. That seems like it's still a research issue? It also seems to require at least the kind of hairy nested transaction support that we've been trying to get away from? x++ commutes with x++, but to take advantage of that inside concurrent transactions, they individually need to run transactionally as well, right?

On Tue, Nov 26, 2019 at 12:54 PM Herb Sutter <hsutter@microsoft.com> wrote:

Thanks for the update, Hans and everyone.

It looks like a promising direction and I’m glad to see it’s avoiding mission/feature creep.

Like Tim says, I think library-based prototypes are useful. Later/eventually if we actually standardize something in this area we will likely want atomic { } blocks, which naturally avoid some of the issues with the others (such as making it easy to require stack-based transactions). But no need to snap to that now if it’s not needed for experimentation and validation.

Tim makes excellent points about conflicts and scalability. Though if the tree examples are balanced rb/avl trees as it sounds like, IIUC the conflicts may be actually real.

Re transaction safety of libraries: I think the standard library shouldn’t be a special case, but there should be a simple general answer to whether users can use arbitrary libraries’ features (including std::, but also Boost, Qt, Poco, etc.) inside transactions. I think users are going to assume reasonable all-or-nothing behavior without knowing anything more than whether the callee touches only memory using ordinary (non-atomic<>) operations – any call tree they know does nothing but touching memory ought to Just Work, right? That would be my expectation as a user, and it’s something I can teach. If there’s something missing with that please let me know. But if so, then rather than documenting what things are transaction-safe, wouldn’t it be a more general matter of just knowing/documenting/annotating what functions in a library (including std::) are permitted to have side effects other than ordinary memory operations?

Constexpr (and, even more strongly, consteval) seems like the wrong place to draw a line between transactional/not-transactional. Constexpr is not directly related to memory operations which is the key thing. Plus we’re on a path where eventually nearly everything will be constexpr, including I/O and printf, and we already have multiple implementation examples, including:

Andrew Sutton’s and Wyatt Childers’ implementation of P0707 and the reflection proposals has compiler.print, which is fully consteval but couldn’t be in any TM transaction.
Sean Baxter’s Circle compiler already does compile-time invocation of real printf (it’s the compiler’s very first “I was successfully installed” sanity check), which again is consteval and not TM-able.

ISTM the clear line to define “function is transactional” should be “does only ordinary (non-atomic) run-time memory operations.” (We can then someday later have a separate but surely fun-filled discussion about whether/why/how consteval concurrency/std::thread/TM.)

I haven’t thought about how this interacts with coroutines.

Thanks for the note!

Herb

From: Tim Sweeney <tim.sweeney@epicgames.com>
Sent: Monday, November 25, 2019 7:46 PM
To: boehm@acm.org
Cc: sg5@lists.isocpp.org; Herb Sutter <hsutter@microsoft.com>
Subject: [EXTERNAL] Fwd: TM-lite proposal

Hi Hans,

I like std::tm_exec as a simple solution that can try hardware (if available) and fall back to software (initially with just a global local), leaving room for future software fallback improvements. Given the uncertainty about hardware and even the viability of transactions using current libraries, we should avoid anything more complicated or elaborate than std::tm_exec.

We'll see continued investment in speculative execution issues; CPU architects feel it's their job to solve hard problems of maximizing code throughput and all I've talked to seem undaunted. I've been advocating for speculative execution hiding the latency of atomics and TSX by assuming uncontended success and speculatively executing ahead, backtracking if that assumption proves wrong. If we had that, we could even ask future CPUs to guarantee strong memory ordering, and use speculation to hide the cost in the uncontended case...

However...

1) Current TSX works but it's not 100% clear that it's worth continuing to make available on future CPUs. The throughput cost (~60 clocks) makes it expensive for simple DCAS style usage, and being bound by cache footprint prevents it from being useful for large transactions. Thus what's left is the middle, which is a benefit in some specialized database applications, but we won't see OS kernels redesigned any time soon.

2) Expanding TSX to be unbounded by cache size has been investigated and is possible, but it's not clear that actual applications will be able to utilize it due to the rate of transaction conflicts, both real and false.

3) The problem of false conflicts is significant in real-world code. For example, consider an ordered list data structure implemented as a tree. Logically, if two transactions each add a distinct value to the ordered list, that shouldn't cause read-write dependency. But physically, many implementations would update parent nodes, causing false dependencies that force the transactions to be serialized.

The problem of false conflicts isn't specific to TSX or even to hardware transactions, but would be a flaw in any C++ language-level solution that tracked read-write dependencies in memory. Ultimately, these read-write tracking implementations say that computations A and B can run in parallel if they don't have any physical write-read or write-write dependencies. But what we really want to ask is whether the high-level observable state of the program is commutative: if it's the same if we run A then B as if we had run B then A. For a specific object, consider a game that handles collision detection using an octree. The fact that a monster is moving in my backyard shouldn't affect collision detection in my kitchen.

Various data structures would want different ways of determining commutativity that are less conflict-prone than low-level read-write dependencies, but the design space has barely been explored. For example, in the monster case, we want to be able to run a lot of read-write move_monster() operation in parallel with a bunch of read-only check_collision() operations, and as they retire, ensure that the return-value result of the check_collision() operations were unchanged by the move_monster() operations, instead of asking whether they read and wrote the same memory. Most games actually use a version of this idea to hide network latency, by speculatively predicting local player movement, and correcting and rerunning if wrong.

Because of TSX being limited to tracking read-write dependencies on cache lines, and any broader solution being extremely complex, I think we'll only see TSX used for small transactions. My advice to CPU architects is to treat TSX as a fancy way of enabling DCAS supporting a bit of logic and control flow. You'd use it only for small, low-level, library-like operations like adding an element to a double-linked list in a thread-safe way.

Conjecture

I believe that bigger transactions will rely on a software implementation, which uses new transaction-friendly libraries replacing the std container classes, and gives programmers the ability to implement data structures that specify, in code, whether transactions commute. Ideally such a library would provide:

- container data types (notionally immutable)

- future<t>

- var<t> for atomic and transaction-safe mutable data

- transactions based on isolated speculative execution of code that can be committed or aborted

- an effects system for tracking read/write/exception/io/user dependencies, sequencing, and commutativity

- concurrent garbage collection

I've been experimenting with this approach in my spare time for the past few years. The constraints each aspect composes turn out to be fairly complementary, but the overhead is daunting. If we could scale Fortnite from 100 players in a map to 1000 players, we'd be happy to pay a 4x overhead and run the simulation on a 40-core server. But I'm pretty far from "just 4x".

-Tim

On Mon, Nov 25, 2019 at 8:09 PM Hans Boehm <boehm@acm.org> wrote:

Tim, Herb -

As you may have seen, SG5 generated an initial proposal for a simplified transactional memory facility. See wg21.link/p1875. This was reviewed by SG1. See http://wiki.edg.com/bin/view/Wg21belfast/P1875r0, or my short summary below. We'd be very interested in any additional reactions you may have.

One additional concern, brought up in later SG5 discussions, is that we're not terribly clear on the perception/status of various hardware transactional memory (HTM) facilities, in light of the fact that current best effort HTM facilities seem ideally suited for side-channel attacks that need to inspect cache state. Given that this facility is becoming more focused on exposing hardware support, this seems relevant. None of us have a good idea where hardware vendors are going with this.

Thanks!

Hans

My summary of SG1 reaction to proposal [with some comments from subsequent SG5 discussion] :

-------------------------------------------------------

Generally people seemed positively inclined. People liked the idea of cutting back on the original TS. There weren't really any requests to put specific pieces back.

Many people would like a mechanism that allows relatively low-level portable access to HTM facilities. That generally seemed to be the highest priority, and is consistent with what we've heard before.

Along the same lines, there was apprehension about not being able to get feedback as to whether a transaction was actually executing in HTM. Several people wanted a way to explicitly observe retries. I think especially the HPC community needs something like that to diagnose the performance problems that otherwise arise from a single global lock fallback.

There was concern that the RAII-based syntax allowed semi-nonsensical constructs that would be hard to handle in the implementation. What happens if the transaction_scope object is allocated somewhere other than on the stack? How would you implement that? Even in cases in which it is owned by a stack object, and thus does kind-of make sense, it seems problematic.

We did end up taking a quick poll about the two other options in SG1, and it was 8 to 3 in favor of language syntax, with a significant number of people not voting. I have little confidence that this would go the same way in other groups. I may ask around to get some impression of how this might go in EWG, which actually has more of a voice here. [ SG5 is happy with going the syntax route, justified by varargs issues with the lambda approach, and SG1 concerns about the RAII approach. ]

There was some interesting discussion of co_await in transactions, and "coroutines" in general. (C++ "coroutines" are very customizable, and may not behave much like actual coroutines. So this question is probably a bit more important and subtle than it seems at first glance.) We need to think about this when drafting wording. I currently don't have good intuitions about this. [ SG5 consensus afterwards was to initially not require support for anything coroutine-related in transactions. The current plan is to allow implementations to support anything they want in transactions anyway. ]

The abuse of constexpr to define a minimal set of transaction-safe library facilities was viewed with suspicion. It was pointed out that we would probably need to list some exceptions in both directions. (Which I think is OK, since I view this as an opportunistic specification mechanism to avoid specifying transaction-safety for every standard library routine.) I did not hear an actual better alternative.