To be concrete, if we want single-word reads ("word" defined as anything that can be read atomically) to be as cheap as possible in an HTM-with-SGL-as-fallback setting, I think the following should work:

1) HTM transactions are implemented as usual, with no additional overhead.
2) Fallback SGL transactions increment a version counter at the beginning and the end.
3) Single word reads read the version counter before and after, making sure they're the same and even.

I think this has the following properties:
1) It doesn't work for general read-only transactions, since an HTM transaction occurring between two reads isn't detected.
2) It's correct for single-word reads.
3) It's implementable with local information: All fallbacks to the SGL would unconditionally increment a single global counter twice. If we detect that a transaction consists of a single word read, we use this implementation instead of the HTM one.
4) The only runtime costs relative to what Michael Scott had hoped for are:
 - The fallback is slightly more expensive, since we need to increment the counter twice (while holding SGL, hence nonatomically). On non-x86 architectures, there may be fences in the fallback.
 - The single-word reads take three loads instead of one. Unless HTM fails, two of those should pretty much always be L1 cache hits. On non-x86, ordering constraints may be required. At least on x86, I think adjacent single-word load transactions can be combined to share the version reading overhead.
5) Thus in the absence of fallback, and on x86, this should only be a little less efficient than what Michael originally wanted, with a bit more, but still local and non-heroic, optimization effort.

I think in this single-read case, memory management is basically the programmer's problem either way, right?

I'm still inclined to agree that we want to allow allocation in transactions. But I think the restrictive proposal is actually not that much less useful than we originally thought; it just requires a bit more implementation work to make it useful in the way Michael wanted.


On Wed, Oct 21, 2020 at 1:49 PM Michael Spear <> wrote:
Hi Hans,

I don't know if anyone has explored (1).  There's work on (2) that Michael and I were involved in [EUROPAR 10] and also in the gcc compiler (one of their 'gl' implementations).  The difficult part (as always, it seems) is coordinating speculative reads with the possibility of concurrent memory reclamation.  It's not insurmountable, but it leads to sub-optimal performance.

- Mike

On Wed, Oct 21, 2020 at 4:36 PM Hans Boehm <> wrote:
It also occurred to me later that the single lock fallback was going to be a problem. Unfortunate.

I think that if we add a version counter:

1) To the HTM fallback only, we can get cheaper scalable single reads.
2) Always, we can get cheap scalable read-only transactions.

What's the experience with those implementation strategies? I actually kind of like those implementation directions here, in that they give transactions a large performance advantage for some kinds of code, without requiring HTM, and they require ambitious compiler analysis only for performance, not correctness. Performance is scalable with compiler effort.


On Wed, Oct 21, 2020 at 10:26 AM Michael L. Scott via SG5 <> wrote:
On Oct 21, 2020, at 10:01 AM, Michael Spear <> wrote:

I like this idea.  But note that it's going to require a lot of whole program reasoning.

I agree that it trivially applies to HTM.  But it does not apply to a single lock fallback, unless the compiler can ensure that no transaction performs two writes to /data/.  Otherwise, the internal state of the two-write transaction will be visible.  I haven't thought about the STM (especially redo case), but I suspect that it would break too.

I think both single-lock and STM are broken :-(

  T1: atomic { x++; y++; }
  T2: atomic {s = x;}  atomic {t = y;}

I see no way to preclude the case in which T2 sees the new x and the old y, with no strictly serializable execution order.

The use case is when the larger transaction does 
atomic do { data += 1; if (foo()) data += 1; else data += 3; }
If data is initially even, then we cannot allow the single read to see an odd value.

Also, if a transaction wrote {data1++; data2++}, data 1 always equals data 2, and another thread performed 
atomic do { r1 = data1;} atomic do {r2 = data2;}
We could also see intermediate state.  It's not really a different problem than making sure that STM writeback is atomic before privatization, but it applies to the single lock fallback as well as STM.

- Mike

On Wed, Oct 21, 2020 at 9:49 AM Michael L. Scott via SG5 <> wrote:
That’s a good point.  It does indeed make me happier.  One could imagine the compiler similarly recognizing single-word store and RMW idioms.  Still no way to specify relaxed ordering when you want it, though.

On Oct 21, 2020, at 12:27 AM, Hans Boehm via SG5 <> wrote:

Following up on our resolution to discuss this between meetings:

In the HTM case, since the hardware actually implements strong atomicity, for atomically accessible data, is it correct to implement

atomic do { r = data; }

as essentially 

r.load(data, memory_order_seq_cst);


That would very partially address the concern we had during the meeting, and maybe make Michael Scott happier. With some implementation effort simple transactional loads could be made cheap on x86 and maybe cheap enough on ARM. It would unfortunately require a less trivial implementation to do so.


On Tue, Oct 20, 2020 at 6:16 PM Hans Boehm <> wrote:
Attendees: Hans Boehm, Victor Luchangco, Jens Maurer (at the beginning), Michael Scott, Mike Spear, and Michael Wong

Initially discussed p2066r4 and vector::size(), and the question of whether it is and should be included in the set of guaranteed transaction-safe functions.

We agreed that the current definition of "accessors" is not clear enough. At this point, there also seemed to be close to consensus that we should try to initially keep things as simple as possible, and avoid anything that might look scary to implementors.

We then tried to switch back to the discussion of including new.

Mike Spear argued that adding new and supporting non-escaping exceptions doesn't really complicate simple implementations. Both single-global-lock and HTM implementations still work.

Mike Spear: What can you actually do with the p2066 proposal?

Michael Scott: implement lock-free data structures like linked lists that read without a lock, and then transactionally update it at the right place.

Mike Spear: No, because the initial reads can't be on atomic data.

Can't really use these transactions as an NCAS replacement. All read accesses preceding NCAS would also have to be transactional, which is too expensive for something like Intel HTM. Transactions cost as much as locks, so surrounding every read by one is not practical.

Victor L.: Transactional reads could work for tree traversal followed by update, which is probably more common.

Clarification that this avoids contention in an HTM context, but still has latency issues even in the uncontended case.

Discussion of double-checked locking, with a similar conclusion. It's not practical to put the initial read in a transaction.

We unfortunately concluded that we either need more of a usability argument for what we have, or probably a revision of both the design and wording documents to allow new + caught exceptions. If we had the latter, most of the STL data structures could become usable in transactions. Given that roll-your-own data structures don't seem terribly viable, that would give us a much better usability argument.

Closed with a strong recommendation to continue this discussion online, and not wait until the next meeting.

Next meeting:

Nov. 17, 8:00 PST, 11:00 EST
Join with Google Meet
Join by phone
‪(US) +1 208-925-0196‬ PIN: ‪255 542‬#

SG5 mailing list

SG5 mailing list

SG5 mailing list