sg5: Re: [SG5] Oct 20 minutes

From: Hans Boehm <boehm_at_[hidden]>
Date: Wed, 21 Oct 2020 16:18:40 -0700

To be concrete, if we want single-word reads ("word" defined as anything
that can be read atomically) to be as cheap as possible in an
HTM-with-SGL-as-fallback setting, I think the following should work:

1) HTM transactions are implemented as usual, with no additional overhead.
2) Fallback SGL transactions increment a version counter at the beginning
and the end.
3) Single word reads read the version counter before and after, making sure
they're the same and even.

I think this has the following properties:
1) It doesn't work for general read-only transactions, since an HTM
transaction occurring between two reads isn't detected.
2) It's correct for single-word reads.
3) It's implementable with local information: All fallbacks to the SGL
would unconditionally increment a single global counter twice. If we detect
that a transaction consists of a single word read, we use this
implementation instead of the HTM one.
4) The only runtime costs relative to what Michael Scott had hoped for are:
- The fallback is slightly more expensive, since we need to increment the
counter twice (while holding SGL, hence nonatomically). On non-x86
architectures, there may be fences in the fallback.
- The single-word reads take three loads instead of one. Unless HTM fails,
two of those should pretty much always be L1 cache hits. On non-x86,
ordering constraints may be required. At least on x86, I think adjacent
single-word load transactions can be combined to share the version reading
overhead.
5) Thus in the absence of fallback, and on x86, this should only be a
little less efficient than what Michael originally wanted, with a bit more,
but still local and non-heroic, optimization effort.

I think in this single-read case, memory management is basically the
programmer's problem either way, right?

I'm still inclined to agree that we want to allow allocation in
transactions. But I think the restrictive proposal is actually not that
much less useful than we originally thought; it just requires a bit more
implementation work to make it useful in the way Michael wanted.

Hans

On Wed, Oct 21, 2020 at 1:49 PM Michael Spear <mfs409_at_[hidden]> wrote:

> Hi Hans,
>
> I don't know if anyone has explored (1). There's work on (2) that Michael
> and I were involved in [EUROPAR 10] and also in the gcc compiler (one of
> their 'gl' implementations). The difficult part (as always, it seems) is
> coordinating speculative reads with the possibility of concurrent memory
> reclamation. It's not insurmountable, but it leads to sub-optimal
> performance.
>
> - Mike
>
> On Wed, Oct 21, 2020 at 4:36 PM Hans Boehm <boehm_at_[hidden]> wrote:
>
>> It also occurred to me later that the single lock fallback was going to
>> be a problem. Unfortunate.
>>
>> I think that if we add a version counter:
>>
>> 1) To the HTM fallback only, we can get cheaper scalable single reads.
>> 2) Always, we can get cheap scalable read-only transactions.
>>
>> What's the experience with those implementation strategies? I actually
>> kind of like those implementation directions here, in that they give
>> transactions a large performance advantage for some kinds of code, without
>> requiring HTM, and they require ambitious compiler analysis only for
>> performance, not correctness. Performance is scalable with compiler effort.
>>
>> Hans
>>
>> On Wed, Oct 21, 2020 at 10:26 AM Michael L. Scott via SG5 <
>> sg5_at_[hidden]> wrote:
>>
>>> On Oct 21, 2020, at 10:01 AM, Michael Spear <mfs409_at_[hidden]> wrote:
>>>
>>> I like this idea. But note that it's going to require a lot of whole
>>> program reasoning.
>>>
>>> I agree that it trivially applies to HTM. But it does not apply to a
>>> single lock fallback, unless the compiler can ensure that no transaction
>>> performs two writes to /data/. Otherwise, the internal state of the
>>> two-write transaction will be visible. I haven't thought about the STM
>>> (especially redo case), but I suspect that it would break too.
>>>
>>>
>>> I think both single-lock and STM are broken :-(
>>>
>>> Consider
>>> T1: atomic { x++; y++; }
>>> and
>>> T2: atomic {s = x;} atomic {t = y;}
>>>
>>> I see no way to preclude the case in which T2 sees the new x and the old
>>> y, with no strictly serializable execution order.
>>>
>>> The use case is when the larger transaction does
>>> atomic do { data += 1; if (foo()) data += 1; else data += 3; }
>>> If data is initially even, then we cannot allow the single read to see
>>> an odd value.
>>>
>>> Also, if a transaction wrote {data1++; data2++}, data 1 always equals
>>> data 2, and another thread performed
>>> atomic do { r1 = data1;} atomic do {r2 = data2;}
>>> We could also see intermediate state. It's not really a different
>>> problem than making sure that STM writeback is atomic before privatization,
>>> but it applies to the single lock fallback as well as STM.
>>>
>>> - Mike
>>>
>>> On Wed, Oct 21, 2020 at 9:49 AM Michael L. Scott via SG5 <
>>> sg5_at_[hidden]> wrote:
>>>
>>>> That’s a good point. It does indeed make me happier. One could
>>>> imagine the compiler similarly recognizing single-word store and RMW
>>>> idioms. Still no way to specify relaxed ordering when you want it, though.
>>>>
>>>> On Oct 21, 2020, at 12:27 AM, Hans Boehm via SG5 <sg5_at_[hidden]>
>>>> wrote:
>>>>
>>>> Following up on our resolution to discuss this between meetings:
>>>>
>>>> In the HTM case, since the hardware actually implements strong
>>>> atomicity, for atomically accessible data, is it correct to implement
>>>>
>>>> atomic do { r = data; }
>>>>
>>>> as essentially
>>>>
>>>> r.load(data, memory_order_seq_cst);
>>>>
>>>> ?
>>>>
>>>> That would very partially address the concern we had during the
>>>> meeting, and maybe make Michael Scott happier. With some implementation
>>>> effort simple transactional loads could be made cheap on x86 and maybe
>>>> cheap enough on ARM. It would unfortunately require a less trivial
>>>> implementation to do so.
>>>>
>>>> Hans
>>>>
>>>> On Tue, Oct 20, 2020 at 6:16 PM Hans Boehm <boehm_at_[hidden]> wrote:
>>>>
>>>>> Attendees: Hans Boehm, Victor Luchangco, Jens Maurer (at the
>>>>> beginning), Michael Scott, Mike Spear, and Michael Wong
>>>>>
>>>>> Initially discussed p2066r4 and vector::size(), and the question of
>>>>> whether it is and should be included in the set of guaranteed
>>>>> transaction-safe functions.
>>>>>
>>>>> We agreed that the current definition of "accessors" is not clear
>>>>> enough. At this point, there also seemed to be close to consensus that we
>>>>> should try to initially keep things as simple as possible, and avoid
>>>>> anything that might look scary to implementors.
>>>>>
>>>>> We then tried to switch back to the discussion of including new.
>>>>>
>>>>> Mike Spear argued that adding new and supporting non-escaping
>>>>> exceptions doesn't really complicate simple implementations. Both
>>>>> single-global-lock and HTM implementations still work.
>>>>>
>>>>> Mike Spear: What can you actually do with the p2066 proposal?
>>>>>
>>>>> Michael Scott: implement lock-free data structures like linked lists
>>>>> that read without a lock, and then transactionally update it at the right
>>>>> place.
>>>>>
>>>>> Mike Spear: No, because the initial reads can't be on atomic data.
>>>>>
>>>>> Can't really use these transactions as an NCAS replacement. All read
>>>>> accesses preceding NCAS would also have to be transactional, which is too
>>>>> expensive for something like Intel HTM. Transactions cost as much as locks,
>>>>> so surrounding every read by one is not practical.
>>>>>
>>>>> Victor L.: Transactional reads could work for tree traversal followed
>>>>> by update, which is probably more common.
>>>>>
>>>>> Clarification that this avoids contention in an HTM context, but still
>>>>> has latency issues even in the uncontended case.
>>>>>
>>>>> Discussion of double-checked locking, with a similar conclusion. It's
>>>>> not practical to put the initial read in a transaction.
>>>>>
>>>>> We unfortunately concluded that we either need more of a usability
>>>>> argument for what we have, or probably a revision of both the design and
>>>>> wording documents to allow new + caught exceptions. If we had the latter,
>>>>> most of the STL data structures could become usable in transactions. Given
>>>>> that roll-your-own data structures don't seem terribly viable, that would
>>>>> give us a much better usability argument.
>>>>>
>>>>> Closed with a strong recommendation to continue this discussion
>>>>> online, and not wait until the next meeting.
>>>>>
>>>>> Next meeting:
>>>>>
>>>>> Nov. 17, 8:00 PST, 11:00 EST
>>>>> Join with Google Meet
>>>>> meet.google.com/sbj-cvgh-vnd
>>>>> Join by phone
>>>>> ‪(US) +1 208-925-0196‬ PIN: ‪255 542‬#
>>>>>
>>>>>
>>>>>
>>>>> --
>>>> SG5 mailing list
>>>> SG5_at_[hidden]
>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg5
>>>>
>>>>
>>>> --
>>>> SG5 mailing list
>>>> SG5_at_[hidden]
>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg5
>>>>
>>>
>>> --
>>> SG5 mailing list
>>> SG5_at_[hidden]
>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg5
>>>
>>

Received on 2020-10-21 18:18:54