Date: Wed, 8 May 2024 16:05:17 -0600 (MDT)
On Thu, 9 May 2024, Andrey Semashev via Std-Discussion wrote:
> On 5/8/24 19:39, Thiago Macieira via Std-Discussion wrote:
>> On Tuesday 7 May 2024 22:19:57 GMT-7 Nate Eldredge via Std-Discussion wrote:
>>>
>>> However, many architectures
>>> have load-link/store-conditional instructions (e.g. ARM64 with LDRX/STRX)
>>> and could implement this more efficiently by executing f() in between the
>>> LL and SC. So in effect, fetch_update would expose the LL/SC
>>> functionality for use at the level of C++. It could also be implemented
>>> with transactional memory where available.
>>
>> That's interesting, but f needs to be extremely simple, otherwise that
>> machinery can't be used and the store-conditional will always fail. On some
>> architectures (and my LL/SC is very rusty), you cannot place function calls
>> between the two, so f must be always inlined.
>
> Another example might be no memory access instructions within the LL/SC
> pair. Even if they are formally allowed, page faults may get in the way.
> Which would make the whole concept extremely unreliable, as the compiler
> normally doesn't have those kind of restrictions and e.g. is free to
> generate loads from the function object state or bound variables. With
> optimizations disabled, the compiler will also generate loads and stores
> on the stack for every variable.
So I guess those cases would fall back to compare-exchange also.
I fully agree that this is only going to be beneficial in practice if f is
"extremely simple", even if we cannot precisely define what that means in
general, and that a smart programmer would only use it in that way. But I
think it could at least still behave correctly when used in a non-smart
way, even if not with optimal performance, by falling back to
compare-exchange, either at compile time or run time.
> I did consider implementing this concept for Boost.Atomic, but
> eventually decided that even if I did, the LL/SC instructions would not
> be the right backend for it. It would have to be either a CAS loop or
> something like x86 RTM. We know RTM is basically dead now, and I'm not
> sure there are equivalents on other popular architectures,
ARMv9 has defined a Transactional Memory Extension, see
https://developer.arm.com/documentation/ddi0617/latest. I don't know if
there are any actual hardware implementations as of yet, or if there are
likely to be.
> so we're
> basically left with a CAS loop implementation only. This doesn't offer
> performance benefits, likely the opposite (because with a proper CAS
> loop you can optimize away an extra atomic load), so I dropped the idea.
I suppose there could be an option for the user to pass in a recently
loaded value, if one is available, to save the initial load of the CAS
loop. An LL/SC implementation could ignore it.
Thiago also makes a good point that there is still value in having a
library function to perform a CAS loop, since in some cases there are
machine-specific optimizations that could be inserted.
> On 5/8/24 19:39, Thiago Macieira via Std-Discussion wrote:
>> On Tuesday 7 May 2024 22:19:57 GMT-7 Nate Eldredge via Std-Discussion wrote:
>>>
>>> However, many architectures
>>> have load-link/store-conditional instructions (e.g. ARM64 with LDRX/STRX)
>>> and could implement this more efficiently by executing f() in between the
>>> LL and SC. So in effect, fetch_update would expose the LL/SC
>>> functionality for use at the level of C++. It could also be implemented
>>> with transactional memory where available.
>>
>> That's interesting, but f needs to be extremely simple, otherwise that
>> machinery can't be used and the store-conditional will always fail. On some
>> architectures (and my LL/SC is very rusty), you cannot place function calls
>> between the two, so f must be always inlined.
>
> Another example might be no memory access instructions within the LL/SC
> pair. Even if they are formally allowed, page faults may get in the way.
> Which would make the whole concept extremely unreliable, as the compiler
> normally doesn't have those kind of restrictions and e.g. is free to
> generate loads from the function object state or bound variables. With
> optimizations disabled, the compiler will also generate loads and stores
> on the stack for every variable.
So I guess those cases would fall back to compare-exchange also.
I fully agree that this is only going to be beneficial in practice if f is
"extremely simple", even if we cannot precisely define what that means in
general, and that a smart programmer would only use it in that way. But I
think it could at least still behave correctly when used in a non-smart
way, even if not with optimal performance, by falling back to
compare-exchange, either at compile time or run time.
> I did consider implementing this concept for Boost.Atomic, but
> eventually decided that even if I did, the LL/SC instructions would not
> be the right backend for it. It would have to be either a CAS loop or
> something like x86 RTM. We know RTM is basically dead now, and I'm not
> sure there are equivalents on other popular architectures,
ARMv9 has defined a Transactional Memory Extension, see
https://developer.arm.com/documentation/ddi0617/latest. I don't know if
there are any actual hardware implementations as of yet, or if there are
likely to be.
> so we're
> basically left with a CAS loop implementation only. This doesn't offer
> performance benefits, likely the opposite (because with a proper CAS
> loop you can optimize away an extra atomic load), so I dropped the idea.
I suppose there could be an option for the user to pass in a recently
loaded value, if one is available, to save the initial load of the CAS
loop. An LL/SC implementation could ignore it.
Thiago also makes a good point that there is still value in having a
library function to perform a CAS loop, since in some cases there are
machine-specific optimizations that could be inserted.
-- Nate Eldredge nate_at_[hidden]
Received on 2024-05-08 22:05:22