ISOCPP Std Discussion List: Re: Float exceptions and std::atomic

From: Myria <myriachan_at_[hidden]>
Date: Thu, 24 Aug 2023 00:01:20 -0700

On Wed, Aug 23, 2023 at 23:20 Thiago Macieira via Std-Discussion <
std-discussion_at_[hidden]> wrote:

> On Wednesday, 23 August 2023 22:27:48 PDT Myria via Std-Discussion wrote:
>
> > LDMXCSR has a 7-cycle latency, but this whole sequence’s timing is
> > dominated by the 100+-cycle lock cmpxchg.
>
> CMPXCHG can be much faster than 100 cycles, depending on just how close
> that cacheline is to the CPU executing the instruction. If it hasn't lost
> the cacheline since the last write operation, it could be as low as 14
> cycles.
>
…snip…

>
> Don't forget that the moves between register files (the MOVD instructions)
> are also non-trivial and cost 3 cycles each. So there's an argument that
> locking a mutex and performing this addition a single time can be less
> expensive than iterating multiple times, because you're paying the same
> cost for CMPXCHG, but you're definitely doing a single addition and no
> register file movements.

I doubt it, but it’s possible. If the two are close in performance, lock
cmpxchg is probably still the better choice, even if slightly slower,
because it can also be used in shared memory.
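
For concreteness, the two approaches being compared can be sketched as
follows. This is only an illustration: add_with_cas, add_with_mutex and
LockedFloat are hypothetical names, not anything from the linked code, and
the memory orderings are an assumption.

```cpp
#include <atomic>
#include <mutex>

// Hypothetical helper: floating-point add via a compare-exchange loop.
// Returns the previous value, like fetch_add does.
inline float add_with_cas(std::atomic<float>& target, float operand) {
    float expected = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads `expected`, so the addition
    // is recomputed on every retry.
    while (!target.compare_exchange_weak(expected, expected + operand,
                                         std::memory_order_seq_cst,
                                         std::memory_order_relaxed)) {
    }
    return expected;
}

// Hypothetical alternative: exactly one addition, performed under a lock.
struct LockedFloat {
    std::mutex guard;
    float value = 0.0f;
};

inline float add_with_mutex(LockedFloat& target, float operand) {
    std::lock_guard<std::mutex> lock(target.guard);
    float previous = target.value;
    target.value = previous + operand;
    return previous;
}
```

The loop version may redo the addition several times under contention; the
mutex version adds once but needs a lock object that every participant can
use.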

> Here's what compilers generate: https://gcc.godbolt.org/z/Pc53W87Ps
>
> Looks like state-of-the-art still has room for improvement. All four
> compilers decided to load the value that STMXCSR saved to the stack into a
> register, even though they had to re-save it to the stack again in order
> to LDMXCSR at the end. A case where the register allocation was too
> aggressive.
>
> I was also scratching my head at why GCC decided to skip the register file
> operation and instead go through memory. My guess is that it's trying to
> optimise port usage: the addition and the register file move could contend
> for port 0. The old ICC did the same, so I think it's on purpose, not a
> glitch.
>
> And WTF is MSVC doing?
>

MSVC seems to think that changing MXCSR trashes all vector registers or
something. xmm6-xmm15 are nonvolatile in the Windows x86-64 ABI (registers
16-31 are volatile, as are the bits above 128 of registers 6-15). Maybe
it’s some ABI rule that I don’t know about.

> Anyway, what this shows is that those who want to work on environments
> with unmasked exceptions can achieve it. So why should everyone else pay
> the price?
>

True, but then it’s something the Standard should reconsider. None of the
implementations comply with the Standard here.

But changing it would also be a little philosophically inconsistent with
how atomic<int> and atomic<T*> are allowed to overflow without undefined
behavior.
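
As a small illustration of that integer guarantee: the Standard specifies
that atomic arithmetic on signed integers behaves as if performed on the
corresponding unsigned type, with no undefined results. The function name
below is hypothetical.

```cpp
#include <atomic>
#include <climits>

// Returns the value stored after incrementing an atomic<int> holding INT_MAX.
inline int wrapped_after_increment() {
    std::atomic<int> counter{INT_MAX};
    // Well-defined, unlike INT_MAX + 1 on a plain int.
    counter.fetch_add(1);
    // Wraps around to INT_MIN on two's complement (mandated since C++20).
    return counter.load();
}
```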

I for one would love to throw FP environment stuff and FP exceptions into
the trash. So much code out there just breaks if you’re not masking
everything and using default rounding…

Melissa

Received on 2023-08-24 07:01:33