Date: Thu, 7 Aug 2025 18:31:03 +0200
> On 7 Aug 2025, at 16:04, Tiago Freire <tmiguelf_at_[hidden]> wrote:
>
> Hi Hans,
Hi,
> I'm very interested in looking on how you are using these functions.
An example of how to use the function “div” I mentioned, which shows that it is easy:
std::pair<uint64_t, uint64_t> div32(uint64_t a1, uint64_t a0, uint64_t b)
{
  // Optionally shift a1, a0, b left to get the high bit of b set,
  // by amount std::countl_zero(b).
  uint32_t as[4], bs[2], qs[4];
  // Put a1, a0 into as, and b into bs, by splitting into uint32_t parts.
  div(as, 4, bs, 2, qs); // Compute quotient (in qs) and remainder (left in as).
  uint64_t q, r;
  // Put as into r and qs into q by merging high and low uint32_t parts.
  // If as and bs were shifted left above, shift r right by the same amount.
  return {q, r};
}
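The splitting, merging and shifting steps left as comments above could be sketched roughly as follows. The helper names (split_u64, merge_u32, shift_left, u128_parts) and the least-significant-limb-first order are assumptions made for this illustration, not part of the "div" interface; the shift amount would come from std::countl_zero(b).

```cpp
#include <array>
#include <cstdint>

// Hypothetical helpers; names and limb order are illustrative assumptions.

// Split a 64-bit value into two 32-bit limbs, least significant first.
constexpr std::array<std::uint32_t, 2> split_u64(std::uint64_t x) {
    return {{static_cast<std::uint32_t>(x), static_cast<std::uint32_t>(x >> 32)}};
}

// Merge two 32-bit limbs (low, high) back into one 64-bit value.
constexpr std::uint64_t merge_u32(std::uint32_t lo, std::uint32_t hi) {
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// A two-word value, most significant word in hi.
struct u128_parts {
    std::uint64_t hi;
    std::uint64_t lo;
};

// Shift the two-word value (hi, lo) left by s bits, 0 <= s < 64.
// Bits shifted out of hi are discarded. s == 0 is handled separately
// because shifting a 64-bit value by 64 is undefined behaviour.
constexpr u128_parts shift_left(std::uint64_t hi, std::uint64_t lo, int s) {
    return s == 0 ? u128_parts{hi, lo}
                  : u128_parts{(hi << s) | (lo >> (64 - s)), lo << s};
}
```

Since shift_left drops the bits that leave the high word, a caller would either need a1 < 2^(64 - s) or an extra limb for the dividend.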
> I'm currently in the process of trying to rewrite the proposal. Your previous suggestion for a fused multiply add seemed sensible and I'm trying to come up with an implementation for it, before making a decision on whether or not to include it.
I am doing some ARM64 assembly. Even though it has fused MADD, that does not seem to work when doing 128-bit multiplication. For that, one has to use UMULH, and it does not seem to integrate well with the lower word addition. (Also see example below.)
So I arrive at these functions:
uint64_t mul(uint64_t a, uint64_t b)
uint64_t mul_add(uint64_t a, uint64_t b, uint64_t c)
std::pair<uint64_t, uint64_t> mul_wide(uint64_t a, uint64_t b)
std::pair<uint64_t, uint64_t> mul_add_wide(uint64_t a, uint64_t b, uint64_t c)
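For illustration, the two wide variants could be written portably along these lines. This is only a sketch: the {low, high} order of the returned pair is an assumption, and a real implementation would more likely map to MUL/UMULH or a compiler intrinsic than to 32-bit partial products.

```cpp
#include <cstdint>
#include <utility>

// Sketch: full 64x64 -> 128-bit product returned as {low, high}.
// The {low, high} ordering is an assumption made for this example.
inline std::pair<std::uint64_t, std::uint64_t>
mul_wide(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t mask = 0xffffffffu;
    const std::uint64_t a0 = a & mask, a1 = a >> 32;
    const std::uint64_t b0 = b & mask, b1 = b >> 32;
    // Four 32x32 -> 64-bit partial products.
    const std::uint64_t p00 = a0 * b0, p01 = a0 * b1;
    const std::uint64_t p10 = a1 * b0, p11 = a1 * b1;
    // Contributions to bits 32..63; at most 3 * (2^32 - 1), so no overflow.
    const std::uint64_t mid = (p00 >> 32) + (p01 & mask) + (p10 & mask);
    const std::uint64_t lo = (mid << 32) | (p00 & mask);
    const std::uint64_t hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
    return {lo, hi};
}

// Sketch: a * b + c as a 128-bit value {low, high}.
inline std::pair<std::uint64_t, std::uint64_t>
mul_add_wide(std::uint64_t a, std::uint64_t b, std::uint64_t c) {
    std::pair<std::uint64_t, std::uint64_t> r = mul_wide(a, b);
    r.first += c;                // wraps modulo 2^64
    r.second += (r.first < c);   // propagate the carry into the high word
    return r;
}
```

The result of mul_add_wide always fits in 128 bits, because (2^64 - 1)^2 + (2^64 - 1) = 2^128 - 2^64.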
> In the last meeting the paper was criticized for how the signed variants, and I'm trying to decide between:
> a) dropping signed variants. They are not important for what I want to do, and dropping them makes the paper much easier to defend.
> b) go all in, stick to the original design, and write a whole lot more justification to defend it.
I am working only with the unsigned ones, so for me it would suffice to drop the signed parts. The signed ones are not guaranteed to wrap modularly on overflow, so they are useless to me. There are also different ways to represent the sign: two's complement, or a separate sign.
So you might drop the signed part for now, and make that a separate proposal if the need arises.
> And seeing how other people are using these may help me take a decision on direction.
I did some testing with inlining by hand to see what effect that has on performance, but now I am switching back to functions that can be replaced with assembly code as needed.
In this example, the functions “mul” and “add” correspond to what is available on ARM64, but could be optimized with a more fine-grained instruction set:
std::pair<uint64_t, uint64_t> mul_add(uint64_t a, uint64_t b, uint64_t c) {
  uint64_t r0, r1;               // Low and high words of the result.
  std::tie(r0, r1) = mul(a, b);  // Full product, low word first.
  bool of;                       // Overflow (carry out of the low word).
  std::tie(r0, of) = add(r0, c);
  r1 += of;
  return {r0, r1};
}
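The "add" used here is assumed to return the wrapped sum together with a carry flag. A minimal portable sketch of such a function, with the signature inferred from the usage above rather than taken from the proposal, might be:

```cpp
#include <cstdint>
#include <utility>

// Sketch of an "add" returning {wrapped sum, carry out}.
// The signature is inferred from the example above; it is an assumption.
inline std::pair<std::uint64_t, bool> add(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t sum = a + b;  // unsigned arithmetic wraps modulo 2^64
    return {sum, sum < a};            // a wrapped sum is smaller than either operand
}
```

The increment of the high word cannot itself overflow: the high word of a 64x64-bit product is at most 2^64 - 2, so adding a carry of 1 still fits.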
Also, the return types you have, like div_result<T>, may be better than the std::pair and std::tuple that I have used for convenience, because the compiler may not be able to optimize away all of their abstraction layers.
Received on 2025-08-07 16:31:24