Date: Thu, 12 Jun 2025 11:29:38 +0000
Hi Hans,
The suggestion that div_wide should have 2 words for the result (+ remainder) instead of the current 1 because of the potential overflow, is something that has come up before.
But there are two reasons not to do this:
1. Transforming 2 words into 1 word is the intended behavior. It's exactly what you want to use in a "reduction" algorithm (the process intentionally reduces the number of words at play).
Packing the high bits (in order to try and improve the estimate and reduce the number of required passes) is technic that I have seen, and there's the temptation to pack the 2 dividend words right from the start, but that is not what you are supposed to do.
Let us say for example that you have
uint64_t dividend_hi = x;
uint64_t dividend_low = y;
uint64_t divisor = y;
and you can not guarantee that dividend_hi < divisor
then the algorithm that should be used is this:
auto result_hi = div_wide(0, dividend_hi, divisor);
auto result_low = div_wide(result_hi.remainder, dividend_low, divisor);
where result_hi.quotient and result_low.quotient are the 2 words that your are looking for.
i.e. you split it into 2 divisions where the first 1 has the high dividend bits set to 0, and then you use the remainder of the first as the high bits of the second.
And you can chain any number of dividend words using this process.
2. CPU's don't have such an instruction. An equivalent div_wide with only 1 word quotient exists on x86, but a 2 word quotient doesn't exist on any platform, and I assume that this is the case because the designers of the CPU had this reducing division algorithm in mind.
So, in order to support it, an implementer would always need to write such function in terms of other operations that exists.
And if that's the case, why couldn't it just be a library? Why does it have to be a feature that the standard should support?
I just don't want to defend that position in front of the committee.
A portable "div_wide" is not something I can write myself, and therefore I need the help of the standard to do that. But a 2 word quotient division is something that I would probably write using the single word quotient "div_wide" if one was available.
But I'm curious on how you would optimize this code using CPU traps? What would be the assembly here?
Best regards,
Tiago
-----Original Message-----
From: Hans Åberg <haberg_1_at_icloud.com>
Sent: Thursday, June 12, 2025 11:16 AM
To: Tiago Freire <tmiguelf_at_[hidden]>
Cc: std-proposals_at_lists.isocpp.org
Subject: Re: Integer overflow arithmetic with exceptions
I made a class basic_natural<Word> for multiprecision unsigned integers that includes multiprecision division:
It would be good to have overflow reported in the function that divides two words with one, as that may happen in multiprecision division with more than one word. When dividing with just one word, as in the example of your proposal, it is not needed.
A method to divide with more than one word is to shift left so that the top word of the divisor has the high bit set, and one can then show that the division of the two top words of the (also shifted left) dividend overshoots the correct value with at most 2. But it means that overflow can and will occur.
The optimization I mentioned earlier, to have the lowest word always present, I have implemented via a class nonempty_vector that can be used instead of std::vector:
template <class A>
class nonempty_vector {
A a0_ = A();
std::vector<A> as_;
…
};
It may be of independent interest, though I do not know any other use case.
Currently, one can switch between using this class and std::vector, but I will probably remove the latter at some point.
Possibly one can optimize further if there is access to CPU traps.
The suggestion that div_wide should have 2 words for the result (+ remainder) instead of the current 1 because of the potential overflow, is something that has come up before.
But there are two reasons not to do this:
1. Transforming 2 words into 1 word is the intended behavior. It's exactly what you want to use in a "reduction" algorithm (the process intentionally reduces the number of words at play).
Packing the high bits (in order to try and improve the estimate and reduce the number of required passes) is technic that I have seen, and there's the temptation to pack the 2 dividend words right from the start, but that is not what you are supposed to do.
Let us say for example that you have
uint64_t dividend_hi = x;
uint64_t dividend_low = y;
uint64_t divisor = y;
and you can not guarantee that dividend_hi < divisor
then the algorithm that should be used is this:
auto result_hi = div_wide(0, dividend_hi, divisor);
auto result_low = div_wide(result_hi.remainder, dividend_low, divisor);
where result_hi.quotient and result_low.quotient are the 2 words that your are looking for.
i.e. you split it into 2 divisions where the first 1 has the high dividend bits set to 0, and then you use the remainder of the first as the high bits of the second.
And you can chain any number of dividend words using this process.
2. CPU's don't have such an instruction. An equivalent div_wide with only 1 word quotient exists on x86, but a 2 word quotient doesn't exist on any platform, and I assume that this is the case because the designers of the CPU had this reducing division algorithm in mind.
So, in order to support it, an implementer would always need to write such function in terms of other operations that exists.
And if that's the case, why couldn't it just be a library? Why does it have to be a feature that the standard should support?
I just don't want to defend that position in front of the committee.
A portable "div_wide" is not something I can write myself, and therefore I need the help of the standard to do that. But a 2 word quotient division is something that I would probably write using the single word quotient "div_wide" if one was available.
But I'm curious on how you would optimize this code using CPU traps? What would be the assembly here?
Best regards,
Tiago
-----Original Message-----
From: Hans Åberg <haberg_1_at_icloud.com>
Sent: Thursday, June 12, 2025 11:16 AM
To: Tiago Freire <tmiguelf_at_[hidden]>
Cc: std-proposals_at_lists.isocpp.org
Subject: Re: Integer overflow arithmetic with exceptions
I made a class basic_natural<Word> for multiprecision unsigned integers that includes multiprecision division:
It would be good to have overflow reported in the function that divides two words with one, as that may happen in multiprecision division with more than one word. When dividing with just one word, as in the example of your proposal, it is not needed.
A method to divide with more than one word is to shift left so that the top word of the divisor has the high bit set, and one can then show that the division of the two top words of the (also shifted left) dividend overshoots the correct value with at most 2. But it means that overflow can and will occur.
The optimization I mentioned earlier, to have the lowest word always present, I have implemented via a class nonempty_vector that can be used instead of std::vector:
template <class A>
class nonempty_vector {
A a0_ = A();
std::vector<A> as_;
…
};
It may be of independent interest, though I do not know any other use case.
Currently, one can switch between using this class and std::vector, but I will probably remove the latter at some point.
Possibly one can optimize further if there is access to CPU traps.
Received on 2025-06-12 11:29:45