Date: Thu, 27 Nov 2025 11:05:45 +0100
> On 27 Nov 2025, at 10:38, David Brown <david.brown_at_[hidden]> wrote:
>
> On 27/11/2025 09:30, Hans Åberg via Std-Proposals wrote:
>>> On 26 Nov 2025, at 16:47, Thiago Macieira via Std-Proposals <std-proposals_at_[hidden]> wrote:
>>>
>>> On Wednesday, 26 November 2025 01:16:27 Pacific Standard Time David Brown via
>>> Std-Proposals wrote:
>>>> (It is also entirely reasonable to suppose that specific
>>>> 128-bit integers might be supported more efficiently than arbitrary
>>>> sized _BitInt types.)
>>>
>>> int128_t should be supported exactly as efficiently as _BitInt(128), no faster
>>> and no slower, because it should *be exactly the same support*. Different
>>> types, but since it's exactly the same size and behaviour, it should be using
>>> the exact same implementation. Like long with either int or long long. Yes,
>>> this is a QoI, but it's an easy fix for a compiler to specifically detect that
>>> the _BitInt has the same size as another integer it already supports, one I
>>> expect all compilers to do and also one I expect users to request if not.
>> There are special ways to optimize for powers of 2 and also the square of a modulus, so it might be prudent to add support for uint<2^2^n>_t for some suitable n, perhaps up to uint4096_t or something. If the type is not available natively, it is possible to make a recursive template implementation that encourages pipelining—I have done so for division, which is the hard part.
>
> Thiago's point (assuming I understand him correctly) - which I agree with - is that in a good implementation of _BitInt, the compiler will use such specialised implementations for _BitInt when the size is a power of 2 (128 in particular). The size of a _BitInt is known at compile time, so the compiler can generate slow, general code for _BitInt(255) and _BitInt(257) and then fast code for _BitInt(256).
>
> I would expect a good implementation to handle _BitInt(N) just like normal N-bit integers when N matches standard integer types. It should handle _BitInt(128) just like __int128, or whatever it might have as a 128-bit extension. It should handle N = 2 ^ m values efficiently. Assuming a chunk size of 64 bits, it should handle N = 64 * m well, because it can skip masking operations. And a /really/ good implementation could have specialisations using SIMD or vector registers for different sizes, or fancy multiplication algorithms for larger sizes - all while keeping smaller sizes simpler. That's all up to the quality of the implementation. The point is, since the size is fixed at compile time, the compiler can figure out the best strategy for the size in question.
Indeed, but if this is a C compatibility type, how do you do that without templates?
>>> Other _BitInt sizes, including smaller ones, could have worse performance.
>>> It's up to the user to determine a size that meets their needs and achieves
>>> the performance they require.
>> Even though C++ bit_int<k> types could specialize to such implementations, I suspect they will typically use a multiple-word implementation, perhaps derived from the C implementation which does not have such templates.
>
> You can try gcc and clang with C support for _BitInt on godbolt today. (gcc only supports x86 and AArch64 targets at the moment - I'm hoping they widen that target support soon.)
>
> <https://godbolt.org/z/59q6j7Phc>
>
> Both handle _BitInt(128) exactly the same as they handle their __int128 types. For larger types, clang generates large amounts of inline code, while gcc farms things off to compiler support library functions. Both have (IMHO) plenty of scope for improvement, but of course it makes sense to see how many people actually use these features before prioritising the work.
Clang's two-word-by-two-word implementation for uint128_t, the function __udivmodti4, uses binary (shift-subtract) division, which is slow.
https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/builtins/udivmodti4.c
I made one that instead uses an optimization of the much faster full-word division algorithm.
So fast library support there is, for now, pretty much a theoretical hope.
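The difference is easy to see in miniature. Below is a sketch of my own (not compiler-rt's code) contrasting the two approaches on small operands: shift-subtract division produces one quotient *bit* per iteration, while a full-word step produces a whole 64-bit quotient limb at once. The word step is written here with the GCC/Clang `unsigned __int128` extension for brevity; in a real implementation it would map to the hardware's 128-by-64 divide instruction.

```cpp
#include <cstdint>
#include <utility>

// Shift-subtract ("binary") division: one quotient bit per iteration,
// the general style used by __udivmodti4 in compiler-rt.
std::pair<std::uint64_t, std::uint64_t>
binary_divmod(std::uint64_t n, std::uint64_t d) {
    std::uint64_t q = 0, r = 0;
    for (int i = 63; i >= 0; --i) {
        r = (r << 1) | ((n >> i) & 1);   // bring down the next dividend bit
        if (r >= d) { r -= d; q |= std::uint64_t{1} << i; }
    }
    return {q, r};
}

// Full-word step: one whole 64-bit quotient limb per divide, the
// building block of Knuth-style multi-word division.  Requires the
// GCC/Clang unsigned __int128 extension and hi < d.
std::uint64_t word_divstep(std::uint64_t hi, std::uint64_t lo,
                           std::uint64_t d, std::uint64_t& rem) {
    unsigned __int128 n = (static_cast<unsigned __int128>(hi) << 64) | lo;
    rem = static_cast<std::uint64_t>(n % d);
    return static_cast<std::uint64_t>(n / d);
}
```

For a 128-bit dividend, the binary loop runs 128 iterations; the word approach needs only two limb steps.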
> I expect the C++ implementations not just to be derived from the C implementations, but identical to the C implementations. For all overlapping features, C _BitInt(N) and C++ bit_int<n> are identical. And for now, that appears to be specialised handling of _BitInt(128) and smaller types, and generic handling of bigger types (sometimes inline, sometimes with library functions).
So the optimized implementations I speak of cannot then be provided by an automatic template method.
> And I would hope that if these compilers ever implemented types like uint256_t, uint512_t, etc., with fast operations (perhaps using SIMD units where available), then they would also use these same faster operations for _BitInt(256), _BitInt(512), etc. Which means that there is no point in implementing types like uint256_t on these compilers - they can just use "typedef unsigned _BitInt(256) uint256_t;". (There might be some benefit in having those types in compilers that don't implement _BitInt(N), or have much more limited range for N.)
With the kind of template recursion I mentioned, one gets an efficient default implementation that can be overridden by specializations where there is hardware support. So one does not have to implement these types one by one.
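A minimal sketch of that recursive scheme, assuming 64-bit base limbs (the names wide_uint and add_with_carry are illustrative, not from any real library or proposal):

```cpp
#include <cstdint>

// A power-of-two-width unsigned integer built from two halves.  Any
// width the hardware supports directly can be overridden by an
// explicit specialisation instead of being implemented one by one.
template<int Bits>
struct wide_uint {
    static_assert(Bits >= 128 && (Bits & (Bits - 1)) == 0,
                  "power-of-two widths only in this sketch");
    wide_uint<Bits / 2> hi, lo;   // 256 -> two 128s -> four 64s ...
};

// Base case / hardware override: a natively supported width.
template<> struct wide_uint<64> { std::uint64_t value; };

// Base-case add with carry-out, via a 128-bit intermediate (GCC/Clang
// extension; a portable version would detect overflow by comparison).
inline bool add_with_carry(wide_uint<64>& out, wide_uint<64> a,
                           wide_uint<64> b, bool carry_in) {
    unsigned __int128 s = static_cast<unsigned __int128>(a.value)
                          + b.value + carry_in;
    out.value = static_cast<std::uint64_t>(s);
    return static_cast<bool>(s >> 64);
}

// Recursive case: the two half-width adds are independent except for
// the carry that links them, which is what lets the halves pipeline.
template<int Bits>
bool add_with_carry(wide_uint<Bits>& out, const wide_uint<Bits>& a,
                    const wide_uint<Bits>& b, bool carry_in) {
    bool c = add_with_carry(out.lo, a.lo, b.lo, carry_in);
    return add_with_carry(out.hi, a.hi, b.hi, c);
}
```

Adding, say, a SIMD-backed wide_uint<256> specialisation later changes nothing else in the hierarchy.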
Received on 2025-11-27 10:06:01
