1. Introduction
I propose to expand the bit manipulation library with the following functions:
// all constraints are exposition-only
template<unsigned_integral T>
  constexpr T reverse_bits(T x) noexcept;
Reverses the bits of x so that the least significant bit becomes the most significant.
template<unsigned_integral T>
  constexpr T compress_bitsr(T x, T m) noexcept;
For each one-bit in m, the corresponding bit in x is taken and packed
contiguously into the result, starting with the least significant result bit.
template<unsigned_integral T>
  constexpr T compress_bitsl(T x, T m) noexcept;
For each one-bit in m, the corresponding bit in x is taken and packed
contiguously into the result, starting with the most significant result bit.
template<unsigned_integral T>
  constexpr T expand_bitsr(T x, T m) noexcept;
For each one-bit in m, a bit from x, starting with the least significant bit,
is taken and shifted into the corresponding position of the m bit.
template<unsigned_integral T>
  constexpr T expand_bitsl(T x, T m) noexcept;
For each one-bit in m, a bit from x, starting with the most significant bit,
is taken and shifted into the corresponding position of the m bit.
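For illustration only (not part of the proposed wording), the following checks show the five functions applied to 8-bit operands; they assume an implementation of the proposed constexpr functions is available:

#include <cstdint>

// Illustration: x = 0b1011'0110, m = 0b0111'0010 (one-bits of m at positions 1, 4, 5, 6).
static_assert(reverse_bits<std::uint8_t>(0b1011'0110) == 0b0110'1101);
static_assert(compress_bitsr<std::uint8_t>(0b1011'0110, 0b0111'0010) == 0b0000'0111);
static_assert(compress_bitsl<std::uint8_t>(0b1011'0110, 0b0111'0010) == 0b0111'0000);
static_assert(expand_bitsr<std::uint8_t>(0b1011'0110, 0b0111'0010) == 0b0011'0000);
static_assert(expand_bitsl<std::uint8_t>(0b1011'0110, 0b0111'0010) == 0b0101'0010);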
2. Motivation and scope
Bit-reversal, compression, and expansion are fundamental operations that meet multiple criteria which make them suitable for standardization:
- They are common and useful operations.
- They can be used to implement numerous other operations.
- At least on some architectures, they have direct hardware support.
- They are non-trivial to implement efficiently in software.
- For known masks, numerous optimization opportunities are available.
2.1. Applications
2.1.1. Applications of reverse_bits
Bit-reversal is a common operation with uses in:
- Cryptography: scrambling bits
- Networking: as part of cyclic redundancy check computation
- Graphics: mirroring of images with one bit per pixel
- Random number generation: reversal can counteract low entropy of low-order bits such as in the case of linear congruential generators
- Digital signal processing: for radix-2 Cooley-Tukey FFT algorithms
- Code obfuscation: security by obscurity
2.1.2. Applications of compress_bits and expand_bits
Compression and expansion are also common, with uses in:
- Space-filling curves: Morton/Z-Order and Hilbert curves
- Input/output: especially for variable-length encodings, such as UTF-8 (example below)
- Chess engines: for bitboards; see [ChessProgramming1]
- Genomics: according to [ARM1]
A GitHub code search for the x86 PDEP/PEXT intrinsics reveals ~1300 files which use the intrinsic wrappers for these instructions.
2.2. Motivating examples
2.2.1. Interleaving bits with expand_bits
A common use case for expansion is interleaving bits:
// interleave the bits of two 8-bit integers x and y
expand_bitsr(x, 0b10101010) | expand_bitsr(y, 0b01010101)
This translates Cartesian coordinates to an index on the Z-order curve. Space-filling curves are a popular technique in compression.
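As a sketch of this Cartesian-to-Morton translation (illustration only; the helper name morton2d is hypothetical and assumes the proposed expand_bitsr is available):

#include <cstdint>

// Interleave two 16-bit coordinates into a 32-bit Morton (Z-order) index.
// Bits of x land at the odd positions and bits of y at the even positions,
// matching the masks used in the snippet above.
constexpr std::uint32_t morton2d(std::uint16_t x, std::uint16_t y) noexcept
{
    return expand_bitsr(std::uint32_t(x), std::uint32_t(0xAAAA'AAAAu))
         | expand_bitsr(std::uint32_t(y), std::uint32_t(0x5555'5555u));
}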
2.2.2. UTF-8 decoding with compress_bitsr
The operations can also be used in various I/O-related applications.
For example, compress_bitsr is particularly useful for variable-length
encodings, where data bits are interrupted by bits which signal continuation of the data.
UTF-8 is a typical example:
uint_least32_t x = load32_little_endian(utf8_data_pointer);
switch (countl_one(uint8_t(x))) {
case 0: return compress_bitsr(x, 0b01111111);
case 1: /* error */;
case 2: return compress_bitsr(x, 0b00111111'00011111);
case 3: return compress_bitsr(x, 0b00111111'00111111'00001111);
case 4: return compress_bitsr(x, 0b00111111'00111111'00111111'00000111);
}
2.2.3. Other operations based on compress_bits and expand_bits
Many operations can be built on top of compress_bits and expand_bits.
However, direct hardware support is often needed for these operations to be an efficient choice.
// x & 0xf
expand_bitsr(x, 0xf)
compress_bitsr(x, 0xf)
// (x & 0xf) << 4
expand_bitsr(x, 0xf0)
// (x >> 4) & 0xf
compress_bitsr(x, 0xf0)

// Clear the least significant one-bit of x.
x ^= expand_bitsr(1, x)
// Clear the nth least significant one-bit of x.
x ^= expand_bitsr(1 << n, x)
// Clear the n least significant one-bits of x.
x ^= expand_bitsr((1 << n) - 1, x)

// (x >> n) & 1
compress_bitsr(x, 1 << n)
// Test whether x is nonzero.
compress_bitsr(x, x) & 1
// Test whether x has more than n one-bits.
(compress_bitsr(x, x) >> n) & 1

// popcount(x)
countr_one(compress_bitsr(-1u, x))
countr_one(compress_bitsr(x, x))
2.3. Hardware support
Operation | x86_64 | ARM | RISC-V |
---|---|---|---|
reverse_bits | (BSWAP) | RBIT SVE2 | (vbrev.v) V |
compress_bitsr | PEXT BMI2 | BEXT SVE2 | (vcompress.vm) V |
expand_bitsr | PDEP BMI2 | BDEP SVE2 | (viota.m + vrgather.vv) V |
compress_bitsl | (PEXT BMI2 + POPCNT ABM) | BGRP SVE2, (BEXT SVE2 + RBIT SVE) | (vcompress.vm) V |
expand_bitsl | (PDEP BMI2 + POPCNT ABM) | (BDEP SVE2 + RBIT SVE) | (viota.m + vrgather.vv) V |
(Parenthesized) entries signal that the instruction does not directly implement the function, but greatly assists in its implementation.
Note: The RISC-V instructions are all vector instructions. It is possible to turn each bit into a vector element, perform the wanted operation, and compress back into a bit-vector.
2.3.1. Support for reverse_bits
This operation is directly implemented in ARM through RBIT.
Any architecture with support for byteswap (such as x86 with BSWAP)
also supports bit-reversal in part. [Warren1] presents an O(log n) algorithm which operates by swapping lower and upper
N/2, ..., 8, 4, 2, and 1 bits in parallel.
Byte-swapping implements these individual swaps up to 8 bits, requiring only three more
parallel swaps in software:
// assuming a byte is an octet of bits, and assuming the width of x is a power of two
x = byteswap(x);
x = (x & 0x0F0F0F0F) << 4 | (x & 0xF0F0F0F0) >> 4; // ... quartets of bits
x = (x & 0x33333333) << 2 | (x & 0xCCCCCCCC) >> 2; // ... pairs of bits
x = (x & 0x55555555) << 1 | (x & 0xAAAAAAAA) >> 1; // ... individual bits
It is worth noting that clang provides a cross-platform family of __builtin_bitreverse intrinsics, which use byte-swapping or bit-reversal instructions if possible.
Such an intrinsic has been requested by GCC users a number of times in [GNU1].
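A minimal sketch (illustration only, not proposed wording) of how an implementation could dispatch to that intrinsic family; reverse_bits_fallback stands in for the portable algorithm shown above and is a hypothetical name:

#include <cstdint>

inline std::uint32_t reverse_bits32(std::uint32_t x) noexcept
{
#if defined(__clang__)
    // Clang lowers this builtin to RBIT on ARM, or to an equivalent instruction sequence elsewhere.
    return __builtin_bitreverse32(x);
#else
    return reverse_bits_fallback(x); // hypothetical portable fallback
#endif
}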
2.3.2. Support for compress_bits and expand_bits
Starting with Haswell, Intel CPUs directly implement compression and expansion with PDEP and PEXT. Intel x86_64 CPUs, as well as AMD CPUs starting with Zen 3, implement PDEP and PEXT with 3 cycles of latency. Zen 2 and older implement PDEP/PEXT in microcode, with 18 cycles of latency.
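For illustration, a possible mapping of the 64-bit right variants onto these instructions through the Intel intrinsics (the wrapper names are hypothetical, and a real implementation would also need a non-BMI2 fallback):

#include <immintrin.h> // _pext_u64, _pdep_u64; requires a BMI2 target (e.g. -mbmi2)
#include <cstdint>

inline std::uint64_t compress_bitsr_u64(std::uint64_t x, std::uint64_t m) noexcept
{
    return _pext_u64(x, m); // extract the bits of x selected by m, packed toward the LSB
}

inline std::uint64_t expand_bitsr_u64(std::uint64_t x, std::uint64_t m) noexcept
{
    return _pdep_u64(x, m); // deposit the low bits of x into the one-bit positions of m
}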
ARM also supports these operations directly with BDEP, BEXT, and BGRP in the SVE2 instruction set. [Warren1] mentions other architectures with direct support.
Overall, only recent instruction set extensions offer this functionality directly. However, when the mask is a constant, many different strategies for hardware acceleration open up. For example:
- interleaving bits can be assisted (though not fully implemented) using ARM ZIP1/ZIP2
- other permutations can be assisted by ARM TBL and TBX
As [Warren1] explains, the cost of computing compress_bits and expand_bits in software is
dramatically lower for a constant mask.
For specific known masks (such as a mask with a single one-bit), the cost is extremely low.
All in all, there are multiple factors that strongly suggest a standard library implementation:
1. The strategy for computing compress_bits and expand_bits depends greatly on the architecture and on information about the mask, even if the exact mask isn't known.
2. TZCNT, CLMUL (see [Zp7]), and POPCNT are helpful.
3. ISO C++ does not offer a mechanism through which all of this information can be utilized. Namely, it is not possible to change strategy based on information that only becomes available during optimization passes. Compiler extensions such as __builtin_constant_p offer a workaround.
4. ISO C++ does not offer a mechanism through which function implementations can be chosen based on the surrounding context. In a situation where multiple compress_bits calls with the same mask m are performed, it is significantly faster to pre-compute information based on the mask once and utilize it in subsequent calls (see the sketch after this list). The same technique can be used to accelerate integer division for multiple divisions with the same divisor.
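A minimal sketch of the precomputation idea from bullet 4, assuming the proposed compress_bitsr is in scope; the type name and its members are hypothetical, and a real implementation (such as [Zp7]) would precompute far more than a popcount:

#include <bit>     // std::popcount
#include <cstdint>

struct masked_compressor
{
    std::uint64_t mask;
    int ones; // popcount(mask), computed once and reused across calls

    explicit constexpr masked_compressor(std::uint64_t m) noexcept
        : mask(m), ones(std::popcount(m)) {}

    constexpr std::uint64_t right(std::uint64_t x) const noexcept
    {
        return compress_bitsr(x, mask);
    }

    constexpr std::uint64_t left(std::uint64_t x) const noexcept
    {
        // Reuse the cached popcount; guard the empty mask to avoid a shift by the operand width.
        return ones == 0 ? 0 : right(x) << (64 - ones);
    }
};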
Bullets 2. and 3. suggest that compress_bits and expand_bits benefit from being
implemented directly in the compiler via an intrinsic,
even if hardware does not directly implement these operations.
3. Impact on existing code
This proposal is purely a standard library expansion. No existing code is affected.
4. Design considerations
The design choices in this paper are based on [P0553R4], wherever applicable.
4.1. Why the names compress_bitsx and expand_bitsx?
There are multiple synonymous sets of terminology:
- deposit and extract
- compress and expand
- gather and scatter
I have decided against deposit and extract because of their ambiguity:
Taking the input 0b10101 and densely packing it to 0b111 could be described as:
Extract each second bit from 0b10101 and densely deposit it into the result.
Similarly, taking the input 0b111 and expanding it into 0b10101 could be described as:
Extract each bit from 0b111 and sparsely deposit it in the result.
Both operations can be described with extract and deposit terminology,
making it virtually useless for keeping the operations apart.
gather and scatter are simply the least common way to describe these operations, which makes
compress and expand the best candidates.
The use of compress and expand is consistent with the mask-based permutations for std::simd proposed in [P2664R6].
The abbreviations l and r for left/right are consistent with the design choices in [P0553R4].
4.2. Further generalization
4.2.1. No generalized compress_bits and expand_bits
[N3864] originally suggested much more general versions of compression and expansion, which support:
- performing the operation not just on the whole operand, but on "words" of it, in parallel
- performing the operation not just on bits, but on arbitrarily sized groups of bits
I don’t propose this generality for the following reasons:
- The utility functions in <bit> are not meant to provide a full bitwise manipulation library, but fundamental operations, especially those that can be accelerated in hardware while still having reasonable software fallbacks.
- These more general forms can be built on top of the proposed hardware-oriented versions. This can be done with relative ease and with little to no overhead.
- The generality falsely suggests hardware support for all forms, despite the function only being accelerated for specific inputs. This makes the performance characteristics unpredictable.
- The proposed functions have wide contracts and can be noexcept (Lakos rule). Adding additional parameters would likely require a narrow contract.
- Generality adds complexity to the standardization process, to implementations, and to language users. It is unclear whether this added complexity is worth it in this case.
4.2.2. No generalized reverse_bits
Bit reversal can also be generalized to work with any group size:
template<typename T>
  T reverse_bits(T x, int group_size = 1) noexcept(false);
With this generalization, reverse_bits(x, 8) on conventional platforms is equivalent to byteswap(x).
However, this general version is much less used, not as consistently supported in
hardware, and has a narrow contract: group_size must be a nonzero factor of numeric_limits<T>::digits
for this operation to be meaningful.
Therefore, a generalized bit-reversal is not proposed in this paper.
4.3. Why does the signature require two parameters of the same type T?
Initially, I went through a number of different signatures.
template<unsigned_integral T, unsigned_integral X>
  constexpr T compress_bitsr(X x, T m) noexcept;
This signature is quite clever because the result never has more bits than the mask m.
However, it is surprising that the mask plays such a significant role here.
Furthermore, I have realized that while the result never has more bits than m,
compress_bitsl must still deposit bits starting with the most significant bits of the result.
This suggests the following:
template<unsigned_integral T, unsigned_integral X>
  constexpr common_type_t<T, X> compress_bitsr(X x, T m) noexcept;
However, it is not trivial to juggle bits between the left and right versions of compress_bits and expand_bits.
The behavior is also not intuitive when a zero-extension occurs.
For a wider x, the mask is always zero-extended to the left, which makes the left and right
versions slightly asymmetrical.
Since this proposal includes low-level bit operations, it is reasonable and safe to require
the user to be explicit.
A call to compress_bits or expand_bits with two different types is likely a design flaw or bug.
Therefore, I have settled on the very simple signature:
template<unsigned_integral T>
  constexpr T compress_bitsr(T x, T m) noexcept;
5. Possible implementation
The following implementations with linear time complexity are merely meant to show that an implementation is feasible. They demonstrate how a naive implementation with no hardware support may look.
5.1. reverse_bits
template<unsigned_integral T>
constexpr T reverse_bits(T x) noexcept
{
    T result = 0;
    for (int i = 0; i < numeric_limits<T>::digits; ++i) {
        result <<= 1;
        result |= x & 1;
        x >>= 1;
    }
    return result;
}
5.2. compress_bits
template<unsigned_integral T>
constexpr T compress_bitsr(T x, T m) noexcept
{
    T result = 0;
    for (int i = 0, j = 0; i != numeric_limits<T>::digits; ++i) {
        if ((m >> i) & 1) {
            result |= ((x >> i) & 1) << j++;
        }
    }
    return result;
}
It is more likely that we have direct hardware support for compress_bitsr.
The following implementation demonstrates how to delegate compress_bitsl to compress_bitsr.
template<unsigned_integral T>
constexpr T compress_bitsl(T x, T m) noexcept
{
#ifdef __FAST_REVERSE_BITS__
    return reverse_bits(compress_bitsr(reverse_bits(x), reverse_bits(m)));
#else
    if (m == 0) { // Prevents a shift which is >= the operand size.
        return 0;
    }
    int shift = numeric_limits<T>::digits - popcount(m);
    return compress_bitsr(x, m) << shift;
#endif
}
5.3. expand_bits
template<unsigned_integral T>
constexpr T expand_bitsr(T x, T m) noexcept
{
    T result = 0;
    for (int i = 0, j = 0; i != numeric_limits<T>::digits; ++i) {
        if ((m >> i) & 1) {
            result |= ((x >> j++) & 1) << i;
        }
    }
    return result;
}

template<unsigned_integral T>
constexpr T expand_bitsl(T x, T m) noexcept
{
#ifdef __FAST_REVERSE_BITS__
    return reverse_bits(expand_bitsr(reverse_bits(x), reverse_bits(m)));
#else
    if (m == 0) { // Prevents a shift which is >= the operand size.
        return 0;
    }
    int shift = numeric_limits<T>::digits - popcount(m);
    return expand_bitsr(x >> shift, m);
#endif
}
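For illustration, a few compile-time checks of the reference implementations above, exercising the round-trip identities that also appear in the notes of the proposed wording:

#include <cstdint>

static_assert(reverse_bits(reverse_bits(std::uint8_t{0b1011'0110})) == 0b1011'0110);
// If x has no bits outside the mask (x & ~m == 0), compressing and re-expanding is the identity.
static_assert(expand_bitsr(compress_bitsr(std::uint8_t{0b0101'0010}, std::uint8_t{0b0111'0010}),
                           std::uint8_t{0b0111'0010}) == 0b0101'0010);
// If x fits into popcount(m) bits (x >> popcount(m) == 0), expanding and re-compressing is the identity.
static_assert(compress_bitsr(expand_bitsr(std::uint8_t{0b0000'1011}, std::uint8_t{0b0111'0010}),
                             std::uint8_t{0b0111'0010}) == 0b0000'1011);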
5.4. Contemporary implementations
[Warren1] presents less naive algorithms:
- An O(log n) reverse_bits
- An O(log² n) compress_bits and expand_bits
[Zp7] offers fast software implementations for PEXT and PDEP, optimized for x86_64.
[StackOverflow1] contains discussion of various possible software implementations
of compress_bits and expand_bits.
6. Proposed wording
The proposed changes are relative to the working draft of the standard as of [N4917].
In subclause 22.15 [bit], add the following subclause:
Permutation [bit.permute]
1 In the following descriptions, let N denote numeric_limits<T>::digits. Let x_n denote the n-th least significant bit of x, so that x equals x_(N-1)·2^(N-1) + ... + x_1·2 + x_0.

template<class T>
  constexpr T reverse_bits(T x) noexcept;

2 Constraints: T is an unsigned integer type ([basic.fundamental]).
3 Returns: A value r such that r_n equals x_(N-1-n) for each n in [0, N).
[Note: reverse_bits(reverse_bits(x)) equals x. — end note]

template<class T>
  constexpr T compress_bitsr(T x, T m) noexcept;

4 Constraints: T is an unsigned integer type ([basic.fundamental]).
5 Returns: A value r such that r_i equals x_(p_i) for each i in [0, popcount(m)), where p_i is the position of the i-th least significant one-bit of m, and every other bit of r is zero.

template<class T>
  constexpr T expand_bitsr(T x, T m) noexcept;

6 Constraints: T is an unsigned integer type ([basic.fundamental]).
7 Returns: A value r such that r_(p_i) equals x_i for each i in [0, popcount(m)), where p_i is the position of the i-th least significant one-bit of m, and every other bit of r is zero.
[Note: If x & ~m equals zero, then expand_bitsr(compress_bitsr(x, m), m) equals x. If x >> popcount(m) equals zero, then compress_bitsr(expand_bitsr(x, m), m) equals x. — end note]

template<class T>
  constexpr T compress_bitsl(T x, T m) noexcept;

8 Constraints: T is an unsigned integer type ([basic.fundamental]).
9 Returns: reverse_bits(compress_bitsr(reverse_bits(x), reverse_bits(m))).

template<class T>
  constexpr T expand_bitsl(T x, T m) noexcept;

10 Constraints: T is an unsigned integer type ([basic.fundamental]).
11 Returns: reverse_bits(expand_bitsr(reverse_bits(x), reverse_bits(m))).
Note: I would have preferred a less mathematical approach to defining these functions.
However, it is too difficult to precisely define compress_bits and expand_bits without
visual aids, pseudo-code, or other crutches.
7. Acknowledgements
I greatly appreciate the help of Stack Overflow users in assisting me with research for this proposal. I especially thank Peter Cordes for his tireless and selfless dedication to sharing knowledge.
I also thank various Discord users from Together C & C++ and #include<C++> who have reviewed drafts of this proposal and shared their thoughts.