PxxxxR0
Bit permutations

New Proposal,

Author:
Audience:
SG18, LEWG
Project:
ISO/IEC 14882 Programming Languages — C++, ISO/IEC JTC1/SC22/WG21
Source:
eisenwave/cpp-proposals

Abstract

This proposal adds bit permutation functions to the bit manipulation library.

1. Introduction

I propose to expand the bit manipulation library with the following functions:

// all constraints are exposition-only
template<unsigned_integral T>
constexpr T reverse_bits(T x) noexcept;

Reverses the bits of x so that the least significant bit becomes the most significant.

template<unsigned_integral T>
constexpr T compress_bitsr(T x, T m) noexcept;

For each one-bit in m, the corresponding bit in x is taken and packed contiguously into the result, starting with the least significant result bit.

template<unsigned_integral T>
constexpr T compress_bitsl(T x, T m) noexcept;

For each one-bit in m, the corresponding bit in x is taken and packed contiguously into the result, starting with the most significant result bit.

template<unsigned_integral T>
constexpr T expand_bitsr(T x, T m) noexcept;

For each one-bit in m, a bit from x, starting with the least significant bit is taken and shifted into the corresponding position of the m bit.

template<unsigned_integral T>
constexpr T expand_bitsl(T x, T m) noexcept;

For each one-bit in m, a bit from x, starting with the most significant bit is taken and shifted into the corresponding position of the m bit.

2. Motivation and scope

Bit-reversal, compression, and expansion are fundamental operations that meet multiple criteria which make them suitable for standardization:

  1. They are common and useful operations.

  2. They can be used to implement numerous other operations.

  3. At least on some architectures, they have direct hardware support.

  4. They are non-trivial to implement efficiently in software.

  5. For known masks, numerous optimization opportunities are available.

2.1. Applications

2.1.1. Applications of reverse_bits

Bit-reversal is a common operation with uses in:

2.1.2. Applications of compress_bits and expand_bits

Compression and expansion are also common, with uses in:

A GitHub code search for /(_pdep_u|_pext_u)(32|64)/ AND language:c++ reveals ~1300 files which use the intrinsic wrappers for the x86 instructions.

2.2. Motivating examples

2.2.1. Interleaving bits with expand_bits

A common use case for expansion is interleaving bits:

// interleave the bits of two 8-bit integers x and y
expand_bitsr(x, 0b10101010) | expand_bitsr(y, 0b01010101)

This translates Cartesian coordinates to the index on a Z-order curve. Space filling curves are a popular technique in compression.

2.2.2. UTF-8 decoding with compress_bitsr

The operations can also be used in various I/O-related applications.

For example, compress_bits is particularly useful for variable-length encodings, where data bits are interrupted by bits which signal continuation of the data. UTF-8 is a typical example:

uint_least32_t x = load32_little_endian(utf8_data_pointer);
switch (countl_one(uint8_t(x))) {
case 0: return compress_bitsr(x, 0b01111111);
case 1: /* error */;
case 2: return compress_bitsr(x, 0b00111111'00011111);
case 3: return compress_bitsr(x, 0b00111111'00111111'00001111);
case 4: return compress_bitsr(x, 0b00111111'00111111'00111111'00000111);
}

2.2.3. Other operations based on compress_bits and expand_bits

Many operations can be built on top of compress_bits and expand_bits. However, direct hardware support is often needed for these operations to be an efficient choice.

// x & 0xf
expand_bitsr(x, 0xf)
compress_bitsr(x, 0xf)

// (x & 0xf) << 4
expand_bitsr(x, 0xf0)
// (x >> 4) & 0xf
compress_bitsr(x, 0xf0)

// Clear the least significant one-bit of x.
x ^= expand_bitsr(1, x)
// Clear the nth least significant one-bit of x.
x ^= expand_bitsr(1 << n, x)
// Clear the n least significant one-bits of x.
x ^= expand_bitsr((1 << n) - 1, x)

// (x >> n) & 1
compress_bitsr(x, 1 << n)
// Get the least significant bit of x.
compress_bitsr(x, x) & 1
// Get the nth least significant bit of x.
(compress_bitsr(x, x) >> n) & 1

// popcount(x)
countr_one(compress_bitsr(-1u, x))
countr_one(compress_bitsr(x, x))

2.3. Hardware support

Operation x86_64 ARM RISC-V
reverse_bits (BSWAP) RBITSVE2 (vrgather)V
compress_bitsr PEXTBMI2 BEXTSVE2 (vcompressV)
expand_bitsr PDEPBMI2 BDEPSVE2 (viota+vrgatherV)
compress_bitsl (PEXTBMI2+POPCNTABM) BGRPSVE2, (BEXTSVE2+CNTSVE) (vcompress)V
expand_bitsl (PDEPBMI2+POPCNTABM) (BDEPSVE2+CNTSVE) (viota+vrgather)V

(Parenthesized) entries signal that the instruction does not directly implement the function, but greatly assists in its implementation.

Note: The RISC-V instructions are all vector instructions. It is possible to turn each bit into a vector element, perform the wanted operation, and compress back into a bit-vector.

2.3.1. Support for reverse_bits

This operation is directly implemented in ARM through RBIT.

Any architecture with support for byteswap (such as x86 with BSWAP) also supports bit-reversal in part. [Warren1] presents an O(log n) algorithm which operates by swapping lower and upper N / 2, ..., 16, 8, 4, 2, and 1 bits in parallel. Byte-swapping implements these individual swaps up to 8 bits, requiring only three more parallel swaps in software:

// assuming a byte is an octet of bits, and assuming the width of x is a power of two
x = byteswap(x);
x = (x & 0x0F0F0F0F) <<  4 | (x & 0xF0F0F0F0) >>  4; // ... quartets of bits
x = (x & 0x33333333) <<  2 | (x & 0xCCCCCCCC) >>  2; // ... pairs of bits
x = (x & 0x55555555) <<  1 | (x & 0xAAAAAAAA) >>  1; // ... individual bits

It is worth noting that clang provides a cross-platform family of intrinsics. __builtin_bitreverse uses byte-swapping or bit-reversal instructions if possible.

Such an intrinsic has been requested from GCC users a number of times in [GNU1].

2.3.2. Support for compress_bits and expand_bits

Starting with Haswell, Intel CPUs directly implement compression and expansion with with PDEP and PEXT. Intel x86_64 CPUs, as well AMD CPUs starting with Zen 3 implement PDEP and PEXT with 3 cycles latency. Zen 2 and older implement PDEP/PEXT in microcode, with 18 cycles latency.

ARM also supports these operations directly with BDEP, BEXT, and BGRP in the SVE2 instruction set. [Warren1] mentions other architectures with direct support.

Overall, only recent instruction set extensions offer this functionality directly. However, when the mask is a constant, many different strategies for hardware acceleration open up. For example

As [Warren1] explains, the cost of computing compress_bits and expand_bits in software is dramatically lower for a constant mask. For specific known masks (such as a mask with a single one-bit), the cost is extremely low.

All in all, there are multiple factors that strongly suggest a standard library implementation:

  1. The strategy for computing compress_bits and expand_bits depends greatly on the architecture and on information about the mask, even if the exact mask isn’t known.

    • TZCNT, CLMUL (see [Zp7]), and POPCNT are helpful.

  2. ISO C++ does not offer a mechanism through which all of this information can be utilized. Namely, it is not possible to change strategy based on information that only becomes available during optimization passes. Compiler extensions such as __builtin_constant_p offer a workaround.

  3. ISO C++ does not offer a mechanism through which function implementations can be chosen based on the surrounding context. In a situation where multiple compress_bits calls with the same mask m are performed, it is significantly faster to pre-compute information based on the mask once, and utilize it in subsequent calls. The same technique can be used to accelerate integer division for multiple divisons with the same divisor.

Bullets 2. and 3. suggest that compress_bits and expand_bits benefit from being implemented directly in the compiler via intrinsic, even if hardware does not directly implement these operations.

3. Impact on existing code

This proposal is purely a standard library expansion. No existing code is affected.

4. Design considerations

The design choices in this paper are based on [P0553R4], wherever applicable.

4.1. Why the names compress_bitsx and expand_bitsx?

There are multiple synonymous sets of terminology:

  1. deposit and extract

  2. compress and expand

  3. gather and scatter

I have decided against deposit and extract because of its ambiguity:

Taking the input 0b10101 and densely packing it to 0b111 could be described as:

Extract each second bit from 0b10101 and densely deposit it into the result.

Similarly, taking the input 0b111 and expanding it into 0b10101 could be described as:

Extract each bit from 0b111 and sparsely deposit it in the result.

Both operations can be described with extract and deposit terminology, making it virtually useless for keeping the operations apart. gather and scatter are simply the least common way to describe these operations, which makes compress and expand the best candidates.

The use of compress and expand is consistent with the mask-based permutations for std::simd proposed in [P2664R6].

The abbreviations l and r for left/right are consistent with the design choices in [P0553R4].

4.2. Further generalization

4.2.1. No generalized compress_bits and expand_bits

[N3864] originally suggested much more general versions of compression and expansion, which support:

  1. performing the operation not just on the whole operand, but on "words" of it, in parallel

  2. performing the operation not just on bits, but on arbitrarily sized groups of bits

I don’t propose this generality for the following reasons:

  1. The utility functions in <bit> are not meant to provide a full bitwise manipulation library, but fundamental operations, especially those that can be accelerated in hardware while still having reasonable software fallbacks.

  2. These more general form can be built on top of the proposed hardware-oriented versions. This can be done with relative ease and with little to no overhead.

  3. The generality falsely suggests hardware support for all forms, despite the function only being accelerated for specific inputs. This makes the performance characteristics unpredictable.

  4. The proposed functions have wide contracts and can be noexcept (Lakos rule). Adding additional parameters would likely require a narrow contract.

  5. Generality adds complexity to the standardization process, to implementation, and from the perspective of language users. It is unclear whether this added complexity is worth it in this case.

4.2.2. No generalized reverse_bits

Bit reversal can also be generalized to work with any group size:

template <typename T>
T reverse_bits(T x, int group_size = 1) noexcept(false);

With this generalization, byteswap(x) on conventional platforms is equivalent to reverse_bits(x, 8).

However, this general version is much less used, not as consistently supported in hardware, and has a narrow contract. group_size must be a nonzero factor of x for this operation to be meaningful.

Therefore, a generalized bit-reversal is not proposed in this paper.

4.3. Why does the signature require two same Ts?

Initially, I went through a number of different signatures.

template<unsigned_integral T, unsigned_integral X>
constexpr T compress_bitsr(X x, T m) noexcept;

This signature is quite clever because the result never has more bits than the mask m. However, it is surprising that the mask plays such a significant role here.

Furthermore, I’ve realized that while the result never has more bits than m, compress_bitsl must still deposit bits starting with the most significant bits of the result. This suggests the following:

template<unsigned_integral T, unsigned_integral X>
constexpr common_type_t<T, X> compress_bitsr(X x, T m) noexcept;

However, it is not trivial to juggle bits between the left and right versions of compress_bits and expand_bits. The behavior is also not intuitive when a zero-extension occurs. For wider x, the mask is always zero-extended to the left, which makes the left and right versions slightly asymmetrical.

Since this proposal includes low-level bit operations, it is reasonable and safe to require the user to be explicit. A call to compress_bits or expand_bits with two different types is likely a design flaw or bug. Therefore, I have settled on the very simple signature:

template<unsigned_integral T>
constexpr T compress_bitsr(T x, T m) noexcept;

5. Possible implementation

The following implementations with linear time complexity are merely meant to show that an implementation is feasible. They demonstrate how a naive implementation with no hardware support may look.

5.1. reverse_bits

template<unsigned_integral T>
constexpr T reverse_bits(T x) noexcept
{
    T result = 0;
    for (int i = 0; i < numeric_limits<T>::digits; ++i) {
        result <<= 1;
        result |= x & 1;
        x >>= 1;
    }
    return result;
}

5.2. compress_bits

template<unsigned_integral T>
constexpr T compress_bitsr(T m) noexcept
{
    T result = 0;
    for (int i = 0, j = 0; i != numeric_limits<T>::digits; ++i) {
        if ((m >> i) & 1) {
            result |= ((x >> i) & 1) << j++;
        }
    }
    return result;
}

It is more likely that we have a direct hardware support for compress_bitsr. The following implementation demonstrates how to delegate compress_bitsl.

template<unsigned_integral T>
constexpr T compress_bitsl(T x, T m) noexcept
{
#ifdef __FAST_REVERSE_BITS__
    return reverse_bits(compress_bitsr(reverse_bits(x)), reverse_bits(m)));
#else
    if (m == 0) { // Prevents shift which is >= the operand size.
        return 0;
    }
    int shift = numeric_limits<T>::digits - popcount(m);
    return compress_bitsr(x, m) << shift;
#endif
}

5.3. expand_bits

template<unsigned_integral T>
constexpr T expand_bitsr(T x, T m) noexcept
{
    T result = 0;
    for (int i = 0, j = 0; i != numeric_limits<T>::digits; ++i) {
        if ((m >> i) & 1) {
            result |= ((x >> j++) & 1) << i;
        }
    }
    return result;
}

template<unsigned_integral T>
constexpr T expand_bitsl(T x, T m) noexcept
{
#ifdef __FAST_REVERSE_BITS__
    return reverse_bits(expand_bitsl(reverse_bits(x)), reverse_bits(m)));
#else
    if (m == 0) {
        return 0;
    }
    int shift = numeric_limits<T>::digits - popcount(m);
    return expand_bitsr(x >> shift, m);
#endif
}

5.4. Contemporary implementations

[Warren1] presents less naive algorithms:

[Zp7] offers fast software implementations for PEXT and PDEP, optimized for x86_64.

[StackOverflow1] contains discussion of various possible software implementations of compress_bitsr and expand_bitsr.

6. Proposed wording

The proposed changes are relative to the working draft of the standard as of [N4917].

In subclause 22.15 [bit], add the following subclause:

Permutation [bit.permute]

1 In the following descriptions, let N denote numeric_limits<T>::digits. Let xn denote the n-th least significant bit of x, so that x equals

template<class T>
  constexpr T reverse_bits(T x) noexcept;

2 Constraints: T is an unsigned integer type ([basic.fundamental]).

3 Returns:
[Note: reverse_bits(reverse_bits(x)) equals x. — end note]

template<class T>
  constexpr T compress_bitsr(T x, T m) noexcept;

4 Constraints: T is an unsigned integer type ([basic.fundamental]).

5 Returns:

template<class T>
  constexpr T expand_bitsr(T x, T m) noexcept;

6 Constraints: T is an unsigned integer type ([basic.fundamental]).

7 Returns:
[Note: If x & ~m equals zero, then expand_bitsr(compress_bitsr(x, m), m) equals x. If x >> popcount(m) equals zero, then compress_bitsr(expand_bitsr(x, m), m) equals x. — end note]

template<class T>
  constexpr T compress_bitsl(T x, T m) noexcept;

8 Constraints: T is an unsigned integer type ([basic.fundamental]).

9 Returns: reverse_bits(compress_bitsr(reverse_bits(x), reverse_bits(m))).

template<class T>
  constexpr T expand_bitsl(T x, T m) noexcept;

10 Constraints: T is an unsigned integer type ([basic.fundamental]).

11 Returns: reverse_bits(expand_bitsr(reverse_bits(x), reverse_bits(m))).

Note: I would have preferred a less mathematical approach to defining these functions. However, it is too difficult to precisely define compress_bits and expand_bits without visual aids, pseudo-code, or other crutches.

7. Acknowledgements

I greatly appreciate the assistance of Stack Overflow users in assisting me with research for this proposal. I especially thank Peter Cordes for his tireless and selfess dedication to sharing knowledge.

I also thank various Discord users from Together C & C++ and #include<C++> who have reviewed drafts of this proposal and shared their thoughts.

References

Normative References

[N4917]
Thomas Köppe. Working Draft, Standard for Programming Language C++. 5 September 2022. URL: https://wg21.link/n4917

Informative References

[ARM1]
Arm Developer Community. Introduction to SVE2, Issue 02, Revision 02. URL: https://developer.arm.com/-/media/Arm%20Developer%20Community/PDF/102340_0001_02_en_introduction-to-sve2.pdf?revision=b208e56b-6569-4ae2-b6f3-cd7d5d1ecac3
[ChessProgramming1]
VA. chessprogramming.org/BMI2, Applications. URL: https://www.chessprogramming.org/BMI2#Applications
[GNU1]
Marc Glisse et al.. Bug 50481 - builtin to reverse the bit order. URL: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50481
[N3864]
Matthew Fioravante. A constexpr bitwise operations library for C++. URL: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3864.html
[P0553R4]
Jens Maurer. Bit operations. URL: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0553r4.html
[P2664R6]
Daniel Towner; Ruslan Arutyunyan. Extend std::simd with permutation API. URL: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p2664r6.html#permute_by_mask
[StackOverflow1]
Jan Schultke et al.. What is a fast fallback algorithm which emulates PDEP and PEXT in software?. URL: https://stackoverflow.com/q/77834169/5740428
[Warren1]
Henry S. Warren, Jr.. Hacker's Delight, 2nd Edition.
[Zp7]
Zach Wegner. Zach's Peppy Parallel-Prefix-Popcountin' PEXT/PDEP Polyfill. URL: https://github.com/zwegner/zp7