Count the total number of set bits (1-bits) across an entire array of 32-bit unsigned integers, using the AVX2 shuffle-based nibble lookup technique.
This is the Muła popcount algorithm — a classic example of using _mm256_shuffle_epi8 as a parallel 16-entry lookup table.
#include <immintrin.h>
#include <cstdint>
int64_t population_count(const uint32_t* arr, int n);
Parameters:
arr — pointer to an array of n unsigned 32-bit integers, guaranteed 32-byte alignedn — number of elements, guaranteed to be a multiple of 8 and at least 8Returns: the total number of set bits across all elements (as int64_t)
Input: [7, 0, 15, 1, 3, 255, 0, 8]
Output: 19
Explanation: popcount of each: 3+0+4+1+2+8+0+1 = 19
8 ≤ n ≤ 4,000,000n is always a multiple of 8arr is 32-byte alignedint64_tThe key insight: a nibble (4 bits) has only 16 possible values, so its popcount fits in a 16-entry table. _mm256_shuffle_epi8 acts as a parallel lookup — it uses each byte of the index vector to select from a 16-byte table (within each 128-bit lane).
Split each byte into its low and high nibbles, look up both, and add. Accumulate in 8-bit counters (which overflow after ~31 iterations), then periodically widen to 64-bit using _mm256_sad_epu8.
| Intrinsic | Description |
|---|---|
_mm256_load_si256(ptr) | Load 256 bits from aligned memory |
_mm256_setr_epi8(...) | Set 32 individual bytes |
_mm256_set1_epi8(x) | Broadcast a byte to all 32 positions |
_mm256_shuffle_epi8(table, idx) | Parallel byte lookup (per 128-bit lane) |
_mm256_and_si256(a, b) | Bitwise AND |
_mm256_srli_epi16(v, n) | Shift each 16-bit element right by n bits |
_mm256_add_epi8(a, b) | Add packed 8-bit integers |
_mm256_sad_epu8(a, b) | Sum absolute differences of bytes → 64-bit |
_mm256_add_epi64(a, b) | Add packed 64-bit integers |
Output will appear here after you run or submit.