Given an array of 32-bit integers, write out only the elements that are strictly less than a threshold value, preserving their original order. Return the number of elements written.
This is a fundamental data processing primitive — the vectorized equivalent of a filter. The challenge is that the number of qualifying elements varies per SIMD block, so the output is variable-length.
#include <immintrin.h>
#include <cstdint>
int pack_and_filter(const int32_t* input, int32_t* output, int n, int32_t threshold);
Parameters:
input — pointer to the input array, guaranteed 32-byte alignedoutput — pointer to the output array, guaranteed 32-byte aligned, with space for at least n + 8 elementsn — number of elements, guaranteed to be a multiple of 8 and at least 8threshold — keep only elements strictly less than thisReturns: the number of elements written to output
Input: input = [1, 8, 3, 7, 2, 9, 4, 6], threshold = 5
Output: output = [1, 3, 2, 4], return 4
8 ≤ n ≤ 4,000,000n is always a multiple of 8input is 32-byte alignedoutput has at least n + 8 elements of space (the extra 8 account for trailing writes)The standard AVX2 approach uses a permutation lookup table:
movemask converts it to an 8-bit integer (256 possible values)_mm256_permutevar8x32_epi32 applies the permutationpopcount(mask)The LUT can be built at compile time with constexpr.
| Intrinsic | Description |
|---|---|
_mm256_load_si256(ptr) | Load 256 bits from aligned memory |
_mm256_storeu_si256(ptr, v) | Store 256 bits to memory (unaligned OK) |
_mm256_set1_epi32(x) | Broadcast a 32-bit integer |
_mm256_cmpgt_epi32(a, b) | Compare: all-ones where a > b |
_mm256_castsi256_ps(v) | Reinterpret as float vector (free) |
_mm256_movemask_ps(v) | Extract sign bits into 8-bit mask |
_mm256_permutevar8x32_epi32(v, idx) | Full cross-lane 32-bit permute using index vector |
__builtin_popcount(x) | Count set bits |
Output will appear here after you run or submit.