Compute the inclusive prefix sum of an array using AVX2 intrinsics.
The prefix sum (also called scan) of an array is a new array where each element is the sum of all elements up to and including that position: output[i] = input[0] + input[1] + ... + input[i].
Prefix sum is inherently sequential — each output depends on all previous inputs. The SIMD challenge is to compute it faster despite this dependency.
#include <immintrin.h>
#include <cstdint>
void prefix_sum(const int32_t* input, int32_t* output, int n);
Parameters:
input — pointer to the input array, guaranteed 32-byte alignedoutput — pointer to the output array, guaranteed 32-byte alignedn — number of elements, guaranteed to be a multiple of 8 and at least 8Returns: nothing (result is written to output)
Input: [1, 2, 3, 4, 5, 6, 7, 8]
Output: [1, 3, 6, 10, 15, 21, 28, 36]
8 ≤ n ≤ 4,000,000n is always a multiple of 8[-100, 100]The standard approach computes a prefix sum within each 8-element block using shifts and adds, then propagates the block total to the next block as a running offset.
The tricky part is that _mm256_slli_si256 shifts within each 128-bit lane independently — it does not shift across the lane boundary. You need a separate cross-lane fix-up step.
| Intrinsic | Description |
|---|---|
_mm256_load_si256(ptr) | Load 256 bits from aligned memory |
_mm256_store_si256(ptr, v) | Store 256 bits to aligned memory |
_mm256_slli_si256(v, n) | Shift each 128-bit lane left by n bytes (zero fill) |
_mm256_add_epi32(a, b) | Add packed 32-bit integers |
_mm256_set1_epi32(x) | Broadcast a 32-bit integer |
_mm256_shuffle_epi32(v, imm) | Shuffle 32-bit elements within each 128-bit lane |
_mm256_permute2x128_si256(a, b, imm) | Select 128-bit lanes from two registers |
_mm256_extracti128_si256(v, 1) | Extract high 128-bit lane |
_mm_extract_epi32(v, idx) | Extract a 32-bit element from 128-bit register |
Output will appear here after you run or submit.