Parallel Prefix Sum, Scatter and Gather

Rishabh Khandelwal
10 min read · Dec 6, 2022


The outcome of this blog is to find the Prefix Sum SCAN using parallel computing.

What is the SCAN Algorithm?

The SCAN operation takes a binary associative operator (addition, for example) and an array of n elements, and returns a new array in which each position holds the combination of the elements up to that position. The prefix sum is the special case where the operator is addition.

There are two types of SCAN:

  1. Exclusive SCAN: Here, the output at position "i" combines all the elements before position "i", but not the "i"th element itself.
  2. Inclusive SCAN: Here, the output at position "i" combines all the elements up to and including the "i"th element.
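To make the distinction concrete, here is a minimal Python sketch (the function names are mine, for illustration):

```python
# Inclusive vs. exclusive scan over addition.

def inclusive_scan(a):
    """P[i] = a[0] + a[1] + ... + a[i]"""
    out, running = [], 0
    for x in a:
        running += x
        out.append(running)
    return out

def exclusive_scan(a):
    """P[i] = a[0] + ... + a[i-1]; the identity (0) goes first."""
    out, running = [], 0
    for x in a:
        out.append(running)
        running += x
    return out

print(inclusive_scan([2, 3, 7, 5]))  # [2, 5, 12, 17]
print(exclusive_scan([2, 3, 7, 5]))  # [0, 2, 5, 12]
```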

For the remainder of this blog we will only discuss the inclusive SCAN.

A SCAN algorithm generalizes the prefix sum to other operations.

There are many types of SCAN algorithms; some of them are:

+-scan: This is also called "add-scan" or "prefix sum". Here each position holds the sum of all the previous values and the value at the current position.

max-scan: Here each position holds the maximum of all the elements up to and including the current position.

*-scan: This is also called "multiply-scan" or "prefix product". Here each position holds the product of all the elements up to and including the current one.

and-scan: Here, we calculate the cumulative logical AND.

There are many more SCAN algorithms, but these are some of the popular ones.
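These variants are all the same scan skeleton with a different operator plugged in. A minimal Python sketch (the generic `scan` helper is illustrative, not from the original):

```python
# A generic inclusive scan parameterized by the operator.
# Swapping in +, max, or * gives the variants listed above.

def scan(a, op):
    out = [a[0]]
    for x in a[1:]:
        out.append(op(out[-1], x))
    return out

nums = [2, 3, 1, 5]
print(scan(nums, lambda u, v: u + v))  # +-scan: [2, 5, 6, 11]
print(scan(nums, max))                 # max-scan: [2, 3, 3, 5]
print(scan(nums, lambda u, v: u * v))  # *-scan: [2, 6, 6, 30]
```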

Let’s discuss the prefix sum algorithm in more detail, as the task of this blog is to compute the prefix sum using parallelization.

Prefix Sum

So, what is a prefix sum?

The prefix sum is a cumulative sum: each output element is the sum of the current element and all the elements that came before it.

Let’s discuss this in more detail by taking an example using Figure 1.

Figure 1: Understanding Prefix Sum (Reference: https://williamrjribeiro.com/?p=132)

Given,

An array A: [2, 3, 7, 5],

Now, let’s try to find the prefix sum of A.

Let’s define the output array as P.
Array P, at each index, contains the sum of all elements up to and including that index. We can compute it with a loop that adds the previous element to the current one throughout the array.

So, let’s type a pseudo-code for the given:

// We define an array A
A = [2, 3, 7, 5]
P = A.copy() // P starts as a copy of A
// Iterate through P, adding the previous element to the current one
for (int i = 1; i < P.length(); i++) {
    P[i] = P[i] + P[i-1]
}

So,

P[0] = A[0],
P[1] = A[0] + A[1],
P[2] = A[0] + A[1] + A[2],
P[3] = A[0] + A[1] + A[2] + A[3]

Therefore,

P[0] = 2, // The first element stays as it is
P[1] = 2 + 3 = 5, // The second element is 3; P[1] = P[1] + P[0] = 3 + 2 = 5, which replaces the original value in the array
P[2] = 5 + 7 = 12, // The third element is 7, and the sum of the first two elements is already stored in P[1], so P[2] = P[2] + P[1] = 7 + 5; now P[2] is updated to 12
P[3] = 12 + 5 = 17 // The fourth element is 5, and the sum of the first three elements is already stored in P[2], so P[3] = P[3] + P[2] = 5 + 12; now P[3] is updated to 17

Hence, the updated array is [2, 5, 12, 17]. This is the prefix sum problem in computer science.

This is the sequential implementation. It performs O(n) additions (n-1, to be exact).

Parallel Prefix Sum Algorithm

There are different approaches to performing the prefix sum in parallel. Let’s start by discussing the most naive approach.

Naive Parallel Prefix Sum Approach

Figure 2: The Naive Approach for Parallel Prefix Sum (Reference: https://developer.nvidia.com/sites/all/modules/custom/gpugems/books/GPUGems3/elementLinks/39fig02.jpg)

Here, the scan proceeds in steps with doubling strides. In the first step, every element adds the element one position to its left; in the next step, every element adds the element two positions to its left; then four, and so on. An element that has no neighbor at the current stride is simply carried forward unchanged. This computation is performed log2(n) times; where
n = number of elements in the array

Figure 3: The Naive Approach for Parallel Prefix Sum (Reference: https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Hillis-Steele_Prefix_Sum.svg/450px-Hillis-Steele_Prefix_Sum.svg.png)

Pseudocode of naive approach:

addScan(A[1:n]) {
    if n = 1 then return A[1]
    let Io[1:n/2] = odd indices
    // i.e., 1, 3, 5, ... and so on
    let Ie[1:n/2] = even indices
    // i.e., 2, 4, 6, ... and so on
    A[Ie] = A[Ie] + A[Io]
    A[Ie] = addScan(A[Ie])
    A[Io[2:]] = A[Ie[1:n/2-1]] + A[Io[2:]]
    return A
}
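The recursion can be simulated sequentially in runnable Python (a sketch; the list comprehensions stand in for the array statements that would run in parallel):

```python
# Sequential simulation of the recursive add-scan above.

def add_scan(a):
    n = len(a)
    if n == 1:
        return a[:]
    odd = a[0::2]    # A[Io]: positions 1, 3, 5, ... in 1-indexed terms
    even = a[1::2]   # A[Ie]: positions 2, 4, 6, ...
    even = [o + e for o, e in zip(odd, even)]  # A[Ie] = A[Ie] + A[Io]
    even = add_scan(even)                      # recurse on half the problem
    # Fix up the odd positions using the scanned even values:
    # odd[i] = even[i-1] + odd[i] for i >= 1
    odd = [odd[0]] + [e + o for e, o in zip(even, odd[1:])]
    out = [0] * n
    out[0::2], out[1::2] = odd, even           # interleave back
    return out

print(add_scan([2, 3, 7, 5]))  # [2, 5, 12, 17]
```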

The step-wise scheme illustrated in Figures 2 and 3 performs O(n·log2(n)) addition operations.
Remember that in the sequential scan we perform only O(n) additions. The recursive formulation above, however, does better, as the following work analysis shows.

Also, notice that the pairwise-combine statement "A[Ie] = A[Ie] + A[Io]" performs n/2 additions, and the final fix-up of the odd positions performs (n/2)-1 additions.
The recursive call, i.e., the add-scan itself, then operates on a problem of half the size.

So, the recurrence for the total work is:

W(n) = (n - 1) + W(n/2) for n ≥ 2, and W(n) = 0 for n ≤ 1.

So, the total work is O(n).

Now, let’s calculate the span (the depth of the computation).

The recursive call in the statement "A[Ie] = addScan(A[Ie])" gives O(log n) levels.
All the other operations, i.e., the pairwise combine and the odd-position fix-up, are data parallel. We can implement them with a parallel for, which contributes O(log n) span per level.

So, the total span is:

D(n) = O(log n) + D(n/2) for n ≥ 2, and D(n) = O(1) for n ≤ 1.

By solving this, we get a total span of O(log²n).

But this algorithm assumes there are as many processors as data elements, which is not really the case when we use a very large array.

So, in that case we need to divide the big prefix sum scan among a number of threads, each of which goes through a portion of the array on a single multiprocessor of the GPU.

But not all elements can be processed in parallel when the array is larger than the number of available threads. The naive approach then fails, because it performs the scan in place on the original array: for a large array, the results written by one warp or iteration would overwrite values that another still needs to read.

To solve this problem, a double-buffered scan is used, where we create two temporary arrays. This double-buffer scan algorithm was proposed by William Daniel "Danny" Hillis and Guy Lewis Steele Jr.

The pseudocode for double buffer scan algorithm is:

for (int i = 0; i < log2(n); i++) {
    for all j = 0 to n-1 in parallel do {
        if (j < 2^i) {
            x[i+1][j] = x[i][j]
        } else {
            x[i+1][j] = x[i][j] + x[i][j - 2^i]
        }
    }
}

This algorithm is used when we are computing in parallel on a GPU. The temporary buffers are created per step, so it can be applied to huge computational problems without the overwrite issue.
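A sequential Python simulation of the double-buffer scheme might look like this (a sketch; the inner loop is the part that would run in parallel on a GPU):

```python
import math

# Sequential simulation of the Hillis-Steele double-buffer scan.
# Two buffers are ping-ponged so that no step reads a value the
# same step has already overwritten.

def hillis_steele_scan(a):
    n = len(a)
    src = a[:]
    for i in range(math.ceil(math.log2(n))):
        dst = src[:]            # the second buffer for this step
        stride = 2 ** i
        for j in range(n):      # this loop is parallel on a GPU
            if j >= stride:
                dst[j] = src[j] + src[j - stride]
        src = dst
    return src

print(hillis_steele_scan([2, 3, 7, 5]))  # [2, 5, 12, 17]
```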

Hence, the double-buffer prefix sum scan algorithm is what gets used when we perform a scan task in the real world.

Still, the naive approach to parallel prefix sum will perform poorly on huge arrays because of its work-inefficiency.

Work Efficient Parallel Prefix Sum Approach

Because of the inefficiency of the naive approach we discussed earlier, we need a smarter parallel algorithm that reduces the number of computations performed, providing a much faster execution time.

Hence, an algorithm needs to be made that approaches the efficiency of the sequential algorithm while taking advantage of the parallelism offered by the GPU.

So, the work-efficient algorithm is derived from the famous balanced trees pattern in parallel computing. Here, we first build a balanced binary tree on the input data and then sweep it from the leaves to the root and back to compute the scan operation.

Here, the binary tree has n leaves.
Also, the depth of the tree is log2(n).
On top of that, each depth level d has 2^d nodes in it.

So, we perform one add per node, which means we perform O(n) operations (additions, in this case) on a single traversal of the tree.

This algorithm is called a work-efficient scan algorithm.

The tree used here is not an actual data structure; it is a concept we use to determine what each thread does at each step of the traversal throughout the algorithm.

In the work-efficient scan algorithm, the operations are performed in-place on an array in the shared memory.

This work-efficient algorithm was presented by Blelloch in 1990.

This algorithm provides a major increase in efficiency, as it avoids the extra log2(n) factor of work incurred by the naive parallel prefix sum approach to the SCAN algorithm.

The algorithm works in two phases:

  1. The Up-Sweep (Reduce) Phase of a Work-Efficient Parallel Prefix Sum Scan Algorithm

Pseudocode of the up-sweep or the reduce phase of work-efficient parallel prefix sum scan algorithm:

for (int d = 0; d < log2(n); d++) {
    for all k = 0 to n-1 by 2^(d+1) in parallel do {
        x[k + 2^(d+1) - 1] = x[k + 2^d - 1] + x[k + 2^(d+1) - 1]
    }
}

In this phase we calculate partial sums at the internal nodes of the tree while traversing up from the leaves. At the end, the root node holds the sum of all elements in the array.

Figure 4: An Illustration of the Up-Sweep, or Reduce, Phase of a Work-Efficient Sum Scan Algorithm (Reference: https://developer.nvidia.com/sites/all/modules/custom/gpugems/books/GPUGems3/elementLinks/39fig03.jpg)

2. The Down-Sweep Phase of a Work-Efficient Parallel Prefix Sum Scan Algorithm

Figure 5: An Illustration of the Down-Sweep Phase of the Work-Efficient Parallel Sum Scan Algorithm (Reference: https://developer.nvidia.com/sites/all/modules/custom/gpugems/books/GPUGems3/elementLinks/39fig03.jpg)

Now that we have the partial sums at the internal nodes at each depth, we traverse down the tree from the root. This is the down-sweep phase of the work-efficient parallel prefix sum scan algorithm.

While traversing back down from the root we use the partial sums from the Up-sweep phase to build the scan in place on the array.

We first insert a zero at the root of the tree. Then, at each step, each node at the current level passes its own value to its left child, and the sum of its value and the former value of its left child to its right child.

x[n-1] = 0
for d = log2(n) - 1 down to 0 do {
    for all k = 0 to n-1 by 2^(d+1) in parallel do {
        temp = x[k + 2^d - 1]
        x[k + 2^d - 1] = x[k + 2^(d+1) - 1]
        x[k + 2^(d+1) - 1] = x[k + 2^(d+1) - 1] + temp
    }
}
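Putting the two phases together, here is a sequential Python simulation (a sketch assuming n is a power of two; note that seeding the root with zero makes the result the exclusive scan, and adding the input element-wise recovers the inclusive one):

```python
import math

# Sequential simulation of the two-phase work-efficient (Blelloch) scan.

def blelloch_scan(a):
    x = a[:]
    n = len(x)
    levels = int(math.log2(n))
    # Up-sweep (reduce): build partial sums; the last slot ends up
    # holding the total of the whole array.
    for d in range(levels):
        for k in range(0, n, 2 ** (d + 1)):     # parallel on a GPU
            x[k + 2 ** (d + 1) - 1] += x[k + 2 ** d - 1]
    # Down-sweep: seed the root with the identity, then push partial
    # sums down: left child gets the node's value, right child gets
    # node's value + former left-child value.
    x[n - 1] = 0
    for d in range(levels - 1, -1, -1):
        for k in range(0, n, 2 ** (d + 1)):     # parallel on a GPU
            temp = x[k + 2 ** d - 1]
            x[k + 2 ** d - 1] = x[k + 2 ** (d + 1) - 1]
            x[k + 2 ** (d + 1) - 1] += temp
    return x

print(blelloch_scan([2, 3, 7, 5]))  # exclusive scan: [0, 2, 5, 12]
```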

Figure 6: The Work Efficient Parallel Prefix Sum Approach (Reference: https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Hillis-Steele_Prefix_Sum.svg/450px-Hillis-Steele_Prefix_Sum.svg.png)

So, overall,

this work-efficient scan algorithm performs O(n) computational operations, where:

number of adds = 2(n-1);
number of swaps = n-1

This makes it work-efficient: it is more viable for large arrays, and will perform much better there than the naive prefix sum scan approach.

The recursion continues to a depth of O(log n), given that the input sequence has n elements.

The total number of steps is O(n), and this work-efficient prefix sum scan algorithm can be implemented on a PRAM with O(n/log n) processors without any asymptotic slowdown. This can be achieved by assigning multiple indices to each processor in successive rounds when the array is much larger than the number of warps.

Conclusion: (1) Naive Parallel Prefix Sum Approach vs. (2) Work Efficient Parallel Prefix Sum Approach

Both algorithms, the naive parallel prefix scan and the work-efficient parallel scan, run in O(log n) time. However, the naive approach takes exactly log2(n) steps, whereas the work-efficient parallel prefix scan requires 2·log2(n) - 2 steps to perform the same scan computation.

Figures 3 and 6 illustrate a 16-input example for the naive and work-efficient approaches.
The naive approach is 12-way parallel (49 units of work divided by a span of 4). On the other hand, the work-efficient parallel prefix sum approach is only 4-way parallel (26 units of work divided by a span of 6).
However, the first algorithm is work-inefficient because it performs asymptotically more work than is required sequentially.
The second algorithm is work-efficient, as it performs only a constant factor (2) of the amount of work required by the sequential algorithm.

Both algorithms have their advantages, though. The Naive Parallel Prefix Sum Approach is more likely to perform better when the size of the array is less than the number of threads available. On the other hand, the Work Efficient Parallel Prefix Sum Approach is likely to perform better when parallelism is more limited.

Scatter and Gather

Gather and scatter are two fundamental data-parallel operations. Let’s discuss them individually;

What is Gather?

It is a type of memory addressing that collects, or gathers, data from multiple arbitrary indices, or, in simpler terms, reads from multiple given locations.

Pseudocode for gather;

for (int i = 0; i < n; ++i) {
    x[i] = y[idx[i]]
}

What is Scatter?

It is a type of memory addressing that stores, or scatters, data to multiple arbitrary indices, or, in simpler terms, writes to multiple given locations.

Pseudocode for scatter;

for (int i = 0; i < n; ++i) {
    y[idx[i]] = x[i]
}
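Both loops can be sketched in runnable Python (the array names follow the pseudocode; on a GPU each loop iteration would be one thread):

```python
# Runnable sketch of gather and scatter with an explicit index array.

def gather(y, idx):
    # x[i] = y[idx[i]]: read from arbitrary locations
    return [y[j] for j in idx]

def scatter(x, idx, n):
    # y[idx[i]] = x[i]: write to arbitrary locations
    y = [0] * n
    for i, j in enumerate(idx):
        y[j] = x[i]
    return y

y = [10, 20, 30, 40]
idx = [3, 0, 2, 1]
print(gather(y, idx))                      # [40, 10, 30, 20]
print(scatter([40, 10, 30, 20], idx, 4))   # [10, 20, 30, 40]
```

Note that scattering with the same index array undoes the gather, which is a handy sanity check.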
