Parallel prefix and Computational Analysis in Multi-Core 07/06 Update SLTechnology News&Howtos

Parallel prefix and Computational Analysis in Multi-Core

2025-07-06 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >

Shulou(Shulou.com)06/02 Report--

This article introduces the relevant knowledge of "parallel prefix and computational analysis in multi-core". In the operation process of actual cases, many people will encounter such difficulties. Next, let Xiaobian lead you to learn how to deal with these situations. I hope you can read carefully and learn something!

1. Calculation of serial prefix sum

If we give a sequence a[n], let S[k] = a[0]+a[1]+...+ A[k], k = 0, 1, 2…n-1, the sequence S[k] is the prefix sum of the sequence A[n]. For example, the following column of data:

a[4] = {1, 2, 3, 4};

Its prefix and are

S[0] = a[0] = 1;

S[1] = a[0] + a[1] = 1+ 2 = 3;

S[2] = a[0] + a[1] + a[2] = 1 + 2 + 3 = 6;

S[3] = a[0] + a[1] + a[2] + a[3] = 1 + 2 + 3 + 4 = 10;

The calculation of prefix sum is very simple. Generally, the calculation of serial prefix sum can be performed by the following function:

/** serial prefix sum calculation function @param T * pInput -Input data @param T *pOutput -Output data (prefix and) @param int nLen -Enter the length of the data @return void -none */ template void Sequential_PrefixSum(T * pInput, T *pOutput, int nLen) { int i; pOutput[0] = pInput[0]; for ( i = 1; i

< nLen; i++ ) { pOutput[i] = pInput[i] + pOutput[i-1]; } } 在上面的串行前缀和计算代码中可以看出，每次循环都依赖于上一次循环的结果，因此无法直接对循环进行并行化，要进行并行化则必须修改计算方法，下面就来看如何进行并行前缀和的计算。 2、并行前缀和的计算前缀和的并行计算方法有许多种，David Callahan的论文"Recognizing and Parallelizing Bounded Recurrences"中给出了一种适合共享存储多处理器系统中的有界递归计算的通用方法，前缀和计算属于有界递归计算的一个特例。下面先以一个实例来讲解整个并行计算的过程：例：假设有4个处理器要计算16个整数的前缀和，这16个整数如下： 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 第1步，先将上面数据平分成4组，每个处理器各计算一组数据的前缀和，如下所示：（1 2 3 4）（5 6 7 8）（9 10 11 12）（13 14 15 16）（1 3 6 10）（5 11 18 26）（9 19 30 42）（13 27 42 58）第2步，选取每组数据的***一个数据，对这几个数据计算前缀和，如下所示：（1 3 6 10）（5 11 18 26）（9 19 30 42）（13 27 42 58）（1 3 6 10）（5 11 18 36）（9 19 30 78）（13 27 42 136）经过这步的计算后，可以很容易发现，每组的***一个数据的值已经变成了原始数据在它所处位置之前（包含本位置）的所有数据的和。例如： 36 ＝ 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 78 = 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 12 第3步：从第2组数开始，将每组中的数（除***一个数外）加上它的前一组数的***一个数，即可得到所有数的前缀和。如下所示： (1 3 6 10) (5+10 11+10 18+10 36) (9+36 19+36 30+36 78) (13+78 27+78 42+78 136） (1 3 6 10) (15 21 28 36) (45 55 66 78) (91 105 120 136）从上面的计算过程可以看出，第1步和第3步都是很容易进行并行化计算，第2步中，由于计算量非常小，用串行计算即可，下面给出上面处理过程的代码实现： #define MIN_PRARLLEL_PREFIXSUM_COUNT 8192 /** 并行前缀和计算函数 @param T * pInput - 输入数据 @param T *pOutput - 输出数据（前缀和） @param int nLen - 输入数据长度 @return void - 无 */ template void Parallel_PrefixSum(T * pInput, T *pOutput, int nLen) { int i; int nCore = omp_get_num_procs(); if ( nCore < 4 || nLen < MIN_PRARLLEL_PREFIXSUM_COUNT ) { Sequential_PrefixSum(pInput, pOutput, nLen); return; } int nStep = nLen / nCore; if ( nStep * nCore < nLen ) { nStep += 1; } #pragma omp parallel for num_threads(nCore) for ( i = 0; i < nCore; i++ ) { int k; int nStart = i * nStep; int nEnd = (i+1) * nStep; if ( nEnd >

nLen ) { nEnd = nLen; } pOutput[nStart] = pInput[nStart]; for ( k = nStart+1; k

< nEnd; k++ ) { pOutput[k] = pInput[k] + pOutput[k-1]; } } for ( i = 2; i < nCore; i++ ) { pOutput[i * nStep - 1] += pOutput[(i-1) * nStep - 1]; } pOutput[nLen-1] += pOutput[(nCore-1)*nStep - 1]; #pragma omp parallel for num_threads(nCore-1) for ( i = 1; i < nCore; i++ ) { int k; int nStart = i * nStep; int nEnd = (i+1) * nStep - 1; if ( nEnd >

= nLen ) { nEnd = nLen - 1; } for ( k = nStart; k < nEnd; k++ ) { pOutput[k] += pOutput[nStart-1]; } } return; }

As can be seen from the calculation process of parallel prefix sum above, its calculation amount is almost double that of serial prefix sum, and if the actual overhead in the program is considered, the calculation increase is more than double. Therefore, on dual-core CPU machines, using parallel prefix and calculation makes no sense, only on machines with more than 4 cores CPU, it has practical value.

In Parallel_PrefixSum() function, it is judged whether the CPU core number is less than 4 and whether the calculation amount is insufficient. If either of the two judgment conditions is satisfied, the serial prefix core calculation function is invoked to calculate, and then the parallel prefix sum is calculated. In parallel computing, the calculation is simply spread evenly to each CPU, without considering that the calculation amount is spread evenly to each CPU core when the CPU core number is large. The problem of too small thread granularity is mainly to avoid making the code look too complicated. If necessary, it can be modified to automatically calculate the most appropriate number of threads (which may be less than the number of CPU cores), and then use the corresponding number of threads for parallel calculation.

"Parallel prefixes and computational analysis in multicore" is introduced here. Thank you for reading it. If you want to know more about industry-related knowledge, you can pay attention to the website. Xiaobian will output more high-quality practical articles for everyone!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.