Parallel Scan

Taskflow provides standard template methods for scanning a range of items on a CUDA GPU.

Include the Header

You need to include the header file, taskflow/cuda/algorithm/scan.hpp, to use the parallel-scan algorithm.

    #include <taskflow/cuda/algorithm/scan.hpp>

What is a Scan Operation?

A parallel scan task performs the cumulative sum, also known as prefix sum or scan, of the input range and writes the result to the output range. Each element of the output range contains the running total of the preceding input elements, combined using the given binary operator.

Scan a Range of Items

tf::cuda_inclusive_scan computes an inclusive prefix sum using the given binary operator over the range of elements specified by [first, last). The term "inclusive" means that the i-th input element is included in the i-th sum. The following code computes the inclusive prefix sum over an input array and stores the result in an output array.

    const size_t N = 1000000;
    int* input  = tf::cuda_malloc_shared<int>(N);  // input vector
    int* output = tf::cuda_malloc_shared<int>(N);  // output vector

    // initializes the data (small values keep the running sum within the range of int)
    for(size_t i=0; i<N; input[i++]=rand()%100);

    // creates an execution policy
    tf::cudaStream stream;
    tf::cudaDefaultExecutionPolicy policy(stream);

    // queries the required buffer size to scan N elements using the given policy
    auto bytes  = policy.scan_bufsz<int>(N);
    auto buffer = tf::cuda_malloc_device<std::byte>(bytes);

    // computes inclusive scan over input and stores the result in output
    tf::cuda_inclusive_scan(policy,
      input, input + N, output, [] __device__ (int a, int b) { return a + b; }, buffer
    );

    // synchronizes and verifies the result
    stream.synchronize();
    for(size_t i=1; i<N; i++) {
      assert(output[i] == output[i-1] + input[i]);
    }

    // deletes the device memory
    cudaFree(input);
    cudaFree(output);
    cudaFree(buffer);

The scan algorithm runs asynchronously through the stream specified in the execution policy. You need to synchronize the stream to obtain correct results. Since the GPU scan algorithm may require an extra buffer to store temporary results, you need to provide a buffer whose size is at least the value returned by tf::cudaDefaultExecutionPolicy::scan_bufsz. You must keep the buffer alive until the scan completes.

On the other hand, tf::cuda_exclusive_scan computes an exclusive prefix sum. The term "exclusive" means that the i-th input element is NOT included in the i-th sum.

    // computes exclusive scan over input and stores the result in output
    tf::cuda_exclusive_scan(policy,
      input, input + N, output, [] __device__ (int a, int b) { return a + b; }, buffer
    );

    // synchronizes the execution and verifies the result
    stream.synchronize();
    for(size_t i=1; i<N; i++) {
      assert(output[i] == output[i-1] + input[i-1]);
    }

Scan a Range of Transformed Items

tf::cuda_transform_inclusive_scan transforms each item in the range [first, last) and computes an inclusive prefix sum over these transformed items.
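
To make the difference between the scan variants concrete, here is a minimal host-side sketch that uses the C++17 standard algorithms in <numeric> on a four-element input. It is only a sequential reference for the semantics, not Taskflow's GPU implementation. The helper names (inclusive_sum, exclusive_sum, transform_inclusive_sum, check_small_example) are hypothetical, and the initial value 0 passed to std::exclusive_scan is an assumption of this sketch; the Taskflow calls described here take no initial value.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper: inclusive prefix sum, out[i] = in[0] + ... + in[i].
std::vector<int> inclusive_sum(const std::vector<int>& in) {
  std::vector<int> out(in.size());
  std::inclusive_scan(in.begin(), in.end(), out.begin(),
                      [](int a, int b) { return a + b; });
  return out;
}

// Hypothetical helper: exclusive prefix sum, out[i] = in[0] + ... + in[i-1].
// The initial value 0 is an assumption made for this sketch.
std::vector<int> exclusive_sum(const std::vector<int>& in) {
  std::vector<int> out(in.size());
  std::exclusive_scan(in.begin(), in.end(), out.begin(), 0,
                      [](int a, int b) { return a + b; });
  return out;
}

// Hypothetical helper: multiplies each item by 10, then takes the inclusive sum.
std::vector<int> transform_inclusive_sum(const std::vector<int>& in) {
  std::vector<int> out(in.size());
  std::transform_inclusive_scan(in.begin(), in.end(), out.begin(),
                                [](int a, int b) { return a + b; },  // binary scan operator
                                [](int a) { return a * 10; });       // unary transform operator
  return out;
}

// Checks the three variants on the input {1, 2, 3, 4}.
void check_small_example() {
  const std::vector<int> in{1, 2, 3, 4};
  assert((inclusive_sum(in) == std::vector<int>{1, 3, 6, 10}));
  assert((exclusive_sum(in) == std::vector<int>{0, 1, 3, 6}));
  assert((transform_inclusive_sum(in) == std::vector<int>{10, 30, 60, 100}));
}
```

A reference like this can be run on a small copy of the input to cross-check GPU results element by element, provided the binary operator is associative so that the parallel and sequential orders agree.
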
The following code multiplies each item by 10 and then computes the inclusive prefix sum over 1000000 transformed items.

    const size_t N = 1000000;
    int* input  = tf::cuda_malloc_shared<int>(N);  // input vector
    int* output = tf::cuda_malloc_shared<int>(N);  // output vector

    // initializes the data (small values keep the running sum within the range of int)
    for(size_t i=0; i<N; input[i++]=rand()%100);

    // creates an execution policy
    tf::cudaStream stream;
    tf::cudaDefaultExecutionPolicy policy(stream);

    // queries the required buffer size to scan N elements using the given policy
    auto bytes  = policy.scan_bufsz<int>(N);
    auto buffer = tf::cuda_malloc_device<std::byte>(bytes);

    // computes inclusive scan over transformed input and stores the result in output
    tf::cuda_transform_inclusive_scan(policy,
      input, input + N, output,
      [] __device__ (int a, int b) { return a + b; },  // binary scan operator
      [] __device__ (int a) { return a*10; },          // unary transform operator
      buffer
    );

    // waits for the scan to complete
    stream.synchronize();

    // verifies the result
    for(size_t i=1; i<N; i++) {
      assert(output[i] == output[i-1] + input[i] * 10);
    }

    // deletes the device memory
    cudaFree(input);
    cudaFree(output);
    cudaFree(buffer);

Similarly, tf::cuda_transform_exclusive_scan performs an exclusive prefix sum over a range of transformed items. The following code computes the exclusive prefix sum over 1000000 transformed items, each multiplied by 10.

    const size_t N = 1000000;
    int* input  = tf::cuda_malloc_shared<int>(N);  // input vector
    int* output = tf::cuda_malloc_shared<int>(N);  // output vector

    // initializes the data (small values keep the running sum within the range of int)
    for(size_t i=0; i<N; input[i++]=rand()%100);

    // creates an execution policy
    tf::cudaStream stream;
    tf::cudaDefaultExecutionPolicy policy(stream);

    // queries the required buffer size to scan N elements using the given policy
    auto bytes  = policy.scan_bufsz<int>(N);
    auto buffer = tf::cuda_malloc_device<std::byte>(bytes);

    // computes exclusive scan over transformed input and stores the result in output
    tf::cuda_transform_exclusive_scan(policy,
      input, input + N, output,
      [] __device__ (int a, int b) { return a + b; },  // binary scan operator
      [] __device__ (int a) { return a*10; },          // unary transform operator
      buffer
    );

    // waits for the scan to complete
    stream.synchronize();

    // verifies the result
    for(size_t i=1; i<N; i++) {
      assert(output[i] == output[i-1] + input[i-1] * 10);
    }

    // deletes the device memory
    cudaFree(input);
    cudaFree(output);
    cudaFree(buffer);
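
The verification loop above checks a recurrence between neighbouring outputs rather than each absolute value. As a rough illustration, the following host-side sketch builds a sequential reference with std::transform_exclusive_scan from C++17 and wraps the same recurrence in a reusable check. The helper names are hypothetical, and the initial value 0 is an assumption of this sketch rather than something the Taskflow call specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: multiplies each item by 10, then takes the exclusive
// prefix sum. The initial value 0 is an assumption made for this sketch.
std::vector<int> transform_exclusive_sum(const std::vector<int>& in) {
  std::vector<int> out(in.size());
  std::transform_exclusive_scan(in.begin(), in.end(), out.begin(), 0,
                                [](int a, int b) { return a + b; },  // binary scan operator
                                [](int a) { return a * 10; });       // unary transform operator
  return out;
}

// Hypothetical helper: returns true if out[i] == out[i-1] + in[i-1] * 10
// holds for every i >= 1, which is the relation the verification loop asserts.
// It does not constrain out[0].
bool satisfies_exclusive_recurrence(const std::vector<int>& in,
                                    const std::vector<int>& out) {
  if (in.size() != out.size()) return false;
  for (std::size_t i = 1; i < in.size(); ++i) {
    if (out[i] != out[i - 1] + in[i - 1] * 10) return false;
  }
  return true;
}

// Checks the reference and the recurrence on the input {1, 2, 3, 4}.
void check_recurrence_example() {
  const std::vector<int> in{1, 2, 3, 4};
  const std::vector<int> out = transform_exclusive_sum(in);
  assert((out == std::vector<int>{0, 10, 30, 60}));
  assert(satisfies_exclusive_recurrence(in, out));
}
```

Because the recurrence only relates each output to its predecessor, it says nothing about the first output element; comparing against a full reference such as transform_exclusive_sum is the stricter test when the expected first element is known.
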