Parallel Find

Taskflow provides standalone template methods for finding elements in a given range on a GPU.

Include the Header

You need to include the header file, taskflow/cuda/algorithm/find.hpp, to use the parallel-find algorithms.

    #include <taskflow/cuda/algorithm/find.hpp>

Find an Element in a Range

tf::cuda_find_if finds the index of the first element in the range [first, last) that satisfies the given predicate. This is equivalent to a parallel execution of the following loop:

    unsigned idx = 0;
    for(; first != last; ++first, ++idx) {
      if(p(*first)) {
        return idx;
      }
    }
    return idx;

If no such element is found, the size of the range is returned. The following code finds the index of the first element that is divisible by 17 in a range of one million elements.

    const size_t N = 1000000;
    auto vec = tf::cuda_malloc_shared<int>(N);       // vector
    auto idx = tf::cuda_malloc_shared<unsigned>(1);  // index

    // initializes the data
    for(size_t i=0; i<N; vec[i++] = rand());

    // creates a stream and an execution policy that runs on it
    tf::cudaStream stream;
    tf::cudaDefaultExecutionPolicy policy(stream);

    // finds the index of the first element that is a multiple of 17
    tf::cuda_find_if(
      policy, vec, vec+N, idx, [] __device__ (auto v) { return v%17 == 0; }
    );

    // waits for the find operation to complete
    stream.synchronize();

    // verifies the result
    if(*idx != N) {
      assert(vec[*idx] % 17 == 0);
    }

    // deletes the memory
    cudaFree(vec);
    cudaFree(idx);

The find-if algorithm runs asynchronously through the stream specified in the execution policy. You need to synchronize the stream to obtain the correct result.

Find the Minimum Element in a Range

tf::cuda_min_element finds the index of the minimum element in the given range [first, last) using the given comparison function object.
This is equivalent to a parallel execution of the following loop:

    if(first == last) {
      return 0;
    }
    auto smallest = first;
    // iterate with a separate iterator so that first still marks the beginning
    for(auto itr = std::next(first); itr != last; ++itr) {
      if(op(*itr, *smallest)) {
        smallest = itr;
      }
    }
    return std::distance(first, smallest);

The following code finds the index of the minimum element in a range of one million elements on a GPU:

    const size_t N = 1000000;
    auto vec = tf::cuda_malloc_shared<int>(N);       // vector
    auto idx = tf::cuda_malloc_shared<unsigned>(1);  // index

    // initializes the data
    for(size_t i=0; i<N; vec[i++] = rand());

    // creates a stream and an execution policy that runs on it
    tf::cudaStream stream;
    tf::cudaDefaultExecutionPolicy policy(stream);

    // queries the buffer size required to find the minimum element over N elements
    auto bytes  = policy.min_element_bufsz<int>(N);
    auto buffer = tf::cuda_malloc_device<std::byte>(bytes);

    // finds the minimum element using the less comparator
    tf::cuda_min_element(
      policy, vec, vec+N, idx, [] __device__ (auto a, auto b) { return a < b; }, buffer
    );

    // waits for the min-element operation to complete
    stream.synchronize();

    // verifies the result
    assert(vec[*idx] == *std::min_element(vec, vec+N, std::less<int>{}));

    // deletes the memory
    cudaFree(vec);
    cudaFree(idx);
    cudaFree(buffer);

Since the GPU min-element algorithm may require an extra buffer to store temporary results, you need to provide a buffer at least as large as the value returned from tf::cudaDefaultExecutionPolicy::min_element_bufsz. You must keep the buffer alive until tf::cuda_min_element completes.

Find the Maximum Element in a Range

Similar to tf::cuda_min_element, tf::cuda_max_element finds the index of the maximum element in the given range [first, last) using the given comparison function object.
This is equivalent to a parallel execution of the following loop:

    if(first == last) {
      return 0;
    }
    auto largest = first;
    // iterate with a separate iterator so that first still marks the beginning
    for(auto itr = std::next(first); itr != last; ++itr) {
      if(op(*largest, *itr)) {
        largest = itr;
      }
    }
    return std::distance(first, largest);

The following code finds the index of the maximum element in a range of one million elements on a GPU:

    const size_t N = 1000000;
    auto vec = tf::cuda_malloc_shared<int>(N);       // vector
    auto idx = tf::cuda_malloc_shared<unsigned>(1);  // index

    // initializes the data
    for(size_t i=0; i<N; vec[i++] = rand());

    // creates a stream and an execution policy that runs on it
    tf::cudaStream stream;
    tf::cudaDefaultExecutionPolicy policy(stream);

    // queries the buffer size required to find the maximum element over N elements
    auto bytes  = policy.max_element_bufsz<int>(N);
    auto buffer = tf::cuda_malloc_device<std::byte>(bytes);

    // finds the maximum element using the less comparator
    tf::cuda_max_element(
      policy, vec, vec+N, idx, [] __device__ (auto a, auto b) { return a < b; }, buffer
    );

    // waits for the max-element operation to complete
    stream.synchronize();

    // verifies the result
    assert(vec[*idx] == *std::max_element(vec, vec+N, std::less<int>{}));

    // deletes the memory
    cudaFree(vec);
    cudaFree(idx);
    cudaFree(buffer);

Since the GPU max-element algorithm may require an extra buffer to store temporary results, you need to provide a buffer at least as large as the value returned from tf::cudaDefaultExecutionPolicy::max_element_bufsz. You must keep the buffer alive until tf::cuda_max_element completes.
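
When checking GPU results, it can help to have a host-only reference for the index semantics described on this page. The following is a minimal sketch under stated assumptions, not part of Taskflow: the names ref_find_if, ref_min_element, ref_max_element, and check_min_index are hypothetical, and the code uses only the C++ standard library. Because this page does not say which index the GPU algorithms report when several elements tie, the check compares element values rather than indices.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <iterator>
#include <vector>

// Hypothetical host-side helpers (NOT Taskflow APIs) that mirror the
// sequential loops given in this section.

// Index of the first element satisfying p, or the range size if none does.
template <typename I, typename P>
unsigned ref_find_if(I first, I last, P p) {
  unsigned idx = 0;
  for(; first != last; ++first, ++idx) {
    if(p(*first)) {
      return idx;
    }
  }
  return idx;
}

// Index of the minimum element under op; 0 for an empty range.
template <typename I, typename O>
unsigned ref_min_element(I first, I last, O op) {
  if(first == last) {
    return 0;
  }
  auto smallest = first;
  for(auto itr = std::next(first); itr != last; ++itr) {
    if(op(*itr, *smallest)) {
      smallest = itr;
    }
  }
  return static_cast<unsigned>(std::distance(first, smallest));
}

// Index of the maximum element under op; 0 for an empty range.
template <typename I, typename O>
unsigned ref_max_element(I first, I last, O op) {
  if(first == last) {
    return 0;
  }
  auto largest = first;
  for(auto itr = std::next(first); itr != last; ++itr) {
    if(op(*largest, *itr)) {
      largest = itr;
    }
  }
  return static_cast<unsigned>(std::distance(first, largest));
}

// Checks an index obtained elsewhere (for example, read back after the
// stream is synchronized) against the host reference by comparing values.
inline void check_min_index(const std::vector<int>& data, unsigned idx) {
  if(data.empty()) {
    return;
  }
  unsigned ref = ref_min_element(data.begin(), data.end(), std::less<int>{});
  assert(idx < data.size() && data[idx] == data[ref]);
}
```

For instance, after stream.synchronize() returns, a copy of the input held in a std::vector<int> could be passed to check_min_index together with *idx. An analogous check for the maximum would use ref_max_element in the same way.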