Parallel Merge

Taskflow provides standalone template methods for merging two sorted ranges of items into a single sorted range.

Include the Header

To use the parallel-merge algorithm, include the header file taskflow/cuda/algorithm/merge.hpp.

    #include <taskflow/cuda/algorithm/merge.hpp>

Merge Two Sorted Ranges of Items

tf::cuda_merge merges two sorted ranges of items into a sorted range. The following code merges two sorted arrays, input_1 and input_2, each of 1000 items, into a sorted array output of 2000 items.

    const size_t N = 1000;
    int* input_1 = tf::cuda_malloc_shared<int>(N);    // input vector 1
    int* input_2 = tf::cuda_malloc_shared<int>(N);    // input vector 2
    int* output  = tf::cuda_malloc_shared<int>(2*N);  // output vector

    // initializes the data
    for(size_t i=0; i<N; i++) {
      input_1[i] = rand() % 100;
      input_2[i] = rand() % 100;
    }
    // both inputs must be sorted before merging
    std::sort(input_1, input_1 + N);
    std::sort(input_2, input_2 + N);

    // create an execution policy
    tf::cudaStream stream;
    tf::cudaDefaultExecutionPolicy policy(stream);

    // queries the required buffer size to merge two N-element sorted vectors
    auto bytes  = policy.merge_bufsz(N, N);
    auto buffer = tf::cuda_malloc_device<std::byte>(bytes);

    // merge input_1 and input_2 to output
    tf::cuda_merge(policy,
      input_1, input_1 + N, input_2, input_2 + N, output,
      [] __device__ (int a, int b) { return a < b; },  // comparator
      buffer
    );

    // wait for the merge to complete
    stream.synchronize();

    // verify the result
    assert(std::is_sorted(output, output + 2*N));

    // free the memory
    cudaFree(input_1);
    cudaFree(input_2);
    cudaFree(output);
    cudaFree(buffer);

The merge algorithm runs asynchronously through the stream specified in the execution policy. You need to synchronize the stream before reading the output to obtain correct results.
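The std::is_sorted check in the example confirms only that the output is ordered, not that it contains exactly the items of both inputs. A stricter check compares the output against a host-side reference built with std::merge. The following is a minimal sketch in plain C++ that makes no Taskflow or CUDA calls; the helper names reference_merge and matches_reference are hypothetical and not part of Taskflow.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <iterator>
#include <vector>

// Hypothetical helper (not a Taskflow API): builds the expected merge
// result on the host from two sorted input arrays.
inline std::vector<int> reference_merge(
  const int* a, std::size_t a_count,
  const int* b, std::size_t b_count
) {
  // std::merge requires both inputs to be sorted
  assert(std::is_sorted(a, a + a_count));
  assert(std::is_sorted(b, b + b_count));

  std::vector<int> expected;
  expected.reserve(a_count + b_count);
  std::merge(a, a + a_count, b, b + b_count,
             std::back_inserter(expected), std::less<int>{});
  return expected;
}

// Hypothetical helper: returns true if the first expected.size() items
// of output are identical to the reference.
inline bool matches_reference(const int* output,
                              const std::vector<int>& expected) {
  return std::equal(expected.begin(), expected.end(), output);
}
```

After stream.synchronize(), the check could be written as assert(matches_reference(output, reference_merge(input_1, N, input_2, N))). This works on the host because the example already reads and writes the arrays allocated with tf::cuda_malloc_shared from host code. For int items compared with a < b, equal items are indistinguishable, so the comparison does not depend on how ties are ordered.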
Since the GPU merge algorithm may require an extra buffer to store temporary results, you need to provide a buffer whose size is at least the value returned from tf::cudaDefaultExecutionPolicy::merge_bufsz. The buffer size depends only on the sizes of the two input ranges. You must keep the buffer alive until the merge completes.

Merge Two Sorted Ranges of Key-Value Items

tf::cuda_merge_by_key performs a key-value merge over two sorted ranges in a similar way to tf::cuda_merge. In addition to merging the keys, it copies the value associated with each key from the two input value ranges to the output value range. The following code performs a key-value merge over a and b:

    const size_t N = 2;
    int* a_keys = tf::cuda_malloc_shared<int>(N);
    int* a_vals = tf::cuda_malloc_shared<int>(N);
    int* b_keys = tf::cuda_malloc_shared<int>(N);
    int* b_vals = tf::cuda_malloc_shared<int>(N);
    int* c_keys = tf::cuda_malloc_shared<int>(2*N);
    int* c_vals = tf::cuda_malloc_shared<int>(2*N);

    // initializes the data (the keys of each range must be sorted)
    a_keys[0] = 1, a_keys[1] = 8;
    a_vals[0] = 2, a_vals[1] = 1;
    b_keys[0] = 3, b_keys[1] = 7;
    b_vals[0] = 3, b_vals[1] = 4;

    // create an execution policy
    tf::cudaStream stream;
    tf::cudaDefaultExecutionPolicy policy(stream);

    // queries the required buffer size to merge two N-element sorted vectors by keys
    auto bytes  = policy.merge_bufsz(N, N);
    auto buffer = tf::cuda_malloc_device<std::byte>(bytes);

    // merge keys and values of a and b to c
    tf::cuda_merge_by_key(
      policy,
      a_keys, a_keys + N, a_vals,
      b_keys, b_keys + N, b_vals,
      c_keys, c_vals,
      [] __device__ (int a, int b) { return a < b; },  // comparator
      buffer
    );

    // wait for the merge to complete
    stream.synchronize();

    // now, c_keys = {1, 3, 7, 8}
    // now, c_vals = {2, 3, 4, 1}

    // free the memory
    cudaFree(buffer);
    cudaFree(a_keys);
    cudaFree(b_keys);
    cudaFree(c_keys);
    cudaFree(a_vals);
    cudaFree(b_vals);
    cudaFree(c_vals);

As with tf::cuda_merge, you need to provide a buffer whose size is at least the value returned from tf::cudaDefaultExecutionPolicy::merge_bufsz. The buffer size depends only on the sizes of the two input ranges.
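To illustrate how values follow their keys in a key-value merge, the following is a sequential host-side sketch of the same idea in plain C++. It is not the GPU algorithm and does not call Taskflow; the function name reference_merge_by_key is hypothetical. Ties are taken from the first range here, mirroring std::merge. Whether tf::cuda_merge_by_key orders ties the same way is an assumption that the text above does not confirm.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical host-side reference (not a Taskflow API): merges two
// key ranges sorted under comp and carries each value along with its key.
template <typename K, typename V, typename C>
std::pair<std::vector<K>, std::vector<V>> reference_merge_by_key(
  const std::vector<K>& a_keys, const std::vector<V>& a_vals,
  const std::vector<K>& b_keys, const std::vector<V>& b_vals,
  C comp
) {
  // every key needs exactly one associated value
  assert(a_keys.size() == a_vals.size());
  assert(b_keys.size() == b_vals.size());

  std::vector<K> c_keys;
  std::vector<V> c_vals;
  c_keys.reserve(a_keys.size() + b_keys.size());
  c_vals.reserve(a_vals.size() + b_vals.size());

  std::size_t i = 0, j = 0;
  while (i < a_keys.size() && j < b_keys.size()) {
    // take from b only when its key is strictly smaller; ties go to a
    if (comp(b_keys[j], a_keys[i])) {
      c_keys.push_back(b_keys[j]);
      c_vals.push_back(b_vals[j]);
      ++j;
    } else {
      c_keys.push_back(a_keys[i]);
      c_vals.push_back(a_vals[i]);
      ++i;
    }
  }
  // append whatever remains in either range
  for (; i < a_keys.size(); ++i) {
    c_keys.push_back(a_keys[i]);
    c_vals.push_back(a_vals[i]);
  }
  for (; j < b_keys.size(); ++j) {
    c_keys.push_back(b_keys[j]);
    c_vals.push_back(b_vals[j]);
  }
  return {std::move(c_keys), std::move(c_vals)};
}
```

Called with keys {1, 8} and values {2, 1} for the first range, keys {3, 7} and values {3, 4} for the second, and a less-than comparator, this sketch returns keys {1, 3, 7, 8} and values {2, 3, 4, 1}, which matches the c_keys and c_vals listed in the example's comments.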