k-means Clustering (cudaFlow)

Following up on k-means Clustering, this page studies how to accelerate a k-means workload on a GPU using tf::cudaFlow.

Define the k-means Kernels

Recall that the k-means algorithm has the following steps:

Step 1: initialize k random centroids
Step 2: for every data point, find the nearest centroid (L2 distance or other measurements) and assign the point to it
Step 3: for every centroid, move the centroid to the average of the points assigned to that centroid
Step 4: go to Step 2 until converged (no more changes in the last few iterations) or the maximum number of iterations is reached

We observe that Step 2 and Step 3 of the algorithm are parallelizable across individual points, which lets us harness the power of the GPU:

for every data point, find the nearest centroid (L2 distance or other measurements) and assign the point to it
for every centroid, move the centroid to the average of the points assigned to that centroid

At a fine-grained level, we request one GPU thread to work on one point for Step 2 and one GPU thread to work on one centroid for Step 3.

// px/py: 2D points
// N: number of points
// mx/my: centroids
// K: number of clusters
// sx/sy/c: storage to compute the average

// squared L2 distance between (x1, y1) and (x2, y2)
__device__ float L2(float x1, float y1, float x2, float y2) {
  return (x1-x2)*(x1-x2) + (y1-y2)*(y1-y2);
}

__global__ void assign_clusters(
  float* px, float* py, int N,
  float* mx, float* my, float* sx, float* sy, int K, int* c
) {
  const int index = blockIdx.x*blockDim.x + threadIdx.x;
  if (index >= N) {
    return;
  }
  // Make global loads once.
  float x = px[index];
  float y = py[index];
  float best_d = FLT_MAX;
  int best_k = 0;
  for (int k = 0; k < K; ++k) {
    float d = L2(x, y, mx[k], my[k]);
    if (d < best_d) {
      best_d = d;
      best_k = k;
    }
  }
  atomicAdd(&sx[best_k], x);
  atomicAdd(&sy[best_k], y);
  atomicAdd(&c[best_k], 1);
}

// mx/my: centroids, sx/sy/c: storage to compute the average
__global__ void compute_new_means(
  float* mx, float* my, float* sx, float* sy, int* c
) {
  const int k = threadIdx.x;
  const int count = max(1, c[k]);  // turn 0/0 into 0/1
  mx[k] = sx[k] / count;
  my[k] = sy[k] / count;
}

When we recompute each cluster centroid to be the mean of all points assigned to it, multiple GPU threads may access the sum arrays, sx and sy, and the count array, c. To avoid data races, we use a simple atomicAdd method.

Define the k-means cudaFlow

Based on the two kernels, we can define the cudaFlow for the k-means workload below:

// N: number of points
// K: number of clusters
// M: number of iterations
// h_px/h_py: 2D point vectors
std::pair<std::vector<float>, std::vector<float>> kmeans_gpu(
  int N, int K, int M, const std::vector<float>& h_px, const std::vector<float>& h_py
) {

  std::vector<float> h_mx, h_my;
  float *d_px, *d_py, *d_mx, *d_my, *d_sx, *d_sy;
  int *d_c;

  // use the first K points as initial centroids
  for (int i = 0; i < K; ++i) {
    h_mx.push_back(h_px[i]);
    h_my.push_back(h_py[i]);
  }

  // create a taskflow graph
  tf::Executor executor;
  tf::Taskflow taskflow("K-Means");

  auto allocate_px = taskflow.emplace([&](){
    TF_CHECK_CUDA(cudaMalloc(&d_px, N*sizeof(float)), "failed to allocate d_px");
  }).name("allocate_px");

  auto allocate_py = taskflow.emplace([&](){
    TF_CHECK_CUDA(cudaMalloc(&d_py, N*sizeof(float)), "failed to allocate d_py");
  }).name("allocate_py");

  auto allocate_mx = taskflow.emplace([&](){
    TF_CHECK_CUDA(cudaMalloc(&d_mx, K*sizeof(float)), "failed to allocate d_mx");
  }).name("allocate_mx");

  auto allocate_my = taskflow.emplace([&](){
    TF_CHECK_CUDA(cudaMalloc(&d_my, K*sizeof(float)), "failed to allocate d_my");
  }).name("allocate_my");

  auto allocate_sx = taskflow.emplace([&](){
    TF_CHECK_CUDA(cudaMalloc(&d_sx, K*sizeof(float)), "failed to allocate d_sx");
  }).name("allocate_sx");

  auto allocate_sy = taskflow.emplace([&](){
    TF_CHECK_CUDA(cudaMalloc(&d_sy, K*sizeof(float)), "failed to allocate d_sy");
  }).name("allocate_sy");

  auto allocate_c = taskflow.emplace([&](){
    TF_CHECK_CUDA(cudaMalloc(&d_c, K*sizeof(int)), "failed to allocate d_c");
  }).name("allocate_c");

  auto h2d = taskflow.emplace([&](){
    cudaMemcpy(d_px, h_px.data(), N*sizeof(float), cudaMemcpyDefault);
    cudaMemcpy(d_py, h_py.data(), N*sizeof(float), cudaMemcpyDefault);
    cudaMemcpy(d_mx, h_mx.data(), K*sizeof(float), cudaMemcpyDefault);
    cudaMemcpy(d_my, h_my.data(), K*sizeof(float), cudaMemcpyDefault);
  }).name("h2d");

  auto kmeans = taskflow.emplace([&](){

    tf::cudaFlow cf;

    auto zero_c  = cf.zero(d_c,  K).name("zero_c");
    auto zero_sx = cf.zero(d_sx, K).name("zero_sx");
    auto zero_sy = cf.zero(d_sy, K).name("zero_sy");

    auto cluster = cf.kernel(
      (N+512-1)/512, 512, 0,
      assign_clusters, d_px, d_py, N, d_mx, d_my, d_sx, d_sy, K, d_c
    ).name("cluster");

    auto new_centroid = cf.kernel(
      1, K, 0,
      compute_new_means, d_mx, d_my, d_sx, d_sy, d_c
    ).name("new_centroid");

    cluster.precede(new_centroid)
           .succeed(zero_c, zero_sx, zero_sy);

    // repeat the execution for M times
    tf::cudaStream stream;
    for (int i = 0; i < M; i++) {
      cf.run(stream);
    }
    stream.synchronize();
  }).name("update_means");

  auto stop = taskflow.emplace([&](){
    cudaMemcpy(h_mx.data(), d_mx, K*sizeof(float), cudaMemcpyDefault);
    cudaMemcpy(h_my.data(), d_my, K*sizeof(float), cudaMemcpyDefault);
  }).name("d2h");

  auto free = taskflow.emplace([&](){
    cudaFree(d_px);
    cudaFree(d_py);
    cudaFree(d_mx);
    cudaFree(d_my);
    cudaFree(d_sx);
    cudaFree(d_sy);
    cudaFree(d_c);
  }).name("free");

  // build up the dependency
  h2d.succeed(allocate_px, allocate_py, allocate_mx, allocate_my);

  kmeans.succeed(allocate_sx, allocate_sy, allocate_c, h2d)
        .precede(stop);

  stop.precede(free);

  // run the taskflow
  executor.run(taskflow).wait();

  // dump the k-means graph
  taskflow.dump(std::cout);

  return {h_mx, h_my};
}

The first dump before executing the taskflow produces the following diagram. The condition task introduces a cycle between itself and update_means.
Each time the execution goes back to update_means, the cudaFlow is reconstructed from the parameters captured in the closure and offloaded to the GPU. The main cudaFlow task, update_means, must not run before all required data has settled down. It precedes a condition task that circles back to itself until we reach M iterations. When the iteration completes, the condition task directs the execution path to the d2h task, which copies the resulting centroids to h_mx and h_my, and all GPU memory is then deallocated.

Benchmarking

We run three versions of k-means (sequential CPU, parallel CPU, and GPU) on a machine with 12 Intel i7-8700 CPUs at 3.20 GHz and an Nvidia RTX 2080 GPU, using various numbers of 2D points and iterations.

N        K     M        CPU Sequential   CPU Parallel   GPU
10       5     10       0.14 ms          77 ms          1 ms
100      10    100      0.56 ms          86 ms          7 ms
1000     10    1000     10 ms            98 ms          55 ms
10000    10    10000    1006 ms          713 ms         458 ms
100000   10    100000   102483 ms        49966 ms       7952 ms
When the number of points is larger than 10K, both the parallel CPU and GPU implementations start to outperform the sequential version. We can also see that using the built-in predicate, tf::cudaFlow::offload_n, avoids repetitively re-creating the graph over and over, making the execution about two times faster than conditional tasking.