GPU Tasking (cudaFlowCapturer)

You can create a cudaFlow through stream capture, which allows you to implicitly capture a CUDA graph using a stream-based interface. Compared to explicit CUDA graph construction (tf::cudaFlow), implicit CUDA graph capturing (tf::cudaFlowCapturer) is more flexible for building GPU task graphs.

Include the Header

You need to include the header file, taskflow/cuda/cudaflow.hpp, to capture a GPU task graph using tf::cudaFlowCapturer.

#include <taskflow/cuda/cudaflow.hpp>

Capture a cudaFlow

When your program has no access to direct kernel calls but can only invoke them through a stream-based interface (e.g., cuBLAS and cuDNN library functions), you can use tf::cudaFlowCapturer to capture the hidden GPU operations into a CUDA graph. A cudaFlowCapturer is similar to a cudaFlow except that it constructs a GPU task graph through stream capture. You use the method tf::cudaFlowCapturer::on to capture a sequence of asynchronous GPU operations through the given stream. The following example creates a CUDA graph that captures two kernel tasks, task_1 (my_kernel_1) and task_2 (my_kernel_2), where task_1 runs before task_2.
// create a cudaFlow capturer to run a CUDA graph using stream capturing
tf::cudaFlowCapturer capturer;

// capture my_kernel_1 through a stream managed by the capturer
tf::cudaTask task_1 = capturer.on([&](cudaStream_t stream){
  my_kernel_1<<<grid_1, block_1, shm_size_1, stream>>>(my_parameters_1);
}).name("my_kernel_1");

// capture my_kernel_2 through a stream managed by the capturer
tf::cudaTask task_2 = capturer.on([&](cudaStream_t stream){
  my_kernel_2<<<grid_2, block_2, shm_size_2, stream>>>(my_parameters_2);
}).name("my_kernel_2");

// my_kernel_1 runs before my_kernel_2
task_1.precede(task_2);

// offload the captured GPU tasks using the CUDA graph execution model
tf::cudaStream stream;
capturer.run(stream);
stream.synchronize();

// dump the cudaFlow to a DOT format through std::cout
capturer.dump(std::cout);

Inside tf::cudaFlowCapturer::on, you should NOT modify the properties of the stream argument but only use it to capture asynchronous GPU operations (e.g., kernel launches, cudaMemcpyAsync). The stream argument is for the capturer's internal use only.

Common Capture Methods

tf::cudaFlowCapturer defines a set of methods for capturing common GPU operations, such as tf::cudaFlowCapturer::kernel, tf::cudaFlowCapturer::memcpy, tf::cudaFlowCapturer::memset, and so on. For example, the following code snippet uses these pre-defined methods to construct a GPU task graph of one host-to-device copy, one kernel, and one device-to-host copy, in this order of their dependencies.
tf::cudaFlowCapturer capturer;

// copy data from host_data to gpu_data
tf::cudaTask h2d = capturer.memcpy(gpu_data, host_data, bytes)
                           .name("h2d");

// capture my_kernel to do computation on gpu_data
tf::cudaTask kernel = capturer.kernel(grid, block, shm_size, my_kernel, kernel_args)
                              .name("my_kernel");

// copy data from gpu_data to host_data
tf::cudaTask d2h = capturer.memcpy(host_data, gpu_data, bytes)
                           .name("d2h");

// build task dependencies
h2d.precede(kernel);
kernel.precede(d2h);

Create a Capturer on a Specific GPU

You can run a cudaFlow capturer on a specific GPU by switching to the context of that GPU using tf::cudaScopedDevice, following the CUDA convention of multi-GPU programming. The example below creates a cudaFlow capturer and runs it on GPU 2:

{
  // create an RAII-styled switcher to the context of GPU 2
  tf::cudaScopedDevice context(2);

  // create a cudaFlow capturer under GPU 2
  tf::cudaFlowCapturer capturer;
  // ...

  // create a stream under GPU 2 and offload the capturer to that GPU
  tf::cudaStream stream;
  capturer.run(stream);
  stream.synchronize();
}

tf::cudaScopedDevice is an RAII-styled wrapper that performs a scoped switch to the given GPU context. When the scope is destroyed, it switches back to the original context. By default, a cudaFlow capturer runs on the current GPU associated with the caller, which is typically GPU 0.

Create a Capturer from a cudaFlow

Within a parent cudaFlow, you can capture a cudaFlow to form a subflow that eventually becomes a child node in the underlying CUDA task graph. The following example defines a captured flow task2 of two dependent tasks, task2_1 and task2_2, where task2 runs after task1.

tf::cudaFlow cudaflow;
tf::cudaTask task1 = cudaflow.kernel(grid, block, shm, my_kernel, args...)
                             .name("kernel");

// task2 forms a subflow as a child node in the underlying CUDA graph
tf::cudaTask task2 = cudaflow.capture([&](tf::cudaFlowCapturer& capturer){

  // capture kernel_1 using the given stream
  tf::cudaTask task2_1 = capturer.on([&](cudaStream_t stream){
    kernel_1<<<grid1, block1, shm_size1, stream>>>(args1...);
  }).name("kernel_1");

  // capture kernel_2 using the given stream
  tf::cudaTask task2_2 = capturer.on([&](cudaStream_t stream){
    kernel_2<<<grid2, block2, shm_size2, stream>>>(args2...);
  }).name("kernel_2");

  // kernel_1 runs before kernel_2
  task2_1.precede(task2_2);

}).name("capturer");

task1.precede(task2);

Offload a cudaFlow Capturer

When you offload a cudaFlow capturer using tf::cudaFlowCapturer::run, the runtime transforms that capturer (i.e., the application GPU task graph) into a native CUDA graph and an executable instance, both optimized for maximum kernel concurrency. Depending on the optimization algorithm, the application GPU task graph may differ from the actual executable graph submitted to the CUDA runtime.

tf::cudaStream stream;

// launch a cudaFlow capturer asynchronously through a stream
capturer.run(stream);

// wait for the cudaFlow to finish
stream.synchronize();

Update a cudaFlow Capturer

Between successive offloads (i.e., executions of a cudaFlow capturer), you can update the captured tasks with different sets of parameters. Every task-creation method in tf::cudaFlowCapturer has an overload to update the parameters of a task created by that method. The following example creates a kernel task and updates its parameters between successive runs:

tf::cudaStream stream;
tf::cudaFlowCapturer cf;

// create a kernel task
tf::cudaTask task = cf.kernel(grid1, block1, shm1, kernel, kernel_args_1);
cf.run(stream);
stream.synchronize();

// update the created kernel task with different parameters
cf.kernel(task, grid2, block2, shm2, kernel, kernel_args_2);
cf.run(stream);
stream.synchronize();

When you run an updated cudaFlow capturer, Taskflow first tries to update the underlying executable graph with the newly captured graph.
If that update is unsuccessful, Taskflow destroys the executable graph and re-instantiates a new one from the newly captured graph.

Integrate a cudaFlow Capturer into Taskflow

You can create a task to enclose a cudaFlow capturer and run it from a worker thread. The usage of the capturer remains the same, except that the capturer is run by a worker thread from a taskflow task. The following example runs a cudaFlow capturer from a static task:

tf::Executor executor;
tf::Taskflow taskflow;

taskflow.emplace([](){
  // create a cudaFlow capturer inside a static task
  tf::cudaFlowCapturer capturer;

  // ... capture a GPU task graph
  capturer.kernel(...);

  // run the capturer through a stream
  tf::cudaStream stream;
  capturer.run(stream);
  stream.synchronize();
});

executor.run(taskflow).wait();
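To tie the pieces above together, the following is a minimal end-to-end sketch of a capturer built inside a static task, combining the pre-defined memcpy method with tf::cudaFlowCapturer::on. It assumes the Taskflow library, a CUDA-capable device, and an nvcc toolchain; the saxpy kernel, its launch shape, and the buffer sizes are hypothetical choices for illustration.

```cpp
#include <taskflow/cuda/cudaflow.hpp>
#include <vector>

// a hypothetical kernel: y[i] = a * x[i] + y[i]
__global__ void saxpy(int n, float a, const float* x, float* y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if(i < n) {
    y[i] = a * x[i] + y[i];
  }
}

int main() {

  const int N = 1 << 20;
  std::vector<float> hx(N, 1.0f), hy(N, 2.0f);

  float *dx{nullptr}, *dy{nullptr};
  cudaMalloc(&dx, N * sizeof(float));
  cudaMalloc(&dy, N * sizeof(float));

  tf::Executor executor;
  tf::Taskflow taskflow;

  taskflow.emplace([&](){

    tf::cudaFlowCapturer capturer;

    // host-to-device copies using the pre-defined memcpy capture method
    tf::cudaTask h2d_x = capturer.memcpy(dx, hx.data(), N * sizeof(float))
                                 .name("h2d_x");
    tf::cudaTask h2d_y = capturer.memcpy(dy, hy.data(), N * sizeof(float))
                                 .name("h2d_y");

    // capture the kernel through the stream managed by the capturer
    tf::cudaTask kernel = capturer.on([&](cudaStream_t stream){
      saxpy<<<(N + 255) / 256, 256, 0, stream>>>(N, 2.0f, dx, dy);
    }).name("saxpy");

    // device-to-host copy of the result
    tf::cudaTask d2h = capturer.memcpy(hy.data(), dy, N * sizeof(float))
                               .name("d2h");

    // h2d_x and h2d_y run before the kernel; the kernel runs before d2h
    kernel.succeed(h2d_x, h2d_y)
          .precede(d2h);

    // offload the captured graph and wait for it to finish
    tf::cudaStream stream;
    capturer.run(stream);
    stream.synchronize();

  }).name("capturer_task");

  executor.run(taskflow).wait();

  cudaFree(dx);
  cudaFree(dy);
}
```

The two host-to-device copies have no dependency between each other, so the optimized executable graph is free to overlap them before the kernel runs.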