tf::cudaFlowCapturer
Header: taskflow/cuda/cuda_capturer.hpp

Nested types:
  tf::cudaFlowCapturer::External
  tf::cudaFlowCapturer::Internal

using tf::cudaFlowCapturer::handle_t = std::variant<External, Internal>

using tf::cudaFlowCapturer::Optimizer = std::variant<cudaFlowRoundRobinOptimizer, cudaFlowSequentialOptimizer, cudaFlowLinearOptimizer>

friend class cudaFlow
friend class Executor

cudaFlowGraph tf::cudaFlowCapturer::_cfg
Optimizer tf::cudaFlowCapturer::_optimizer
cudaGraphExec tf::cudaFlowCapturer::_exe {nullptr}

tf::cudaFlowCapturer::cudaFlowCapturer() = default
  constructs a standalone cudaFlowCapturer
  A standalone cudaFlow capturer does not go through any taskflow and can be run by the caller thread using tf::cudaFlowCapturer::run.

tf::cudaFlowCapturer::~cudaFlowCapturer() = default
  destructs the cudaFlowCapturer

tf::cudaFlowCapturer::cudaFlowCapturer(cudaFlowCapturer&&) = default
  default move constructor

cudaFlowCapturer& tf::cudaFlowCapturer::operator=(cudaFlowCapturer&&) = default
  default move assignment operator

bool tf::cudaFlowCapturer::empty() const
  queries the emptiness of the graph

size_t tf::cudaFlowCapturer::num_tasks() const
  queries the number of tasks

void tf::cudaFlowCapturer::clear()
  clears this cudaFlow capturer

void tf::cudaFlowCapturer::dump(std::ostream& os) const
  dumps the cudaFlow graph into a DOT format through an output stream

void tf::cudaFlowCapturer::dump_native_graph(std::ostream& os) const
  dumps the native captured graph into a DOT format through an output stream

template <typename C, std::enable_if_t<std::is_invocable_r_v<void, C, cudaStream_t>, void>* = nullptr>
cudaTask tf::cudaFlowCapturer::on(C&& callable)
  captures a sequence of CUDA operations from the given callable
  Template parameters:
    C: callable type constructible with std::function<void(cudaStream_t)>
  Parameters:
    callable: a callable to capture CUDA operations with the stream
  This method applies a stream created by the flow to capture a sequence of CUDA operations defined in the callable.

template <typename C, std::enable_if_t<std::is_invocable_r_v<void, C, cudaStream_t>, void>* = nullptr>
void tf::cudaFlowCapturer::on(cudaTask task, C&& callable)
  updates a capture task to another sequence of CUDA operations
  The method is similar to cudaFlowCapturer::on but operates on an existing task.

cudaTask tf::cudaFlowCapturer::noop()
  captures a no-operation task
  Returns: a tf::cudaTask handle
  An empty node performs no operation during execution, but can be used for transitive ordering.
  For example, a phased execution graph with 2 groups of n nodes with a barrier between them can be represented using an empty node and 2*n dependency edges, rather than no empty node and n^2 dependency edges.

void tf::cudaFlowCapturer::noop(cudaTask task)
  updates a task to a no-operation task
  The method is similar to tf::cudaFlowCapturer::noop but operates on an existing task.

cudaTask tf::cudaFlowCapturer::memcpy(void* dst, const void* src, size_t count)
  copies data between host and device asynchronously through a stream
  Parameters:
    dst: destination memory address
    src: source memory address
    count: size in bytes to copy
  The method captures a cudaMemcpyAsync operation through an internal stream.

void tf::cudaFlowCapturer::memcpy(cudaTask task, void* dst, const void* src, size_t count)
  updates a capture task to a memcpy operation
  The method is similar to cudaFlowCapturer::memcpy but operates on an existing task.

template <typename T, std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr>
cudaTask tf::cudaFlowCapturer::copy(T* tgt, const T* src, size_t num)
  captures a copy task of typed data
  Template parameters:
    T: element type (non-void)
  Parameters:
    tgt: pointer to the target memory block
    src: pointer to the source memory block
    num: number of elements to copy
  Returns: cudaTask handle
  A copy task transfers num*sizeof(T) bytes of data from a source location to a target location. Direction can be arbitrary among CPUs and GPUs.

template <typename T, std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr>
void tf::cudaFlowCapturer::copy(cudaTask task, T* tgt, const T* src, size_t num)
  updates a capture task to a copy operation
  The method is similar to cudaFlowCapturer::copy but operates on an existing task.
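The difference between copy (counted in elements) and memcpy (counted in bytes) can be sketched on the host. The helpers below, typed_copy_bytes and host_copy, are hypothetical names introduced only for illustration; they use std::memcpy on host memory and do not call Taskflow or CUDA. They only mirror the num*sizeof(T) arithmetic stated for the copy task.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical helper: number of bytes moved by a typed copy of `num` elements.
template <typename T>
constexpr std::size_t typed_copy_bytes(std::size_t num) {
  static_assert(!std::is_same_v<T, void>, "T must be a non-void element type");
  return num * sizeof(T);
}

// Hypothetical host-side analogue of copy(tgt, src, num): the element count
// is converted to a byte count and handed to the byte-oriented std::memcpy.
template <typename T>
void host_copy(T* tgt, const T* src, std::size_t num) {
  static_assert(std::is_trivially_copyable_v<T>,
                "std::memcpy requires a trivially copyable type");
  std::memcpy(tgt, src, typed_copy_bytes<T>(num));
}
```

Passing an element count where a byte count is expected (or the reverse) is an easy mistake when switching between the two methods, so keeping the conversion in one place may help.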

cudaTask tf::cudaFlowCapturer::memset(void* ptr, int v, size_t n)
  initializes or sets GPU memory to the given value byte by byte
  Parameters:
    ptr: pointer to GPU memory
    v: value to set for each byte of the specified memory
    n: size in bytes to set
  The method captures a cudaMemsetAsync operation through an internal stream to fill the first n bytes of the memory area pointed to by ptr with the constant byte value v.

void tf::cudaFlowCapturer::memset(cudaTask task, void* ptr, int value, size_t n)
  updates a capture task to a memset operation
  The method is similar to cudaFlowCapturer::memset but operates on an existing task.

template <typename F, typename... ArgsT>
cudaTask tf::cudaFlowCapturer::kernel(dim3 g, dim3 b, size_t s, F f, ArgsT&&... args)
  captures a kernel
  Template parameters:
    F: kernel function type
    ArgsT: kernel function parameters type
  Parameters:
    g: configured grid
    b: configured block
    s: configured shared memory size in bytes
    f: kernel function
    args: arguments to forward to the kernel function by copy
  Returns: cudaTask handle

template <typename F, typename... ArgsT>
void tf::cudaFlowCapturer::kernel(cudaTask task, dim3 g, dim3 b, size_t s, F f, ArgsT&&... args)
  updates a capture task to a kernel operation
  The method is similar to cudaFlowCapturer::kernel but operates on an existing task.

template <typename C>
cudaTask tf::cudaFlowCapturer::single_task(C c)
  captures a kernel that runs the given callable with only one thread
  Template parameters:
    C: callable type
  Parameters:
    c: callable to run by a single kernel thread

template <typename C>
void tf::cudaFlowCapturer::single_task(cudaTask task, C c)
  updates a capture task to a single-threaded kernel
  This method is similar to cudaFlowCapturer::single_task but operates on an existing task.
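The "byte by byte" wording in the memset entry above is easy to overlook when the buffer holds multi-byte elements. The host-only sketch below uses std::memset, which has the same byte-wise contract on host memory, to show what a 32-bit word looks like after every one of its bytes is set. The function name fill_words_bytewise is hypothetical, and nothing here calls CUDA or Taskflow.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical host analogue: set every byte of `num_words` 32-bit words
// to the byte value derived from `v`, then return the resulting words.
inline std::vector<std::uint32_t> fill_words_bytewise(std::size_t num_words, int v) {
  std::vector<std::uint32_t> words(num_words, 0u);
  // std::memset converts v to unsigned char and writes it to each byte.
  std::memset(words.data(), v, num_words * sizeof(std::uint32_t));
  return words;
}
```

With v = 1, each word becomes 0x01010101 rather than 1, so a byte-wise fill is mainly suitable for zeroing or for patterns where every byte is the same.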

template <typename I, typename C>
cudaTask tf::cudaFlowCapturer::for_each(I first, I last, C callable)
  captures a kernel that applies a callable to each dereferenced element of the data array
  Template parameters:
    I: iterator type
    C: callable type
  Parameters:
    first: iterator to the beginning
    last: iterator to the end
    callable: a callable object to apply to the dereferenced iterator
  Returns: cudaTask handle
  This method is equivalent to the parallel execution of the following loop on a GPU:

    for(auto itr = first; itr != last; itr++) {
      callable(*itr);
    }

template <typename I, typename C>
void tf::cudaFlowCapturer::for_each(cudaTask task, I first, I last, C callable)
  updates a capture task to a for-each kernel task
  This method is similar to cudaFlowCapturer::for_each but operates on an existing task.

template <typename I, typename C>
cudaTask tf::cudaFlowCapturer::for_each_index(I first, I last, I step, C callable)
  captures a kernel that applies a callable to each index in the range with the step size
  Template parameters:
    I: index type
    C: callable type
  Parameters:
    first: beginning index
    last: last index (exclusive)
    step: step size
    callable: the callable to apply to each index in the range
  Returns: cudaTask handle
  This method is equivalent to the parallel execution of the following loop on a GPU:

    // step is positive [first, last)
    for(auto i = first; i < last; i += step) {
      callable(i);
    }

    // step is negative [first, last)
    for(auto i = first; i > last; i += step) {
      callable(i);
    }

template <typename I, typename C>
void tf::cudaFlowCapturer::for_each_index(cudaTask task, I first, I last, I step, C callable)
  updates a capture task to a for-each-index kernel task
  This method is similar to cudaFlowCapturer::for_each_index but operates on an existing task.
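The index range visited by for_each_index depends on the sign of step, which can be checked with a sequential host reference. The sketch below is only a CPU model of the two loops shown above; for_each_index_reference and visited_indices are hypothetical names, and the code does not use Taskflow or CUDA.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sequential reference for the indices visited by for_each_index.
// A positive step walks upward over [first, last); a negative step walks
// downward from first while the index stays greater than last.
template <typename I, typename C>
void for_each_index_reference(I first, I last, I step, C callable) {
  assert(step != 0 && "a zero step would never terminate");
  if (step > 0) {
    for (I i = first; i < last; i += step) {
      callable(i);
    }
  } else {
    for (I i = first; i > last; i += step) {
      callable(i);
    }
  }
}

// Convenience wrapper that records the visited indices in visiting order.
inline std::vector<int> visited_indices(int first, int last, int step) {
  std::vector<int> out;
  for_each_index_reference(first, last, step,
                           [&out](int i) { out.push_back(i); });
  return out;
}
```

On a GPU the callable runs in parallel, so the reference fixes only which indices are visited, not the order in which they are processed.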

template <typename I, typename O, typename C>
cudaTask tf::cudaFlowCapturer::transform(I first, I last, O output, C op)
  captures a kernel that transforms an input range to an output range
  Template parameters:
    I: input iterator type
    O: output iterator type
    C: unary operator type
  Parameters:
    first: iterator to the beginning of the input range
    last: iterator to the end of the input range
    output: iterator to the beginning of the output range
    op: unary operator to apply to transform each item in the range
  Returns: cudaTask handle
  This method is equivalent to the parallel execution of the following loop on a GPU:

    while(first != last) {
      *output++ = op(*first++);
    }

template <typename I, typename O, typename C>
void tf::cudaFlowCapturer::transform(cudaTask task, I first, I last, O output, C op)
  updates a capture task to a transform kernel task
  This method is similar to cudaFlowCapturer::transform but operates on an existing task.

template <typename I1, typename I2, typename O, typename C>
cudaTask tf::cudaFlowCapturer::transform(I1 first1, I1 last1, I2 first2, O output, C op)
  captures a kernel that transforms two input ranges to an output range
  Template parameters:
    I1: first input iterator type
    I2: second input iterator type
    O: output iterator type
    C: binary operator type
  Parameters:
    first1: iterator to the beginning of the first input range
    last1: iterator to the end of the first input range
    first2: iterator to the beginning of the second input range
    output: iterator to the beginning of the output range
    op: binary operator to apply to transform each pair of items in the two input ranges
  Returns: cudaTask handle
  This method is equivalent to the parallel execution of the following loop on a GPU:

    while(first1 != last1) {
      *output++ = op(*first1++, *first2++);
    }

template <typename I1, typename I2, typename O, typename C>
void tf::cudaFlowCapturer::transform(cudaTask task, I1 first1, I1 last1, I2 first2, O output, C op)
  updates a capture task to a transform kernel task
  This method is similar to cudaFlowCapturer::transform but operates on an existing task.

template <typename OPT, typename... ArgsT>
OPT& tf::cudaFlowCapturer::make_optimizer(ArgsT&&... args)
  selects a different optimization algorithm
  Template parameters:
    OPT: optimizer type
    ArgsT: argument types
  Parameters:
    args: arguments to forward to construct the optimizer
  Returns: a reference to the optimizer
  The following optimization algorithms are currently supported for capturing a user-described cudaFlow:
    tf::cudaFlowSequentialOptimizer
    tf::cudaFlowRoundRobinOptimizer
    tf::cudaFlowLinearOptimizer
  By default, tf::cudaFlowCapturer uses the round-robin optimization algorithm with four streams to transform a user-level graph into a native CUDA graph.

cudaGraph_t tf::cudaFlowCapturer::capture()
  captures the cudaFlow and turns it into a CUDA Graph

void tf::cudaFlowCapturer::run(cudaStream_t stream)
  offloads the cudaFlowCapturer onto a GPU asynchronously via a stream
  Parameters:
    stream: stream for performing this operation
  Offloads the present cudaFlowCapturer onto a GPU asynchronously via the given stream. An offloaded cudaFlowCapturer forces the underlying graph to be instantiated. After instantiation, do not modify the graph topology; only node parameters may be updated.

cudaGraph_t tf::cudaFlowCapturer::native_graph()
  acquires a reference to the underlying CUDA graph

cudaGraphExec_t tf::cudaFlowCapturer::native_executable()
  acquires a reference to the underlying CUDA graph executable

Class description

class to create a cudaFlow graph using stream capture

The usage of tf::cudaFlowCapturer is similar to tf::cudaFlow, except users can call the method tf::cudaFlowCapturer::on to capture a sequence of asynchronous CUDA operations through the given stream. The following example creates a CUDA graph that captures two kernel tasks, task_1 and task_2, where task_1 runs before task_2.

    taskflow.emplace([](tf::cudaFlowCapturer& capturer){

      // capture my_kernel_1 through the given stream managed by the capturer
      auto task_1 = capturer.on([&](cudaStream_t stream){
        my_kernel_1<<<grid_1, block_1, shm_size_1, stream>>>(my_parameters_1);
      });

      // capture my_kernel_2 through the given stream managed by the capturer
      auto task_2 = capturer.on([&](cudaStream_t stream){
        my_kernel_2<<<grid_2, block_2, shm_size_2, stream>>>(my_parameters_2);
      });

      task_1.precede(task_2);
    });

Similar to tf::cudaFlow, a cudaFlowCapturer is a task (tf::Task) created from tf::Taskflow and will be run by one worker thread in the executor. That is, the callable that describes a cudaFlowCapturer will be executed sequentially. Inside a cudaFlow capturer task, different GPU tasks (tf::cudaTask) may run in parallel depending on the selected optimization algorithm. By default, we use tf::cudaFlowRoundRobinOptimizer to transform a user-level graph into a native CUDA graph.

Please refer to GPU Tasking (cudaFlowCapturer) for details.
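To give a rough intuition for what a round-robin assignment over a fixed number of streams could look like, the host-only sketch below places the i-th task of a topological order on stream i % num_streams and counts the dependency edges whose endpoints end up on different streams. This is an assumption-laden illustration, not Taskflow's implementation: the actual tf::cudaFlowRoundRobinOptimizer may order and group tasks differently, and the names assign_round_robin and count_cross_stream_edges are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical: assign the i-th task (in some topological order) to
// stream i % num_streams. Returns the stream index chosen for each task.
inline std::vector<std::size_t> assign_round_robin(std::size_t num_tasks,
                                                   std::size_t num_streams) {
  assert(num_streams > 0 && "at least one stream is required");
  std::vector<std::size_t> stream_of(num_tasks);
  for (std::size_t i = 0; i < num_tasks; ++i) {
    stream_of[i] = i % num_streams;
  }
  return stream_of;
}

// Count dependency edges (u -> v) whose endpoints sit on different streams.
// Under stream capture, such an edge needs explicit cross-stream
// synchronization, whereas same-stream edges are ordered by the stream.
inline std::size_t count_cross_stream_edges(
    const std::vector<std::pair<std::size_t, std::size_t>>& edges,
    const std::vector<std::size_t>& stream_of) {
  std::size_t crossing = 0;
  for (const auto& e : edges) {
    if (stream_of[e.first] != stream_of[e.second]) {
      ++crossing;
    }
  }
  return crossing;
}
```

Under these assumptions, more streams expose more potential parallelism but tend to increase the number of cross-stream edges, which is one reason a sequential or linear strategy may suit mostly linear graphs better.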