tf::cudaFlow, defined in taskflow/cuda/cudaflow.hpp

Data members:

cudaFlowGraph _cfg
cudaGraphExec _exe {nullptr}

Member functions:

tf::cudaFlow::cudaFlow()
    constructs a cudaFlow

tf::cudaFlow::~cudaFlow() = default
    destroys the cudaFlow and its associated native CUDA graph and
    executable graph

tf::cudaFlow::cudaFlow(cudaFlow&&) = default
    default move constructor

cudaFlow& tf::cudaFlow::operator=(cudaFlow&&) = default
    default move assignment operator

bool tf::cudaFlow::empty() const
    queries the emptiness of the graph

size_t tf::cudaFlow::num_tasks() const
    queries the number of tasks

void tf::cudaFlow::clear()
    clears the cudaFlow object

void tf::cudaFlow::dump(std::ostream& os) const
    dumps the cudaFlow graph into DOT format through an output stream

void tf::cudaFlow::dump_native_graph(std::ostream& os) const
    dumps the native CUDA graph into DOT format through an output stream.
    The native CUDA graph may be different from the upper-level cudaFlow
    graph when flow capture is involved.

cudaTask tf::cudaFlow::noop()
    creates a no-operation task
    Returns: a tf::cudaTask handle
    An empty node performs no operation during execution, but can be used
    for transitive ordering. For example, a phased execution graph with two
    groups of n nodes and a barrier between them can be represented using
    an empty node and 2*n dependency edges, rather than no empty node and
    n^2 dependency edges.

template <typename C>
cudaTask tf::cudaFlow::host(C&& callable)
    creates a host task that runs a callable on the host
    C: callable type
    callable: a callable object with neither arguments nor return
              (i.e., constructible from std::function<void()>)
    Returns: a tf::cudaTask handle
    A host task can only execute CPU-specific functions and cannot make any
    CUDA calls (e.g., cudaMalloc).
template <typename C>
void tf::cudaFlow::host(cudaTask task, C&& callable)
    updates parameters of a host task
    The method is similar to tf::cudaFlow::host but operates on a task of
    type tf::cudaTaskType::HOST.

template <typename F, typename... ArgsT>
cudaTask tf::cudaFlow::kernel(dim3 g, dim3 b, size_t s, F f, ArgsT... args)
    creates a kernel task
    F: kernel function type
    ArgsT: kernel function parameter types
    g: configured grid
    b: configured block
    s: configured shared memory size in bytes
    f: kernel function
    args: arguments to forward to the kernel function by copy
    Returns: a tf::cudaTask handle

template <typename F, typename... ArgsT>
void tf::cudaFlow::kernel(cudaTask task, dim3 g, dim3 b, size_t shm, F f, ArgsT... args)
    updates parameters of a kernel task
    The method is similar to tf::cudaFlow::kernel but operates on a task of
    type tf::cudaTaskType::KERNEL. The kernel function name must NOT change.

cudaTask tf::cudaFlow::memset(void* dst, int v, size_t count)
    creates a memset task that fills untyped data with a byte value
    dst: pointer to the destination device memory area
    v: value to set for each byte of the specified memory
    count: size in bytes to set
    Returns: a tf::cudaTask handle
    A memset task fills the first count bytes of the device memory area
    pointed to by dst with the byte value v.

void tf::cudaFlow::memset(cudaTask task, void* dst, int ch, size_t count)
    updates parameters of a memset task
    The method is similar to tf::cudaFlow::memset but operates on a task of
    type tf::cudaTaskType::MEMSET. The source/destination memory may have
    different address values but must be allocated from the same contexts
    as the original source/destination memory.
cudaTask tf::cudaFlow::memcpy(void* tgt, const void* src, size_t bytes)
    creates a memcpy task that copies untyped data in bytes
    tgt: pointer to the target memory block
    src: pointer to the source memory block
    bytes: number of bytes to copy
    Returns: a tf::cudaTask handle
    A memcpy task transfers bytes of data from a source location to a
    target location. The direction can be arbitrary among CPUs and GPUs.

void tf::cudaFlow::memcpy(cudaTask task, void* tgt, const void* src, size_t bytes)
    updates parameters of a memcpy task
    The method is similar to tf::cudaFlow::memcpy but operates on a task of
    type tf::cudaTaskType::MEMCPY. The source/destination memory may have
    different address values but must be allocated from the same contexts
    as the original source/destination memory.

template <typename T,
          std::enable_if_t<is_pod_v<T> && (sizeof(T)==1 || sizeof(T)==2 || sizeof(T)==4), void>* = nullptr>
cudaTask tf::cudaFlow::zero(T* dst, size_t count)
    creates a memset task that sets a typed memory block to zero
    T: element type (size of T must be either 1, 2, or 4)
    dst: pointer to the destination device memory area
    count: number of elements
    Returns: a tf::cudaTask handle
    A zero task zeroes the first count elements of type T in the device
    memory area pointed to by dst.

template <typename T,
          std::enable_if_t<is_pod_v<T> && (sizeof(T)==1 || sizeof(T)==2 || sizeof(T)==4), void>* = nullptr>
void tf::cudaFlow::zero(cudaTask task, T* dst, size_t count)
    updates parameters of a memset task to a zero task
    The method is similar to tf::cudaFlow::zero but operates on a task of
    type tf::cudaTaskType::MEMSET. The source/destination memory may have
    different address values but must be allocated from the same contexts
    as the original source/destination memory.
template <typename T,
          std::enable_if_t<is_pod_v<T> && (sizeof(T)==1 || sizeof(T)==2 || sizeof(T)==4), void>* = nullptr>
cudaTask tf::cudaFlow::fill(T* dst, T value, size_t count)
    creates a memset task that fills a typed memory block with a value
    T: element type (size of T must be either 1, 2, or 4)
    dst: pointer to the destination device memory area
    value: value to fill for each element of type T
    count: number of elements
    Returns: a tf::cudaTask handle
    A fill task fills the first count elements of type T with value in the
    device memory area pointed to by dst. The value to fill is interpreted
    in type T rather than in bytes.

template <typename T,
          std::enable_if_t<is_pod_v<T> && (sizeof(T)==1 || sizeof(T)==2 || sizeof(T)==4), void>* = nullptr>
void tf::cudaFlow::fill(cudaTask task, T* dst, T value, size_t count)
    updates parameters of a memset task to a fill task
    The method is similar to tf::cudaFlow::fill but operates on a task of
    type tf::cudaTaskType::MEMSET. The source/destination memory may have
    different address values but must be allocated from the same contexts
    as the original source/destination memory.

template <typename T, std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr>
cudaTask tf::cudaFlow::copy(T* tgt, const T* src, size_t num)
    creates a memcpy task that copies typed data
    T: element type (non-void)
    tgt: pointer to the target memory block
    src: pointer to the source memory block
    num: number of elements to copy
    Returns: a tf::cudaTask handle
    A copy task transfers num*sizeof(T) bytes of data from a source
    location to a target location. The direction can be arbitrary among
    CPUs and GPUs.
template <typename T, std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr>
void tf::cudaFlow::copy(cudaTask task, T* tgt, const T* src, size_t num)
    updates parameters of a memcpy task to a copy task
    The method is similar to tf::cudaFlow::copy but operates on a task of
    type tf::cudaTaskType::MEMCPY. The source/destination memory may have
    different address values but must be allocated from the same contexts
    as the original source/destination memory.

void tf::cudaFlow::run(cudaStream_t stream)
    offloads the cudaFlow onto a GPU asynchronously via a stream
    stream: stream for performing this operation
    Offloads the present cudaFlow onto a GPU asynchronously via the given
    stream. An offloaded cudaFlow forces the underlying graph to be
    instantiated. After the instantiation, you should not modify the graph
    topology but may update node parameters.

cudaGraph_t tf::cudaFlow::native_graph()
    acquires a reference to the underlying CUDA graph

cudaGraphExec_t tf::cudaFlow::native_executable()
    acquires a reference to the underlying CUDA graph executable

template <typename C>
cudaTask tf::cudaFlow::single_task(C c)
    runs a callable with only a single kernel thread
    C: callable type
    c: callable to run by a single kernel thread
    Returns: a tf::cudaTask handle

template <typename C>
void tf::cudaFlow::single_task(cudaTask task, C c)
    updates a single-threaded kernel task
    This method is similar to tf::cudaFlow::single_task but operates on an
    existing task.
template <typename I, typename C>
cudaTask tf::cudaFlow::for_each(I first, I last, C callable)
    applies a callable to each dereferenced element of the data array
    I: iterator type
    C: callable type
    first: iterator to the beginning (inclusive)
    last: iterator to the end (exclusive)
    callable: a callable object to apply to the dereferenced iterator
    Returns: a tf::cudaTask handle
    This method is equivalent to the parallel execution of the following
    loop on a GPU:

    for(auto itr = first; itr != last; itr++) {
      callable(*itr);
    }

template <typename I, typename C>
void tf::cudaFlow::for_each(cudaTask task, I first, I last, C callable)
    updates parameters of a kernel task created from tf::cudaFlow::for_each
    The type of the iterators and the callable must be the same as the task
    created from tf::cudaFlow::for_each.

template <typename I, typename C>
cudaTask tf::cudaFlow::for_each_index(I first, I last, I step, C callable)
    applies a callable to each index in the range with the step size
    I: index type
    C: callable type
    first: beginning index
    last: last index
    step: step size
    callable: the callable to apply to each index in the range
    Returns: a tf::cudaTask handle
    This method is equivalent to the parallel execution of the following
    loops on a GPU:

    // positive step: [first, last)
    for(auto i = first; i < last; i += step) {
      callable(i);
    }
    // negative step: [first, last)
    for(auto i = first; i > last; i += step) {
      callable(i);
    }

template <typename I, typename C>
void tf::cudaFlow::for_each_index(cudaTask task, I first, I last, I step, C callable)
    updates parameters of a kernel task created from tf::cudaFlow::for_each_index
    The type of the iterators and the callable must be the same as the task
    created from tf::cudaFlow::for_each_index.
template <typename I, typename O, typename C>
cudaTask tf::cudaFlow::transform(I first, I last, O output, C op)
    applies a callable to a source range and stores the result in a target range
    I: input iterator type
    O: output iterator type
    C: unary operator type
    first: iterator to the beginning of the input range
    last: iterator to the end of the input range
    output: iterator to the beginning of the output range
    op: the operator to apply to transform each element in the range
    Returns: a tf::cudaTask handle
    This method is equivalent to the parallel execution of the following
    loop on a GPU:

    while(first != last) {
      *output++ = op(*first++);
    }

template <typename I, typename O, typename C>
void tf::cudaFlow::transform(cudaTask task, I first, I last, O output, C c)
    updates parameters of a kernel task created from tf::cudaFlow::transform
    The type of the iterators and the callable must be the same as the task
    created from tf::cudaFlow::transform.
template <typename I1, typename I2, typename O, typename C>
cudaTask tf::cudaFlow::transform(I1 first1, I1 last1, I2 first2, O output, C op)
    creates a task to perform parallel transforms over two ranges of items
    I1: first input iterator type
    I2: second input iterator type
    O: output iterator type
    C: binary operator type
    first1: iterator to the beginning of the first input range
    last1: iterator to the end of the first input range
    first2: iterator to the beginning of the second input range
    output: iterator to the beginning of the output range
    op: binary operator to apply to transform each pair of items in the
        two input ranges
    Returns: a tf::cudaTask handle
    This method is equivalent to the parallel execution of the following
    loop on a GPU:

    while(first1 != last1) {
      *output++ = op(*first1++, *first2++);
    }

template <typename I1, typename I2, typename O, typename C>
void tf::cudaFlow::transform(cudaTask task, I1 first1, I1 last1, I2 first2, O output, C c)
    updates parameters of a kernel task created from tf::cudaFlow::transform
    The type of the iterators and the callable must be the same as the task
    created from tf::cudaFlow::transform.

template <typename C>
cudaTask tf::cudaFlow::capture(C&& callable)
    constructs a subflow graph through tf::cudaFlowCapturer
    C: callable type constructible from std::function<void(tf::cudaFlowCapturer&)>
    callable: the callable to construct a capture flow
    Returns: a tf::cudaTask handle
    A captured subflow forms a sub-graph of the cudaFlow and can be used to
    capture custom (or third-party) kernels that cannot be directly
    constructed from the cudaFlow.
    Example usage:

    taskflow.emplace([&](tf::cudaFlow& cf){

      tf::cudaTask my_kernel = cf.kernel(my_arguments);

      // create a flow capturer to capture custom kernels
      tf::cudaTask my_subflow = cf.capture([&](tf::cudaFlowCapturer& capturer){
        capturer.on([&](cudaStream_t stream){
          invoke_custom_kernel_with_stream(stream, custom_arguments);
        });
      });

      my_kernel.precede(my_subflow);
    });

template <typename C>
void tf::cudaFlow::capture(cudaTask task, C callable)
    updates the captured child graph
    The method is similar to tf::cudaFlow::capture but operates on a task
    of type tf::cudaTaskType::SUBFLOW. The new captured graph must be
    topologically identical to the original captured graph.

Class description:

tf::cudaFlow — class to create a cudaFlow task dependency graph

A cudaFlow is a high-level interface over CUDA Graph for performing GPU
operations using the task dependency graph model. The class provides a set
of methods for creating and launching different tasks on one or multiple
CUDA devices, for instance, kernel tasks, data transfer tasks, and memory
operation tasks. The following example creates a cudaFlow of two kernel
tasks, task1 and task2, where task1 runs before task2.

tf::Taskflow taskflow;
tf::Executor executor;

taskflow.emplace([&](tf::cudaFlow& cf){
  // create two kernel tasks
  tf::cudaTask task1 = cf.kernel(grid1, block1, shm_size1, kernel1, args1);
  tf::cudaTask task2 = cf.kernel(grid2, block2, shm_size2, kernel2, args2);
  // kernel1 runs before kernel2
  task1.precede(task2);
});

executor.run(taskflow).wait();

A cudaFlow is a task (tf::Task) created from tf::Taskflow and will be run
by one worker thread in the executor. That is, the callable that describes
a cudaFlow is executed sequentially. Inside a cudaFlow task, different GPU
tasks (tf::cudaTask) may run in parallel, scheduled by the CUDA runtime.

Please refer to GPU Tasking (cudaFlow) for details.