<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<title>Cookbook &raquo; GPU Tasking (cudaFlow) | Taskflow QuickStart</title>
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Source+Sans+Pro:400,400i,600,600i%7CSource+Code+Pro:400,400i,600" />
<link rel="stylesheet" href="m-dark+documentation.compiled.css" />
<link rel="icon" href="favicon.ico" type="image/vnd.microsoft.icon" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta name="theme-color" content="#22272e" />
</head>
<body>
<header><nav id="navigation">
<div class="m-container">
<div class="m-row">
<span id="m-navbar-brand" class="m-col-t-8 m-col-m-none m-left-m">
<a href="https://taskflow.github.io"><img src="taskflow_logo.png" alt="" />Taskflow</a> <span class="m-breadcrumb">|</span> <a href="index.html" class="m-thin">QuickStart</a>
</span>
<div class="m-col-t-4 m-hide-m m-text-right m-nopadr">
<a href="#search" class="m-doc-search-icon" title="Search" onclick="return showSearch()"><svg style="height: 0.9rem;" viewBox="0 0 16 16">
<path id="m-doc-search-icon-path" d="m6 0c-3.31 0-6 2.69-6 6 0 3.31 2.69 6 6 6 1.49 0 2.85-0.541 3.89-1.44-0.0164 0.338 0.147 0.759 0.5 1.15l3.22 3.79c0.552 0.614 1.45 0.665 2 0.115 0.55-0.55 0.499-1.45-0.115-2l-3.79-3.22c-0.392-0.353-0.812-0.515-1.15-0.5 0.895-1.05 1.44-2.41 1.44-3.89 0-3.31-2.69-6-6-6zm0 1.56a4.44 4.44 0 0 1 4.44 4.44 4.44 4.44 0 0 1-4.44 4.44 4.44 4.44 0 0 1-4.44-4.44 4.44 4.44 0 0 1 4.44-4.44z"/>
</svg></a>
<a id="m-navbar-show" href="#navigation" title="Show navigation"></a>
<a id="m-navbar-hide" href="#" title="Hide navigation"></a>
</div>
<div id="m-navbar-collapse" class="m-col-t-12 m-show-m m-col-m-none m-right-m">
<div class="m-row">
<ol class="m-col-t-6 m-col-m-none">
<li><a href="pages.html">Handbook</a></li>
<li><a href="namespaces.html">Namespaces</a></li>
</ol>
<ol class="m-col-t-6 m-col-m-none" start="3">
<li><a href="annotated.html">Classes</a></li>
<li><a href="files.html">Files</a></li>
<li class="m-show-m"><a href="#search" class="m-doc-search-icon" title="Search" onclick="return showSearch()"><svg style="height: 0.9rem;" viewBox="0 0 16 16">
<use href="#m-doc-search-icon-path" />
</svg></a></li>
</ol>
</div>
</div>
</div>
</div>
</nav></header>
<main><article>
<div class="m-container m-container-inflatable">
<div class="m-row">
<div class="m-col-l-10 m-push-l-1">
<h1>
<span class="m-breadcrumb"><a href="Cookbook.html">Cookbook</a> &raquo;</span>
GPU Tasking (cudaFlow)
</h1>
<nav class="m-block m-default">
<h3>Contents</h3>
<ul>
<li><a href="#GPUTaskingcudaFlowIncludeTheHeader">Include the Header</a></li>
<li><a href="#WhatIsACudaGraph">What is a CUDA Graph?</a></li>
<li><a href="#Create_a_cudaFlow">Create a cudaFlow</a></li>
<li><a href="#Compile_a_cudaFlow_program">Compile a cudaFlow Program</a></li>
<li><a href="#run_a_cudaflow_on_a_specific_gpu">Run a cudaFlow on Specific GPU</a></li>
<li><a href="#GPUMemoryOperations">Create Memory Operation Tasks</a></li>
<li><a href="#OffloadAcudaFlow">Offload a cudaFlow</a></li>
<li><a href="#UpdateAcudaFlow">Update a cudaFlow</a></li>
<li><a href="#IntegrateCudaFlowIntoTaskflow">Integrate a cudaFlow into Taskflow</a></li>
</ul>
</nav>
<p>Modern scientific computing typically leverages GPU-powered parallel processing cores to speed up large-scale applications. This chapter discusses how to implement CPU-GPU heterogeneous tasking algorithms with <a href="https://developer.nvidia.com/cuda-zone">Nvidia CUDA</a>.</p><section id="GPUTaskingcudaFlowIncludeTheHeader"><h2><a href="#GPUTaskingcudaFlowIncludeTheHeader">Include the Header</a></h2><p>You need to include the header file <code>taskflow/cuda/cudaflow.hpp</code> to create a GPU task graph using <a href="classtf_1_1cudaFlow.html" class="m-doc">tf::<wbr />cudaFlow</a>.</p><pre class="m-code"><span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;taskflow/cuda/cudaflow.hpp&gt;</span></pre></section><section id="WhatIsACudaGraph"><h2><a href="#WhatIsACudaGraph">What is a CUDA Graph?</a></h2><p>CUDA Graph is an execution model that enables a series of CUDA kernels to be defined and encapsulated as a single unit, i.e., a task graph of operations, rather than a sequence of individually-launched operations. This organization allows launching multiple GPU operations through a single CPU operation and hence reduces the launch overhead, especially for short-running kernels. The benefit of CUDA Graph is demonstrated in the figure below:</p><img class="m-image" src="cuda_graph_benefit.png" alt="Image" /><p>In this example, a sequence of short kernels is launched one-by-one by the CPU. The CPU launching overhead creates a significant gap between the kernels. If we replace this sequence of kernels with a CUDA graph, we initially spend a little extra time building the graph and launching it in one go, but subsequent executions are very fast because there is very little gap between the kernels. The difference is more pronounced when the same sequence of operations is repeated many times, for example, over many training epochs in machine learning workloads. In that case, the initial cost of building and launching the graph is amortized over all training iterations.</p><aside class="m-note m-info"><h4>Note</h4><p>A comprehensive introduction to CUDA Graph can be found in the <a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs">CUDA Graph Programming Guide</a>.</p></aside></section><section id="Create_a_cudaFlow"><h2><a href="#Create_a_cudaFlow">Create a cudaFlow</a></h2><p>Taskflow leverages <a href="https://developer.nvidia.com/blog/cuda-graphs/">CUDA Graph</a> to enable concurrent CPU-GPU tasking using a task graph model called <a href="classtf_1_1cudaFlow.html" class="m-doc">tf::<wbr />cudaFlow</a>. A cudaFlow manages a CUDA graph explicitly to execute dependent GPU operations in a single CPU call. The following example implements a cudaFlow that performs a saxpy (A·X Plus Y) workload:</p><pre class="m-code"><span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;taskflow/cuda/cudaflow.hpp&gt;</span>
<span class="c1">// saxpy (single-precision A·X Plus Y) kernel</span>
<span class="n">__global__</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="n">saxpy</span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="n">a</span><span class="p">,</span><span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="o">*</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="o">*</span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">blockIdx</span><span class="p">.</span><span class="n">x</span><span class="o">*</span><span class="n">blockDim</span><span class="p">.</span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">y</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">a</span><span class="o">*</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">y</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
<span class="c1">// main function begins</span>
<span class="kt">int</span><span class="w"> </span><span class="n">main</span><span class="p">()</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="kt">unsigned</span><span class="w"> </span><span class="n">N</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="o">&lt;&lt;</span><span class="mi">20</span><span class="p">;</span><span class="w"> </span><span class="c1">// size of the vector</span>
<span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">&gt;</span><span class="w"> </span><span class="n">hx</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="mf">1.0f</span><span class="p">);</span><span class="w"> </span><span class="c1">// x vector at host</span>
<span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">&gt;</span><span class="w"> </span><span class="n">hy</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="mf">2.0f</span><span class="p">);</span><span class="w"> </span><span class="c1">// y vector at host</span>
<span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="o">*</span><span class="n">dx</span><span class="p">{</span><span class="k">nullptr</span><span class="p">};</span><span class="w"> </span><span class="c1">// x vector at device</span>
<span class="w"> </span><span class="kt">float</span><span class="w"> </span><span class="o">*</span><span class="n">dy</span><span class="p">{</span><span class="k">nullptr</span><span class="p">};</span><span class="w"> </span><span class="c1">// y vector at device</span>
<span class="w"> </span>
<span class="w"> </span><span class="n">cudaMalloc</span><span class="p">(</span><span class="o">&amp;</span><span class="n">dx</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="o">*</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>
<span class="w"> </span><span class="n">cudaMalloc</span><span class="p">(</span><span class="o">&amp;</span><span class="n">dy</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="o">*</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">float</span><span class="p">));</span>
<span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cudaFlow</span><span class="w"> </span><span class="n">cudaflow</span><span class="p">;</span>
<span class="w"> </span>
<span class="w"> </span><span class="c1">// create data transfer tasks</span>
<span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span><span class="w"> </span><span class="n">h2d_x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cudaflow</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">dx</span><span class="p">,</span><span class="w"> </span><span class="n">hx</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span><span class="w"> </span><span class="n">N</span><span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">&quot;h2d_x&quot;</span><span class="p">);</span><span class="w"> </span>
<span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span><span class="w"> </span><span class="n">h2d_y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cudaflow</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">dy</span><span class="p">,</span><span class="w"> </span><span class="n">hy</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span><span class="w"> </span><span class="n">N</span><span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">&quot;h2d_y&quot;</span><span class="p">);</span>
<span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span><span class="w"> </span><span class="n">d2h_x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cudaflow</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">hx</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span><span class="w"> </span><span class="n">dx</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">&quot;d2h_x&quot;</span><span class="p">);</span>
<span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span><span class="w"> </span><span class="n">d2h_y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cudaflow</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">hy</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span><span class="w"> </span><span class="n">dy</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">&quot;d2h_y&quot;</span><span class="p">);</span>
<span class="w"> </span><span class="c1">// launch saxpy&lt;&lt;&lt;(N+255)/256, 256, 0&gt;&gt;&gt;(N, 2.0f, dx, dy)</span>
<span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span><span class="w"> </span><span class="n">kernel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cudaflow</span><span class="p">.</span><span class="n">kernel</span><span class="p">(</span>
<span class="w"> </span><span class="p">(</span><span class="n">N</span><span class="o">+</span><span class="mi">255</span><span class="p">)</span><span class="o">/</span><span class="mi">256</span><span class="p">,</span><span class="w"> </span><span class="mi">256</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="n">saxpy</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="mf">2.0f</span><span class="p">,</span><span class="w"> </span><span class="n">dx</span><span class="p">,</span><span class="w"> </span><span class="n">dy</span>
<span class="w"> </span><span class="p">).</span><span class="n">name</span><span class="p">(</span><span class="s">&quot;saxpy&quot;</span><span class="p">);</span>
<span class="w"> </span><span class="n">kernel</span><span class="p">.</span><span class="n">succeed</span><span class="p">(</span><span class="n">h2d_x</span><span class="p">,</span><span class="w"> </span><span class="n">h2d_y</span><span class="p">)</span>
<span class="w"> </span><span class="p">.</span><span class="n">precede</span><span class="p">(</span><span class="n">d2h_x</span><span class="p">,</span><span class="w"> </span><span class="n">d2h_y</span><span class="p">);</span>
<span class="w"> </span><span class="c1">// run the cudaflow through a stream</span>
<span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cudaStream</span><span class="w"> </span><span class="n">stream</span><span class="p">;</span>
<span class="w"> </span><span class="n">cudaflow</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">stream</span><span class="p">)</span>
<span class="w"> </span><span class="n">stream</span><span class="p">.</span><span class="n">synchronize</span><span class="p">();</span>
<span class="w"> </span>
<span class="w"> </span><span class="c1">// dump the cudaflow</span>
<span class="w"> </span><span class="n">cudaflow</span><span class="p">.</span><span class="n">dump</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">cout</span><span class="p">);</span>
<span class="p">}</span></pre><p>The cudaFlow graph consists of two CPU-to-GPU data copies (<code>h2d_x</code> and <code>h2d_y</code>), one kernel (<code>saxpy</code>), and two GPU-to-CPU data copies (<code>d2h_x</code> and <code>d2h_y</code>), in this order of their task dependencies.</p><div class="m-graph"><svg style="width: 24.200rem; height: 9.800rem;" viewBox="0.00 0.00 242.31 98.00">
<g transform="scale(1 1) rotate(0) translate(4 94)">
<title>Taskflow</title>
<g class="m-node m-flat">
<title>p0x7f2870401a50</title>
<ellipse cx="27.08" cy="-72" rx="27.16" ry="18"/>
<text text-anchor="middle" x="27.08" y="-69.5" font-family="Helvetica,sans-Serif" font-size="10.00">h2d_x</text>
</g>
<g class="m-node">
<title>p0x7f2870402bc0</title>
<polygon points="144.16,-63 94.16,-63 90.16,-59 90.16,-27 140.16,-27 144.16,-31 144.16,-63"/>
<polyline points="140.16,-59 90.16,-59 "/>
<polyline points="140.16,-59 140.16,-27 "/>
<polyline points="140.16,-59 144.16,-63 "/>
<text text-anchor="middle" x="117.16" y="-42.5" font-family="Helvetica,sans-Serif" font-size="10.00" fill="white">saxpy</text>
</g>
<g class="m-edge">
<title>p0x7f2870401a50&#45;&gt;p0x7f2870402bc0</title>
<path d="M52.15,-64.62C60.84,-61.96 70.85,-58.89 80.34,-55.98"/>
<polygon points="81.42,-59.31 89.95,-53.03 79.37,-52.62 81.42,-59.31"/>
</g>
<g class="m-node m-flat">
<title>p0x7f2870402310</title>
<ellipse cx="207.24" cy="-72" rx="27.16" ry="18"/>
<text text-anchor="middle" x="207.24" y="-69.5" font-family="Helvetica,sans-Serif" font-size="10.00">d2h_x</text>
</g>
<g class="m-edge">
<title>p0x7f2870402bc0&#45;&gt;p0x7f2870402310</title>
<path d="M144.58,-53.1C153.37,-55.79 163.27,-58.83 172.52,-61.67"/>
<polygon points="171.64,-65.06 182.23,-64.64 173.69,-58.36 171.64,-65.06"/>
</g>
<g class="m-node m-flat">
<title>p0x7f2870402780</title>
<ellipse cx="207.24" cy="-18" rx="27.16" ry="18"/>
<text text-anchor="middle" x="207.24" y="-15.5" font-family="Helvetica,sans-Serif" font-size="10.00">d2h_y</text>
</g>
<g class="m-edge">
<title>p0x7f2870402bc0&#45;&gt;p0x7f2870402780</title>
<path d="M144.58,-36.9C153.37,-34.21 163.27,-31.17 172.52,-28.33"/>
<polygon points="173.69,-31.64 182.23,-25.36 171.64,-24.94 173.69,-31.64"/>
</g>
<g class="m-node m-flat">
<title>p0x7f2870401eb0</title>
<ellipse cx="27.08" cy="-18" rx="27.16" ry="18"/>
<text text-anchor="middle" x="27.08" y="-15.5" font-family="Helvetica,sans-Serif" font-size="10.00">h2d_y</text>
</g>
<g class="m-edge">
<title>p0x7f2870401eb0&#45;&gt;p0x7f2870402bc0</title>
<path d="M52.15,-25.38C60.84,-28.04 70.85,-31.11 80.34,-34.02"/>
<polygon points="79.37,-37.38 89.95,-36.97 81.42,-30.69 79.37,-37.38"/>
</g>
</g>
</svg>
</div><p>We do not attempt to simplify kernel programming itself; instead, we focus on tasking CUDA operations and their dependencies. In other words, <a href="classtf_1_1cudaFlow.html" class="m-doc">tf::<wbr />cudaFlow</a> is a lightweight C++ abstraction over CUDA Graph. This organization lets users take full advantage of the CUDA features that match their domain knowledge, while leaving the difficult details of task parallelism to Taskflow.</p></section><section id="Compile_a_cudaFlow_program"><h2><a href="#Compile_a_cudaFlow_program">Compile a cudaFlow Program</a></h2><p>Use <a href="https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html">nvcc</a> to compile a cudaFlow program:</p><pre class="m-console"><span class="go">~$ nvcc -std=c++17 my_cudaflow.cu -I path/to/include/taskflow -O2 -o my_cudaflow</span>
<span class="go">~$ ./my_cudaflow</span></pre><p>Please visit the page <a href="CompileTaskflowWithCUDA.html" class="m-doc">Compile Taskflow with CUDA</a> for more details.</p></section><section id="run_a_cudaflow_on_a_specific_gpu"><h2><a href="#run_a_cudaflow_on_a_specific_gpu">Run a cudaFlow on Specific GPU</a></h2><p>By default, a cudaFlow runs on the current GPU context associated with the caller, which is typically GPU <code>0</code>. Each CUDA GPU has an integer identifier in the range of <code>[0, N)</code> to represent the context of that GPU, where <code>N</code> is the number of GPUs in the system. You can run a cudaFlow on a specific GPU by switching the context to a different GPU using <a href="classtf_1_1cudaScopedDevice.html" class="m-doc">tf::<wbr />cudaScopedDevice</a>. The code below creates a cudaFlow and runs it on GPU <code>2</code>.</p><pre class="m-code"><span class="p">{</span>
<span class="w"> </span><span class="c1">// create an RAII-styled switcher to the context of GPU 2</span>
<span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cudaScopedDevice</span><span class="w"> </span><span class="nf">context</span><span class="p">(</span><span class="mi">2</span><span class="p">);</span>
<span class="w"> </span><span class="c1">// create a cudaFlow capturer under GPU 2</span>
<span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cudaFlowCapturer</span><span class="w"> </span><span class="n">capturer</span><span class="p">;</span>
<span class="w"> </span><span class="c1">// ...</span>
<span class="w"> </span><span class="c1">// create a stream under GPU 2 and offload the capturer to that GPU</span>
<span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cudaStream</span><span class="w"> </span><span class="n">stream</span><span class="p">;</span>
<span class="w"> </span><span class="n">capturer</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">stream</span><span class="p">);</span>
<span class="w"> </span><span class="n">stream</span><span class="p">.</span><span class="n">synchronize</span><span class="p">();</span>
<span class="p">}</span></pre><p><a href="classtf_1_1cudaScopedDevice.html" class="m-doc">tf::<wbr />cudaScopedDevice</a> is an RAII-styled wrapper to perform <em>scoped</em> switch to the given GPU context. When the scope is destroyed, it switches back to the original context.</p><aside class="m-note m-warning"><h4>Attention</h4><p><a href="classtf_1_1cudaScopedDevice.html" class="m-doc">tf::<wbr />cudaScopedDevice</a> allows you to place a cudaFlow on a particular GPU device, but it is your responsibility to ensure correct memory access. For example, you may not allocate a memory block on GPU <code>2</code> while accessing it from a kernel on GPU <code>0</code>. An easy practice for multi-GPU programming is to allocate <em>unified shared memory</em> using <code>cudaMallocManaged</code> and let the CUDA runtime perform automatic memory migration between GPUs.</p></aside></section><section id="GPUMemoryOperations"><h2><a href="#GPUMemoryOperations">Create Memory Operation Tasks</a></h2><p>cudaFlow provides a set of methods for users to manipulate device memory. There are two categories, <em>raw</em> data and <em>typed</em> data. Raw data operations are methods with prefix <code>mem</code>, such as <code>memcpy</code> and <code>memset</code>, that operate in <em>bytes</em>. Typed data operations such as <code>copy</code>, <code>fill</code>, and <code>zero</code>, take <em>logical count</em> of elements. For instance, the following three methods have the same result of zeroing <code>sizeof(int)*count</code> bytes of the device memory area pointed to by <code>target</code>.</p><pre class="m-code"><span class="kt">int</span><span class="o">*</span><span class="w"> </span><span class="n">target</span><span class="p">;</span>
<span class="n">cudaMalloc</span><span class="p">(</span><span class="o">&amp;</span><span class="n">target</span><span class="p">,</span><span class="w"> </span><span class="n">count</span><span class="o">*</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">int</span><span class="p">));</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaFlow</span><span class="w"> </span><span class="n">cudaflow</span><span class="p">;</span>
<span class="n">memset_target</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cudaflow</span><span class="p">.</span><span class="n">memset</span><span class="p">(</span><span class="n">target</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="k">sizeof</span><span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">count</span><span class="p">);</span>
<span class="n">same_as_above</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cudaflow</span><span class="p">.</span><span class="n">fill</span><span class="p">(</span><span class="n">target</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="n">count</span><span class="p">);</span>
<span class="n">same_as_above_again</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cudaflow</span><span class="p">.</span><span class="n">zero</span><span class="p">(</span><span class="n">target</span><span class="p">,</span><span class="w"> </span><span class="n">count</span><span class="p">);</span></pre><p>The method <a href="classtf_1_1cudaFlow.html#a21d4447bc834f4d3e1bb4772c850d090" class="m-doc">tf::<wbr />cudaFlow::<wbr />fill</a> is a more powerful variant of <a href="classtf_1_1cudaFlow.html#a079ca65da35301e5aafd45878a19e9d2" class="m-doc">tf::<wbr />cudaFlow::<wbr />memset</a>. It can fill a memory area with any value of type <code>T</code>, given that <code>sizeof(T)</code> is 1, 2, or 4 bytes. The following example creates a GPU task to fill <code>count</code> elements in the array <code>target</code> with value <code>1234</code>.</p><pre class="m-code"><span class="n">cf</span><span class="p">.</span><span class="n">fill</span><span class="p">(</span><span class="n">target</span><span class="p">,</span><span class="w"> </span><span class="mi">1234</span><span class="p">,</span><span class="w"> </span><span class="n">count</span><span class="p">);</span></pre><p>Similar concept applies to <a href="classtf_1_1cudaFlow.html#ad37637606f0643f360e9eda1f9a6e559" class="m-doc">tf::<wbr />cudaFlow::<wbr />memcpy</a> and <a href="classtf_1_1cudaFlow.html#af03e04771b655f9e629eb4c22e19b19f" class="m-doc">tf::<wbr />cudaFlow::<wbr />copy</a> as well. The following two methods are equivalent to each other.</p><pre class="m-code"><span class="n">cudaflow</span><span class="p">.</span><span class="n">memcpy</span><span class="p">(</span><span class="n">target</span><span class="p">,</span><span class="w"> </span><span class="n">source</span><span class="p">,</span><span class="w"> </span><span class="k">sizeof</span><span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">count</span><span class="p">);</span>
<span class="n">cudaflow</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">target</span><span class="p">,</span><span class="w"> </span><span class="n">source</span><span class="p">,</span><span class="w"> </span><span class="n">count</span><span class="p">);</span></pre></section><section id="OffloadAcudaFlow"><h2><a href="#OffloadAcudaFlow">Offload a cudaFlow</a></h2><p>To offload a cudaFlow to a GPU, you need to use <a href="classtf_1_1cudaFlow.html#ae6810f7de27e5a347331aacfce67bea1" class="m-doc">tf::<wbr />cudaFlow::<wbr />run</a> and pass a <a href="classtf_1_1cudaStream.html" class="m-doc">tf::<wbr />cudaStream</a> created on that GPU. The run method is asynchronous and can be explicitly synchronized through the given stream.</p><pre class="m-code"><span class="n">tf</span><span class="o">::</span><span class="n">cudaStream</span><span class="w"> </span><span class="n">stream</span><span class="p">;</span>
<span class="c1">// launch a cudaflow asynchronously through a stream</span>
<span class="n">cudaflow</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">stream</span><span class="p">);</span>
<span class="c1">// wait for the cudaflow to finish</span>
<span class="n">stream</span><span class="p">.</span><span class="n">synchronize</span><span class="p">();</span></pre><p>When you offload a cudaFlow using <a href="classtf_1_1cudaFlow.html#ae6810f7de27e5a347331aacfce67bea1" class="m-doc">tf::<wbr />cudaFlow::<wbr />run</a>, the runtime transforms that cudaFlow (i.e., application GPU task graph) into a native executable instance and submit it to the CUDA runtime for execution. There is always an one-to-one mapping between cudaFlow and its native CUDA graph representation (except those constructed by using <a href="classtf_1_1cudaFlowCapturer.html" class="m-doc">tf::<wbr />cudaFlowCapturer</a>).</p></section><section id="UpdateAcudaFlow"><h2><a href="#UpdateAcudaFlow">Update a cudaFlow</a></h2><p>Many GPU applications require you to launch a cudaFlow multiple times and update node parameters (e.g., kernel parameters and memory addresses) between iterations. cudaFlow allows you to update the parameters of created tasks and run the updated cudaFlow with new parameters. Every task-creation method in <a href="classtf_1_1cudaFlow.html" class="m-doc">tf::<wbr />cudaFlow</a> has an overload to update the parameters of a created task by that method.</p><pre class="m-code"><span class="n">tf</span><span class="o">::</span><span class="n">cudaStream</span><span class="w"> </span><span class="n">stream</span><span class="p">;</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaFlow</span><span class="w"> </span><span class="n">cf</span><span class="p">;</span>
<span class="c1">// create a kernel task</span>
<span class="n">tf</span><span class="o">::</span><span class="n">cudaTask</span><span class="w"> </span><span class="n">task</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cf</span><span class="p">.</span><span class="n">kernel</span><span class="p">(</span><span class="n">grid1</span><span class="p">,</span><span class="w"> </span><span class="n">block1</span><span class="p">,</span><span class="w"> </span><span class="n">shm1</span><span class="p">,</span><span class="w"> </span><span class="n">kernel</span><span class="p">,</span><span class="w"> </span><span class="n">kernel_args_1</span><span class="p">);</span>
<span class="n">cf</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">stream</span><span class="p">);</span>
<span class="n">stream</span><span class="p">.</span><span class="n">synchronize</span><span class="p">();</span>
<span class="c1">// update the created kernel task with different parameters</span>
<span class="n">cf</span><span class="p">.</span><span class="n">kernel</span><span class="p">(</span><span class="n">task</span><span class="p">,</span><span class="w"> </span><span class="n">grid2</span><span class="p">,</span><span class="w"> </span><span class="n">block2</span><span class="p">,</span><span class="w"> </span><span class="n">shm2</span><span class="p">,</span><span class="w"> </span><span class="n">kernel</span><span class="p">,</span><span class="w"> </span><span class="n">kernel_args_2</span><span class="p">);</span>
<span class="n">cf</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">stream</span><span class="p">);</span>
<span class="n">stream</span><span class="p">.</span><span class="n">synchronize</span><span class="p">();</span></pre><p>Between successive offloads (i.e., iterative executions of a cudaFlow), you can <em>ONLY</em> update task parameters, such as changing the kernel execution parameters and memory operation parameters. However, you must <em>NOT</em> change the topology of the cudaFlow, such as adding a new task or adding a new dependency. This is the limitation of CUDA Graph.</p><aside class="m-note m-warning"><h4>Attention</h4><p>There are a few restrictions on updating task parameters in a cudaFlow. Notably, you must <em>NOT</em> change the topology of an offloaded graph. In addition, update methods have the following limitations:</p><ul><li>kernel task<ul><li>The kernel function is not allowed to change. This restriction applies to all algorithm tasks that are created using lambda.</li></ul></li><li>memset and memcpy tasks:<ul><li>The CUDA device(s) to which the operand(s) was allocated/mapped cannot change</li><li>The source/destination memory must be allocated from the same contexts as the original source/destination memory.</li></ul></li></ul></aside></section><section id="IntegrateCudaFlowIntoTaskflow"><h2><a href="#IntegrateCudaFlowIntoTaskflow">Integrate a cudaFlow into Taskflow</a></h2><p>You can create a task to enclose a cudaFlow and run it from a worker thread. The usage of the cudaFlow remains the same except that the cudaFlow is run by a worker thread from a taskflow task. The following example runs a cudaFlow from a static task:</p><pre class="m-code"><span class="n">tf</span><span class="o">::</span><span class="n">Executor</span><span class="w"> </span><span class="n">executor</span><span class="p">;</span>
<span class="n">tf</span><span class="o">::</span><span class="n">Taskflow</span><span class="w"> </span><span class="n">taskflow</span><span class="p">;</span>
<span class="n">taskflow</span><span class="p">.</span><span class="n">emplace</span><span class="p">([](){</span>
<span class="w"> </span><span class="c1">// create a cudaFlow inside a static task</span>
<span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cudaFlow</span><span class="w"> </span><span class="n">cudaflow</span><span class="p">;</span>
<span class="w"> </span><span class="c1">// ... create a kernel task</span>
<span class="w"> </span><span class="n">cudaflow</span><span class="p">.</span><span class="n">kernel</span><span class="p">(...);</span>
<span class="w"> </span>
<span class="w"> </span><span class="c1">// run the capturer through a stream</span>
<span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cudaStream</span><span class="w"> </span><span class="n">stream</span><span class="p">;</span>
<span class="w"> </span><span class="n">capturer</span><span class="p">.</span><span class="n">run</span><span class="p">(</span><span class="n">stream</span><span class="p">);</span>
<span class="w"> </span><span class="n">stream</span><span class="p">.</span><span class="n">synchronize</span><span class="p">();</span>
<span class="p">});</span></pre></section>
</div>
</div>
</div>
</article></main>
<div class="m-doc-search" id="search">
<a href="#!" onclick="return hideSearch()"></a>
<div class="m-container">
<div class="m-row">
<div class="m-col-m-8 m-push-m-2">
<div class="m-doc-search-header m-text m-small">
<div><span class="m-label m-default">Tab</span> / <span class="m-label m-default">T</span> to search, <span class="m-label m-default">Esc</span> to close</div>
<div id="search-symbolcount">&hellip;</div>
</div>
<div class="m-doc-search-content">
<form>
<input type="search" name="q" id="search-input" placeholder="Loading &hellip;" disabled="disabled" autofocus="autofocus" autocomplete="off" spellcheck="false" />
</form>
<noscript class="m-text m-danger m-text-center">Unlike everything else in the docs, the search functionality <em>requires</em> JavaScript.</noscript>
<div id="search-help" class="m-text m-dim m-text-center">
<p class="m-noindent">Search for symbols, directories, files, pages or
modules. You can omit any prefix from the symbol or file path; adding a
<code>:</code> or <code>/</code> suffix lists all members of given symbol or
directory.</p>
<p class="m-noindent">Use <span class="m-label m-dim">&darr;</span>
/ <span class="m-label m-dim">&uarr;</span> to navigate through the list,
<span class="m-label m-dim">Enter</span> to go.
<span class="m-label m-dim">Tab</span> autocompletes common prefix, you can
copy a link to the result using <span class="m-label m-dim">⌘</span>
<span class="m-label m-dim">L</span> while <span class="m-label m-dim">⌘</span>
<span class="m-label m-dim">M</span> produces a Markdown link.</p>
</div>
<div id="search-notfound" class="m-text m-warning m-text-center">Sorry, nothing was found.</div>
<ul id="search-results"></ul>
</div>
</div>
</div>
</div>
</div>
<script src="search-v2.js"></script>
<script src="searchdata-v2.js" async="async"></script>
<footer><nav>
<div class="m-container">
<div class="m-row">
<div class="m-col-l-10 m-push-l-1">
<p>Taskflow handbook is part of the <a href="https://taskflow.github.io">Taskflow project</a>, copyright © <a href="https://tsung-wei-huang.github.io/">Dr. Tsung-Wei Huang</a>, 2018&ndash;2024.<br />Generated by <a href="https://doxygen.org/">Doxygen</a> 1.9.1 and <a href="https://mcss.mosra.cz/">m.css</a>.</p>
</div>
</div>
</div>
</nav></footer>
</body>
</html>