139 lines
14 KiB
HTML
139 lines
14 KiB
HTML
|
<!DOCTYPE html>
|
||
|
<html lang="en">
|
||
|
<head>
|
||
|
<meta charset="UTF-8" />
|
||
|
<title>CUDA Standard Algorithms » Parallel Iterations | Taskflow QuickStart</title>
|
||
|
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Source+Sans+Pro:400,400i,600,600i%7CSource+Code+Pro:400,400i,600" />
|
||
|
<link rel="stylesheet" href="m-dark+documentation.compiled.css" />
|
||
|
<link rel="icon" href="favicon.ico" type="image/vnd.microsoft.icon" />
|
||
|
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||
|
<meta name="theme-color" content="#22272e" />
|
||
|
</head>
|
||
|
<body>
|
||
|
<header><nav id="navigation">
|
||
|
<div class="m-container">
|
||
|
<div class="m-row">
|
||
|
<span id="m-navbar-brand" class="m-col-t-8 m-col-m-none m-left-m">
|
||
|
<a href="https://taskflow.github.io"><img src="taskflow_logo.png" alt="" />Taskflow</a> <span class="m-breadcrumb">|</span> <a href="index.html" class="m-thin">QuickStart</a>
|
||
|
</span>
|
||
|
<div class="m-col-t-4 m-hide-m m-text-right m-nopadr">
|
||
|
<a href="#search" class="m-doc-search-icon" title="Search" onclick="return showSearch()"><svg style="height: 0.9rem;" viewBox="0 0 16 16">
|
||
|
<path id="m-doc-search-icon-path" d="m6 0c-3.31 0-6 2.69-6 6 0 3.31 2.69 6 6 6 1.49 0 2.85-0.541 3.89-1.44-0.0164 0.338 0.147 0.759 0.5 1.15l3.22 3.79c0.552 0.614 1.45 0.665 2 0.115 0.55-0.55 0.499-1.45-0.115-2l-3.79-3.22c-0.392-0.353-0.812-0.515-1.15-0.5 0.895-1.05 1.44-2.41 1.44-3.89 0-3.31-2.69-6-6-6zm0 1.56a4.44 4.44 0 0 1 4.44 4.44 4.44 4.44 0 0 1-4.44 4.44 4.44 4.44 0 0 1-4.44-4.44 4.44 4.44 0 0 1 4.44-4.44z"/>
|
||
|
</svg></a>
|
||
|
<a id="m-navbar-show" href="#navigation" title="Show navigation"></a>
|
||
|
<a id="m-navbar-hide" href="#" title="Hide navigation"></a>
|
||
|
</div>
|
||
|
<div id="m-navbar-collapse" class="m-col-t-12 m-show-m m-col-m-none m-right-m">
|
||
|
<div class="m-row">
|
||
|
<ol class="m-col-t-6 m-col-m-none">
|
||
|
<li><a href="pages.html">Handbook</a></li>
|
||
|
<li><a href="namespaces.html">Namespaces</a></li>
|
||
|
</ol>
|
||
|
<ol class="m-col-t-6 m-col-m-none" start="3">
|
||
|
<li><a href="annotated.html">Classes</a></li>
|
||
|
<li><a href="files.html">Files</a></li>
|
||
|
<li class="m-show-m"><a href="#search" class="m-doc-search-icon" title="Search" onclick="return showSearch()"><svg style="height: 0.9rem;" viewBox="0 0 16 16">
|
||
|
<use href="#m-doc-search-icon-path" />
|
||
|
</svg></a></li>
|
||
|
</ol>
|
||
|
</div>
|
||
|
</div>
|
||
|
</div>
|
||
|
</div>
|
||
|
</nav></header>
|
||
|
<main><article>
|
||
|
<div class="m-container m-container-inflatable">
|
||
|
<div class="m-row">
|
||
|
<div class="m-col-l-10 m-push-l-1">
|
||
|
<h1>
|
||
|
<span class="m-breadcrumb"><a href="cudaStandardAlgorithms.html">CUDA Standard Algorithms</a> »</span>
|
||
|
Parallel Iterations
|
||
|
</h1>
|
||
|
<nav class="m-block m-default">
|
||
|
<h3>Contents</h3>
|
||
|
<ul>
|
||
|
<li><a href="#CUDASTDParallelIterationIncludeTheHeader">Include the Header</a></li>
|
||
|
<li><a href="#CUDASTDIndexBasedParallelFor">Index-based Parallel Iterations</a></li>
|
||
|
<li><a href="#CUDASTDIteratorBasedParallelFor">Iterator-based Parallel Iterations</a></li>
|
||
|
</ul>
|
||
|
</nav>
|
||
|
<p>Taskflow provides standard template methods for performing parallel iterations over a range of items a CUDA GPU.</p><section id="CUDASTDParallelIterationIncludeTheHeader"><h2><a href="#CUDASTDParallelIterationIncludeTheHeader">Include the Header</a></h2><p>You need to include the header file, <code>taskflow/cuda/algorithm/for_each.hpp</code>, for using the parallel-iteration algorithm.</p><pre class="m-code"><span class="cp">#include</span><span class="w"> </span><span class="cpf"><taskflow/cuda/algorithm/for_each.hpp></span></pre></section><section id="CUDASTDIndexBasedParallelFor"><h2><a href="#CUDASTDIndexBasedParallelFor">Index-based Parallel Iterations</a></h2><p>Index-based parallel-for performs parallel iterations over a range <code>[first, last)</code> with the given <code>step</code> size. The task created by <a href="namespacetf.html#a01ad7ce62fa6f42f2f2fbff3659b7884" class="m-doc">tf::<wbr />cuda_for_each_index</a> represents a kernel of parallel execution for the following loop:</p><pre class="m-code"><span class="c1">// positive step: first, first+step, first+2*step, ...</span>
|
||
|
<span class="k">for</span><span class="p">(</span><span class="k">auto</span><span class="w"> </span><span class="n">i</span><span class="o">=</span><span class="n">first</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o"><</span><span class="n">last</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">+=</span><span class="n">step</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
|
||
|
<span class="w"> </span><span class="n">callable</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
|
||
|
<span class="p">}</span>
|
||
|
<span class="c1">// negative step: first, first-step, first-2*step, ...</span>
|
||
|
<span class="k">for</span><span class="p">(</span><span class="k">auto</span><span class="w"> </span><span class="n">i</span><span class="o">=</span><span class="n">first</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">></span><span class="n">last</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">+=</span><span class="n">step</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
|
||
|
<span class="w"> </span><span class="n">callable</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
|
||
|
<span class="p">}</span></pre><p>Each iteration <code>i</code> is independent of each other and is assigned one kernel thread to run the callable. The following example creates a kernel that assigns each entry of <code>data</code> to 1 over the range [0, 100) with step size 1.</p><pre class="m-code"><span class="n">tf</span><span class="o">::</span><span class="n">cudaDefaultExecutionPolicy</span><span class="w"> </span><span class="n">policy</span><span class="p">;</span>
|
||
|
<span class="k">auto</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cuda_malloc_shared</span><span class="o"><</span><span class="kt">int</span><span class="o">></span><span class="p">(</span><span class="mi">100</span><span class="p">);</span>
|
||
|
|
||
|
<span class="c1">// assigns each element in data to 1 over the range [0, 100) with step size 1</span>
|
||
|
<span class="n">tf</span><span class="o">::</span><span class="n">cuda_for_each_index</span><span class="p">(</span>
|
||
|
<span class="w"> </span><span class="n">policy</span><span class="p">,</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="mi">100</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="p">[</span><span class="n">data</span><span class="p">]</span><span class="w"> </span><span class="n">__device__</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">idx</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">data</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w"> </span><span class="p">}</span>
|
||
|
<span class="p">);</span>
|
||
|
|
||
|
<span class="c1">// synchronize the execution</span>
|
||
|
<span class="n">policy</span><span class="p">.</span><span class="n">synchronize</span><span class="p">();</span></pre><p>The parallel-iteration algorithm runs <em>asynchronously</em> through the stream specified in the execution policy. You need to synchronize the stream to obtain correct results.</p></section><section id="CUDASTDIteratorBasedParallelFor"><h2><a href="#CUDASTDIteratorBasedParallelFor">Iterator-based Parallel Iterations</a></h2><p>Iterator-based parallel-for performs parallel iterations over a range specified by two STL-styled iterators, <code>first</code> and <code>last</code>. The task created by <a href="namespacetf.html#a7c449cec0b93503b8280d05add35e9f4" class="m-doc">tf::<wbr />cuda_for_each</a> represents a parallel execution of the following loop:</p><pre class="m-code"><span class="k">for</span><span class="p">(</span><span class="k">auto</span><span class="w"> </span><span class="n">i</span><span class="o">=</span><span class="n">first</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o"><</span><span class="n">last</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
|
||
|
<span class="w"> </span><span class="n">callable</span><span class="p">(</span><span class="o">*</span><span class="n">i</span><span class="p">);</span>
|
||
|
<span class="p">}</span></pre><p>The two iterators, <code>first</code> and <code>last</code>, are typically two raw pointers to the first element and the next to the last element in the range in GPU memory space. The following example creates a <code>for_each</code> kernel that assigns each element in <code>gpu_data</code> to 1 over the range <code>[data, data + 1000)</code>.</p><pre class="m-code"><span class="n">tf</span><span class="o">::</span><span class="n">cudaDefaultExecutionPolicy</span><span class="w"> </span><span class="n">policy</span><span class="p">;</span>
|
||
|
<span class="k">auto</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">cuda_malloc_shared</span><span class="o"><</span><span class="kt">int</span><span class="o">></span><span class="p">(</span><span class="mi">1000</span><span class="p">);</span>
|
||
|
|
||
|
<span class="c1">// assigns each element in data to 1 over the range [0, 1000) with step size 1</span>
|
||
|
<span class="n">tf</span><span class="o">::</span><span class="n">cuda_for_each</span><span class="p">(</span>
|
||
|
<span class="w"> </span><span class="n">policy</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1000</span><span class="p">,</span><span class="w"> </span><span class="p">[]</span><span class="w"> </span><span class="n">__device__</span><span class="w"> </span><span class="p">(</span><span class="kt">int</span><span class="o">&</span><span class="w"> </span><span class="n">item</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">item</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w"> </span><span class="p">}</span>
|
||
|
<span class="p">);</span>
|
||
|
|
||
|
<span class="c1">// synchronize the execution</span>
|
||
|
<span class="n">policy</span><span class="p">.</span><span class="n">synchronize</span><span class="p">();</span></pre><p>Each iteration is independent of each other and is assigned one kernel thread to run the callable. Since the callable runs on GPU, it must be declared with a <code>__device__</code> specifier.</p></section>
|
||
|
</div>
|
||
|
</div>
|
||
|
</div>
|
||
|
</article></main>
|
||
|
<div class="m-doc-search" id="search">
|
||
|
<a href="#!" onclick="return hideSearch()"></a>
|
||
|
<div class="m-container">
|
||
|
<div class="m-row">
|
||
|
<div class="m-col-m-8 m-push-m-2">
|
||
|
<div class="m-doc-search-header m-text m-small">
|
||
|
<div><span class="m-label m-default">Tab</span> / <span class="m-label m-default">T</span> to search, <span class="m-label m-default">Esc</span> to close</div>
|
||
|
<div id="search-symbolcount">…</div>
|
||
|
</div>
|
||
|
<div class="m-doc-search-content">
|
||
|
<form>
|
||
|
<input type="search" name="q" id="search-input" placeholder="Loading …" disabled="disabled" autofocus="autofocus" autocomplete="off" spellcheck="false" />
|
||
|
</form>
|
||
|
<noscript class="m-text m-danger m-text-center">Unlike everything else in the docs, the search functionality <em>requires</em> JavaScript.</noscript>
|
||
|
<div id="search-help" class="m-text m-dim m-text-center">
|
||
|
<p class="m-noindent">Search for symbols, directories, files, pages or
|
||
|
modules. You can omit any prefix from the symbol or file path; adding a
|
||
|
<code>:</code> or <code>/</code> suffix lists all members of given symbol or
|
||
|
directory.</p>
|
||
|
<p class="m-noindent">Use <span class="m-label m-dim">↓</span>
|
||
|
/ <span class="m-label m-dim">↑</span> to navigate through the list,
|
||
|
<span class="m-label m-dim">Enter</span> to go.
|
||
|
<span class="m-label m-dim">Tab</span> autocompletes common prefix, you can
|
||
|
copy a link to the result using <span class="m-label m-dim">⌘</span>
|
||
|
<span class="m-label m-dim">L</span> while <span class="m-label m-dim">⌘</span>
|
||
|
<span class="m-label m-dim">M</span> produces a Markdown link.</p>
|
||
|
</div>
|
||
|
<div id="search-notfound" class="m-text m-warning m-text-center">Sorry, nothing was found.</div>
|
||
|
<ul id="search-results"></ul>
|
||
|
</div>
|
||
|
</div>
|
||
|
</div>
|
||
|
</div>
|
||
|
</div>
|
||
|
<script src="search-v2.js"></script>
|
||
|
<script src="searchdata-v2.js" async="async"></script>
|
||
|
<footer><nav>
|
||
|
<div class="m-container">
|
||
|
<div class="m-row">
|
||
|
<div class="m-col-l-10 m-push-l-1">
|
||
|
<p>Taskflow handbook is part of the <a href="https://taskflow.github.io">Taskflow project</a>, copyright © <a href="https://tsung-wei-huang.github.io/">Dr. Tsung-Wei Huang</a>, 2018–2024.<br />Generated by <a href="https://doxygen.org/">Doxygen</a> 1.9.1 and <a href="https://mcss.mosra.cz/">m.css</a>.</p>
|
||
|
</div>
|
||
|
</div>
|
||
|
</div>
|
||
|
</nav></footer>
|
||
|
</body>
|
||
|
</html>
|