<?xml version='1.0' encoding='UTF-8' standalone='no'?>
<doxygen xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="compound.xsd" version="1.9.1" xml:lang="en-US">
<compounddef id="ForEachCUDA" kind="page">
<compoundname>ForEachCUDA</compoundname>
<title>Parallel Iterations</title>
<tableofcontents>
<tocsect>
<name>Include the Header</name>
<reference>ForEachCUDA_1CUDAForEachIncludeTheHeader</reference>
</tocsect>
<tocsect>
<name>Index-based Parallel Iterations</name>
<reference>ForEachCUDA_1ForEachCUDAIndexBasedParallelFor</reference>
</tocsect>
<tocsect>
<name>Iterator-based Parallel Iterations</name>
<reference>ForEachCUDA_1ForEachCUDAIteratorBasedParallelIterations</reference>
</tocsect>
<tocsect>
<name>Miscellaneous Items</name>
<reference>ForEachCUDA_1ForEachCUDAMiscellaneousItems</reference>
</tocsect>
</tableofcontents>
<briefdescription>
</briefdescription>
<detaileddescription>
<para><ref refid="classtf_1_1cudaFlow" kindref="compound">tf::cudaFlow</ref> provides two template methods, <ref refid="classtf_1_1cudaFlow_1a1a681f6223853b6445dcfdad07e4d0fd" kindref="member">tf::cudaFlow::for_each</ref> and <ref refid="classtf_1_1cudaFlow_1a34f1ea89e5651faa6e8af522a42556ac" kindref="member">tf::cudaFlow::for_each_index</ref>, for creating tasks to perform parallel iterations over a range of items.</para>
<sect1 id="ForEachCUDA_1CUDAForEachIncludeTheHeader">
<title>Include the Header</title>
<para>You need to include the header file, <computeroutput>taskflow/cuda/algorithm/for_each.hpp</computeroutput>, for creating a parallel-iteration task.</para>
<para><programlisting filename=".cpp"><codeline><highlight class="preprocessor">#include<sp/>&lt;<ref refid="for__each_8hpp" kindref="compound">taskflow/cuda/algorithm/for_each.hpp</ref>&gt;</highlight></codeline>
</programlisting></para>
</sect1>
<sect1 id="ForEachCUDA_1ForEachCUDAIndexBasedParallelFor">
<title>Index-based Parallel Iterations</title>
<para>Index-based parallel-for performs parallel iterations over a range <computeroutput>[first, last)</computeroutput> with the given <computeroutput>step</computeroutput> size. The task created by <ref refid="classtf_1_1cudaFlow_1a34f1ea89e5651faa6e8af522a42556ac" kindref="member">tf::cudaFlow::for_each_index(I first, I last, I step, C callable)</ref> represents a kernel that performs the parallel execution of the following loop:</para>
<para><programlisting filename=".cpp"><codeline><highlight class="comment">//<sp/>positive<sp/>step:<sp/>first,<sp/>first+step,<sp/>first+2*step,<sp/>...</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordflow">for</highlight><highlight class="normal">(</highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>i=first;<sp/>i&lt;last;<sp/>i+=step)<sp/>{</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>callable(i);</highlight></codeline>
<codeline><highlight class="normal">}</highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="comment">//<sp/>negative<sp/>step:<sp/>first,<sp/>first-step,<sp/>first-2*step,<sp/>...</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal"></highlight><highlight class="keywordflow">for</highlight><highlight class="normal">(</highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>i=first;<sp/>i&gt;last;<sp/>i+=step)<sp/>{</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>callable(i);</highlight></codeline>
<codeline><highlight class="normal">}</highlight></codeline>
</programlisting></para>
<para>Each iteration <computeroutput>i</computeroutput> is independent of the others and is assigned one kernel thread to run the callable. Since the callable runs on the GPU, it must be declared with a <computeroutput>__device__</computeroutput> specifier. The following example creates a kernel that assigns each entry of <computeroutput>gpu_data</computeroutput> to 1 over the range <computeroutput>[0, 100)</computeroutput> with a step size of 1.</para>
<para><programlisting filename=".cpp"><codeline><highlight class="comment">//<sp/>assigns<sp/>each<sp/>element<sp/>in<sp/>gpu_data<sp/>to<sp/>1<sp/>over<sp/>the<sp/>range<sp/>[0,<sp/>100)<sp/>with<sp/>step<sp/>size<sp/>1</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">cudaflow.for_each_index(0,<sp/>100,<sp/>1,<sp/>[gpu_data]<sp/>__device__<sp/>(</highlight><highlight class="keywordtype">int</highlight><highlight class="normal"><sp/>idx)<sp/>{</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>gpu_data[idx]<sp/>=<sp/>1;</highlight></codeline>
<codeline><highlight class="normal">});</highlight></codeline>
</programlisting></para>
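<para>For context, the following sketch shows one way such a task can be launched end to end. It is a minimal, hedged example: the device allocation with <computeroutput>cudaMalloc</computeroutput>, the <computeroutput>tf::cudaStream</computeroutput> object, and the <computeroutput>run</computeroutput> call are assumptions about the surrounding program and may need adjustment for your Taskflow version.</para>
<para><programlisting filename=".cpp"><codeline><highlight class="comment">// minimal sketch (assumed surrounding code): allocate device memory,</highlight></codeline>
<codeline><highlight class="comment">// build the cudaFlow, and launch it on a stream</highlight></codeline>
<codeline><highlight class="normal">int* gpu_data {nullptr};</highlight></codeline>
<codeline><highlight class="normal">cudaMalloc(&amp;gpu_data, 100 * sizeof(int));</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">tf::cudaFlow cudaflow;</highlight></codeline>
<codeline><highlight class="normal">cudaflow.for_each_index(0, 100, 1, [gpu_data] __device__ (int idx) {</highlight></codeline>
<codeline><highlight class="normal">  gpu_data[idx] = 1;</highlight></codeline>
<codeline><highlight class="normal">});</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">tf::cudaStream stream;</highlight></codeline>
<codeline><highlight class="normal">cudaflow.run(stream);  </highlight><highlight class="comment">// launch API may differ across Taskflow versions</highlight></codeline>
<codeline><highlight class="normal">stream.synchronize();</highlight></codeline>
<codeline><highlight class="normal">cudaFree(gpu_data);</highlight></codeline>
</programlisting></para>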
</sect1>
<sect1 id="ForEachCUDA_1ForEachCUDAIteratorBasedParallelIterations">
<title>Iterator-based Parallel Iterations</title>
<para>Iterator-based parallel-for performs parallel iterations over a range specified by two STL-styled iterators, <computeroutput>first</computeroutput> and <computeroutput>last</computeroutput>. The task created by <ref refid="classtf_1_1cudaFlow_1a1a681f6223853b6445dcfdad07e4d0fd" kindref="member">tf::cudaFlow::for_each(I first, I last, C callable)</ref> represents a parallel execution of the following loop:</para>
<para><programlisting filename=".cpp"><codeline><highlight class="keywordflow">for</highlight><highlight class="normal">(</highlight><highlight class="keyword">auto</highlight><highlight class="normal"><sp/>i=first;<sp/>i&lt;last;<sp/>i++)<sp/>{</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>callable(*i);</highlight></codeline>
<codeline><highlight class="normal">}</highlight></codeline>
</programlisting></para>
<para>The two iterators, <computeroutput>first</computeroutput> and <computeroutput>last</computeroutput>, are typically two raw pointers to the first element and to one past the last element of the range in GPU memory space. The following example creates a <computeroutput>for_each</computeroutput> kernel that assigns each element in <computeroutput>gpu_data</computeroutput> to 1 over the range <computeroutput>[gpu_data, gpu_data + 1000)</computeroutput>.</para>
<para><programlisting filename=".cpp"><codeline><highlight class="comment">//<sp/>assigns<sp/>each<sp/>element<sp/>to<sp/>1<sp/>over<sp/>the<sp/>range<sp/>[gpu_data,<sp/>gpu_data<sp/>+<sp/>1000)</highlight><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">cudaflow.for_each(gpu_data,<sp/>gpu_data<sp/>+<sp/>1000,<sp/>[]<sp/>__device__<sp/>(</highlight><highlight class="keywordtype">int</highlight><highlight class="normal">&amp;<sp/>item)<sp/>{</highlight></codeline>
<codeline><highlight class="normal"><sp/><sp/>item<sp/>=<sp/>1;</highlight></codeline>
<codeline><highlight class="normal">});<sp/></highlight></codeline>
</programlisting></para>
<para>Each iteration is independent of the others and is assigned one kernel thread to run the callable. Since the callable runs on the GPU, it must be declared with a <computeroutput>__device__</computeroutput> specifier.</para>
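<para>Like other <ref refid="classtf_1_1cudaFlow" kindref="compound">tf::cudaFlow</ref> algorithm methods, <computeroutput>for_each</computeroutput> returns a <computeroutput>tf::cudaTask</computeroutput> handle that can be wired into a larger GPU task graph. The sketch below is illustrative only; the host buffer <computeroutput>host_data</computeroutput> and the trailing copy task are assumptions added for demonstration.</para>
<para><programlisting filename=".cpp"><codeline><highlight class="comment">// sketch: make the parallel-for precede a device-to-host copy of its results</highlight></codeline>
<codeline><highlight class="comment">// (host_data and the copy task are assumed context, not part of this page)</highlight></codeline>
<codeline><highlight class="normal">tf::cudaTask fill = cudaflow.for_each(gpu_data, gpu_data + 1000,</highlight></codeline>
<codeline><highlight class="normal">  [] __device__ (int&amp; item) { item = 1; }</highlight></codeline>
<codeline><highlight class="normal">);</highlight></codeline>
<codeline><highlight class="normal">tf::cudaTask d2h = cudaflow.copy(host_data, gpu_data, 1000);</highlight></codeline>
<codeline><highlight class="normal">fill.precede(d2h);  </highlight><highlight class="comment">// copy back only after every element is written</highlight></codeline>
</programlisting></para>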
</sect1>
<sect1 id="ForEachCUDA_1ForEachCUDAMiscellaneousItems">
<title>Miscellaneous Items</title>
<para>The parallel-iteration algorithms are also available in <ref refid="classtf_1_1cudaFlowCapturer_1a0b2f1bcd59f0b42e0f823818348b4ae7" kindref="member">tf::cudaFlowCapturer::for_each</ref> and <ref refid="classtf_1_1cudaFlowCapturer_1aeb877f42ee3a627c40f1c9c84e31ba3c" kindref="member">tf::cudaFlowCapturer::for_each_index</ref>. </para>
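<para>If your workload must be described through stream capture rather than native CUDA graph nodes, the same call can be issued through a capturer. The snippet below is a hedged sketch of that usage; the capturer object, the stream, and the <computeroutput>run</computeroutput> call are assumed context and may vary across Taskflow versions.</para>
<para><programlisting filename=".cpp"><codeline><highlight class="comment">// sketch: the same iteration expressed through a cudaFlowCapturer</highlight></codeline>
<codeline><highlight class="comment">// (capturer setup and launch are assumed context; APIs may vary by version)</highlight></codeline>
<codeline><highlight class="normal">tf::cudaFlowCapturer capturer;</highlight></codeline>
<codeline><highlight class="normal">capturer.for_each(gpu_data, gpu_data + 1000, [] __device__ (int&amp; item) {</highlight></codeline>
<codeline><highlight class="normal">  item = 1;</highlight></codeline>
<codeline><highlight class="normal">});</highlight></codeline>
<codeline><highlight class="normal"></highlight></codeline>
<codeline><highlight class="normal">tf::cudaStream stream;</highlight></codeline>
<codeline><highlight class="normal">capturer.run(stream);</highlight></codeline>
<codeline><highlight class="normal">stream.synchronize();</highlight></codeline>
</programlisting></para>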
</sect1>
</detaileddescription>
<location file="doxygen/cudaflow_algorithms/cudaflow_for_each.dox"/>
</compounddef>
</doxygen>