namespace tf {

/** @page CompileTaskflowWithCUDA Compile Taskflow with CUDA

@tableofcontents

@section InstallCUDACompiler Install CUDA Compiler

To compile %Taskflow with CUDA code, you need the @c nvcc compiler.
Please visit the official
<a href="https://developer.nvidia.com/cuda-downloads">CUDA Toolkit download page</a>
to install it.
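
Once installed, you can quickly verify that @c nvcc is visible in your
environment before building anything (a minimal sanity check; the reported
version depends on your installation, and %Taskflow requires C++17 support):

@code{.shell-session}
~$ nvcc --version
@endcode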

@section CompileTaskflowWithCUDADirectly Compile Source Code Directly

%Taskflow's GPU programming interface for CUDA is tf::cudaFlow.
Consider the following @c simple.cu program that launches a single kernel
function to output a message:

@code{.cpp}
#include <taskflow/taskflow.hpp>
#include <taskflow/cudaflow.hpp>
#include <taskflow/cuda/for_each.hpp>

int main(int argc, const char** argv) {

  tf::Executor executor;
  tf::Taskflow taskflow;

  tf::Task task1 = taskflow.emplace([](){}).name("cpu task");
  tf::Task task2 = taskflow.emplace([](){
    // create a cudaFlow of a single-threaded task
    tf::cudaFlow cf;
    cf.single_task([] __device__ () { printf("hello cudaFlow!\n"); });

    // launch the cudaflow through a stream
    tf::cudaStream stream;
    cf.run(stream);
    stream.synchronize();
  }).name("gpu task");

  task1.precede(task2);

  executor.run(taskflow).wait();
  return 0;
}
@endcode

The easiest way to compile %Taskflow with CUDA code (e.g., %cudaFlow, kernels)
is to use @c nvcc:

@code{.shell-session}
~$ nvcc -std=c++17 -I path/to/taskflow/ --extended-lambda simple.cu -o simple
~$ ./simple
hello cudaFlow!
@endcode

@section CompileTaskflowWithCUDASeparately Compile Source Code Separately

Large GPU applications often compile a program into separate objects
and link them together to form an executable or a library.
With %Taskflow, you can compile your CPU code and GPU code separately,
using @c nvcc for the GPU code and another compiler (such as @c g++ or
@c clang++) for the CPU code.
Consider the following example that defines two tasks
in two separate source files (@c main.cpp and @c cudaflow.cpp):

@code{.cpp}
// main.cpp
#include <iostream>
#include <taskflow/taskflow.hpp>

tf::Task make_cudaflow(tf::Taskflow& taskflow);  // create a cudaFlow task

int main() {

  tf::Executor executor;
  tf::Taskflow taskflow;

  tf::Task task1 = taskflow.emplace([](){ std::cout << "main.cpp!\n"; })
                           .name("cpu task");
  tf::Task task2 = make_cudaflow(taskflow);

  task1.precede(task2);

  executor.run(taskflow).wait();

  return 0;
}
@endcode

@code{.cpp}
// cudaflow.cpp
#include <taskflow/taskflow.hpp>
#include <taskflow/cudaflow.hpp>
#include <taskflow/cuda/for_each.hpp>

tf::Task make_cudaflow(tf::Taskflow& taskflow) {
  return taskflow.emplace([](){
    // create a cudaFlow of a single-threaded task
    tf::cudaFlow cf;
    cf.single_task([] __device__ () { printf("cudaflow.cpp!\n"); });

    // launch the cudaflow through a stream
    tf::cudaStream stream;
    cf.run(stream);
    stream.synchronize();
  }).name("gpu task");
}
@endcode

Compile each source to an object (@c g++ as an example):

@code{.shell-session}
~$ g++ -std=c++17 -I path/to/taskflow -c main.cpp -o main.o
~$ nvcc -std=c++17 --extended-lambda -x cu -I path/to/taskflow \
   -dc cudaflow.cpp -o cudaflow.o
~$ ls
# now we have the two compiled .o objects, main.o and cudaflow.o
main.o  cudaflow.o
@endcode

The @c --extended-lambda option tells @c nvcc to generate GPU code
for lambdas annotated with <tt>__device__</tt>.
The <tt>-x cu</tt> option tells @c nvcc to treat the input files as @c .cu files
containing both CPU and GPU code.
By default, @c nvcc treats @c .cpp files as CPU-only code.
This option is required to have @c nvcc generate device code here,
but it is also a handy way to avoid renaming source files in larger projects.
The @c -dc option tells @c nvcc to compile the input into a relocatable object
containing device code for later linking.

You may also need to specify the target architecture, using the @c -arch
option to tell @c nvcc which SM architecture to generate code for.
For instance, the following command requires
device code linking to have compute capability 7.5 or later:

@code{.shell-session}
~$ nvcc -std=c++17 --extended-lambda -x cu -arch=sm_75 -I path/to/taskflow \
   -dc cudaflow.cpp -o cudaflow.o
@endcode
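
If your application needs to run across several GPU generations, @c nvcc can
embed code for multiple architectures in a single object through repeated
@c -gencode options. The following is a sketch only; the architectures shown
(@c sm_75 and @c sm_86) are examples, so pick the ones that match your
deployment targets:

@code{.shell-session}
~$ nvcc -std=c++17 --extended-lambda -x cu -I path/to/taskflow \
   -gencode arch=compute_75,code=sm_75 \
   -gencode arch=compute_86,code=sm_86 \
   -dc cudaflow.cpp -o cudaflow.o
@endcode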

@subsection CompileTaskflowWithCUDANaiveLinking Link Objects Using nvcc

Linking the compiled objects with @c nvcc is nothing special:
simply replace your normal host compiler with @c nvcc,
and it takes care of all the necessary device-linking steps:

@code{.shell-session}
~$ nvcc main.o cudaflow.o -o main

# run the main program
~$ ./main
main.cpp!
cudaflow.cpp!
@endcode

@subsection CompileTaskflowWithCUDADifferentLinkers Link Objects Using Different Linkers

You can choose to use a compiler other than @c nvcc for the final link step.
Since your CPU compiler does not know how to link CUDA device code,
you have to add a step in your build to have @c nvcc link the CUDA device code,
using the option @c -dlink:

@code{.shell-session}
~$ nvcc -o gpuCode.o -dlink main.o cudaflow.o
@endcode

This step links all the <i>device object code</i> and places it into @c gpuCode.o.

@note
This step does not link the CPU object code;
the CPU code in @c main.o and @c cudaflow.o is discarded.
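
If you want to confirm that @c gpuCode.o actually carries the linked device
code, the CUDA toolkit ships a @c cuobjdump utility you can point at the
object (a quick, optional check; its output varies with your toolkit version
and target architecture):

@code{.shell-session}
# list the device ELF images embedded in gpuCode.o
~$ cuobjdump --list-elf gpuCode.o
@endcode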

To complete the link to an executable, you can use, for example, @c ld or @c g++.

@code{.shell-session}
# replace /usr/local/cuda/lib64 with your own CUDA library installation path
~$ g++ -pthread gpuCode.o main.o cudaflow.o \
   -L /usr/local/cuda/lib64/ -lcudart -o main

# run the main program
~$ ./main
main.cpp!
cudaflow.cpp!
@endcode

We give @c g++ all of the objects again because it needs the CPU object code,
which is not in @c gpuCode.o.
The device code stored in the original objects, @c main.o and @c cudaflow.o,
does not conflict with the code in @c gpuCode.o.
@c g++ ignores device code because it does not know how to link it,
and the device code in @c gpuCode.o is already linked and ready to go.

@note
This intentional ignorance is extremely useful in large builds
where intermediate objects may have both CPU and GPU code.
In this case, we just let the GPU and CPU linkers each do its own job,
noting that the CPU linker is always the last one we run.
The CUDA Runtime API library is automatically linked when we use @c nvcc for linking,
but we must explicitly link it (@c -lcudart) when using another linker.
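
Putting it all together, the complete separate-compilation pipeline of this
section looks like the following recap (the same commands as above; adjust
the include path, CUDA library path, and architecture flags to your setup):

@code{.shell-session}
# 1. compile the CPU-only source with the host compiler
~$ g++ -std=c++17 -I path/to/taskflow -c main.cpp -o main.o

# 2. compile the GPU source with nvcc into a relocatable device object
~$ nvcc -std=c++17 --extended-lambda -x cu -I path/to/taskflow \
   -dc cudaflow.cpp -o cudaflow.o

# 3. link the device code into gpuCode.o
~$ nvcc -o gpuCode.o -dlink main.o cudaflow.o

# 4. link everything into the final executable with the host compiler
~$ g++ -pthread gpuCode.o main.o cudaflow.o \
   -L /usr/local/cuda/lib64/ -lcudart -o main
@endcode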

*/

}