namespace tf {

/** @page GPUTaskingcudaFlow GPU Tasking (%cudaFlow)

Modern scientific computing typically leverages
GPU-powered parallel processing cores to speed up large-scale applications.
This chapter discusses how to implement CPU-GPU heterogeneous tasking algorithms
with @NvidiaCUDA.

@tableofcontents

@section GPUTaskingcudaFlowIncludeTheHeader Include the Header

You need to include the header file, `%taskflow/cuda/cudaflow.hpp`,
for creating a GPU task graph using tf::cudaFlow.

@code{.cpp}
#include <taskflow/cuda/cudaflow.hpp>
@endcode

@section WhatIsACudaGraph What is a CUDA Graph?

CUDA %Graph is a new execution model that enables
a series of CUDA kernels to be defined and encapsulated as a single unit,
i.e., a task graph of operations,
rather than a sequence of individually launched operations.
This organization allows launching multiple GPU operations through a single CPU operation,
and hence reduces the launching overheads, especially for kernels of short running time.
The benefit of CUDA %Graph can be demonstrated in the figure below:

@image html images/cuda_graph_benefit.png

In this example, a sequence of short kernels is launched one-by-one by the CPU.
The CPU launching overhead creates a significant gap between the kernels.
If we replace this sequence of kernels with a CUDA graph,
we initially need to spend a little extra time building the graph and
launching the whole graph in one go on the first occasion,
but subsequent executions will be very fast, as there will be very little gap between the kernels.
The difference is more pronounced when the same sequence of operations is repeated many times,
for example, across the many training epochs of a machine learning workload.
In that case, the initial costs of building and launching the graph are amortized
over all training iterations.

@note
For a comprehensive introduction to CUDA %Graph, please refer to
the [CUDA %Graph Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs).
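
To make the execution model concrete, the following is a minimal sketch of the
raw CUDA %Graph workflow using stream capture; @c my_kernel, @c grid, @c block,
and @c num_iterations are hypothetical placeholders, and error checking is omitted:

@code{.cpp}
cudaStream_t stream;
cudaStreamCreate(&stream);

cudaGraph_t graph;
cudaGraphExec_t exec;

// record the launches into a graph instead of executing them immediately
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
my_kernel<<<grid, block, 0, stream>>>(/* args */);
cudaStreamEndCapture(stream, &graph);

// instantiate the executable graph once (CUDA 11-style signature) ...
cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

// ... and launch the whole graph repeatedly with a single CPU call each time
for (int i = 0; i < num_iterations; ++i) {
  cudaGraphLaunch(exec, stream);
}
cudaStreamSynchronize(stream);

cudaGraphExecDestroy(exec);
cudaGraphDestroy(graph);
cudaStreamDestroy(stream);
@endcode

tf::cudaFlow spares you this boilerplate: it constructs and manages the CUDA graph
explicitly from the tasks you create.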

@section Create_a_cudaFlow Create a cudaFlow

%Taskflow leverages @cudaGraph to enable concurrent CPU-GPU tasking
using a task graph model called tf::cudaFlow.
A %cudaFlow manages a CUDA graph explicitly
to execute dependent GPU operations in a single CPU call.
The following example implements a %cudaFlow that performs
a saxpy (A·X Plus Y) workload:

@code{.cpp}
#include <taskflow/cuda/cudaflow.hpp>

#include <iostream>
#include <vector>

// saxpy (single-precision A·X Plus Y) kernel
__global__ void saxpy(int n, float a, float *x, float *y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) {
    y[i] = a*x[i] + y[i];
  }
}

// main function begins
int main() {

  const unsigned N = 1<<20;            // size of the vector

  std::vector<float> hx(N, 1.0f);      // x vector at host
  std::vector<float> hy(N, 2.0f);      // y vector at host

  float *dx{nullptr};                  // x vector at device
  float *dy{nullptr};                  // y vector at device

  cudaMalloc(&dx, N*sizeof(float));
  cudaMalloc(&dy, N*sizeof(float));

  tf::cudaFlow cudaflow;

  // create data transfer tasks
  tf::cudaTask h2d_x = cudaflow.copy(dx, hx.data(), N).name("h2d_x");
  tf::cudaTask h2d_y = cudaflow.copy(dy, hy.data(), N).name("h2d_y");
  tf::cudaTask d2h_x = cudaflow.copy(hx.data(), dx, N).name("d2h_x");
  tf::cudaTask d2h_y = cudaflow.copy(hy.data(), dy, N).name("d2h_y");

  // launch saxpy<<<(N+255)/256, 256, 0>>>(N, 2.0f, dx, dy)
  tf::cudaTask kernel = cudaflow.kernel(
    (N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy
  ).name("saxpy");

  kernel.succeed(h2d_x, h2d_y)
        .precede(d2h_x, d2h_y);

  // run the cudaflow through a stream
  tf::cudaStream stream;
  cudaflow.run(stream);
  stream.synchronize();

  // dump the cudaflow
  cudaflow.dump(std::cout);
}
@endcode

The %cudaFlow graph consists of two CPU-to-GPU data copies (@c h2d_x and @c h2d_y),
one kernel (@c saxpy), and two GPU-to-CPU data copies (@c d2h_x and @c d2h_y),
in this order of their task dependencies.

@dotfile images/saxpy.dot

We do not expend yet another effort on simplifying kernel programming
but focus on tasking CUDA operations and their dependencies.
In other words, tf::cudaFlow is a lightweight C++ abstraction over CUDA %Graph.
This organization lets users fully take advantage of CUDA features
that are commensurate with their domain knowledge,
while leaving difficult task parallelism details to %Taskflow.

@section Compile_a_cudaFlow_program Compile a cudaFlow Program

Use @nvcc to compile a %cudaFlow program:

@code{.shell-session}
~$ nvcc -std=c++17 my_cudaflow.cu -I path/to/include/taskflow -O2 -o my_cudaflow
~$ ./my_cudaflow
@endcode

Please visit the page @ref CompileTaskflowWithCUDA for more details.

@section run_a_cudaflow_on_a_specific_gpu Run a cudaFlow on a Specific GPU

By default, a %cudaFlow runs on the current GPU context associated with the caller,
which is typically GPU @c 0.
Each CUDA GPU has an integer identifier in the range of <tt>[0, N)</tt>
to represent the context of that GPU,
where @c N is the number of GPUs in the system.
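If needed, you can query this number through the CUDA runtime;
a minimal sketch using the standard @c cudaGetDeviceCount call:

@code{.cpp}
int num_gpus{0};
cudaGetDeviceCount(&num_gpus);  // valid GPU identifiers are 0, 1, ..., num_gpus-1
@endcode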

You can run a %cudaFlow on a specific GPU by switching the context to a different GPU
using tf::cudaScopedDevice.
The code below creates a %cudaFlow and runs it on GPU @c 2.

@code{.cpp}
{
  // create an RAII-styled switcher to the context of GPU 2
  tf::cudaScopedDevice context(2);

  // create a cudaFlow under GPU 2
  tf::cudaFlow cudaflow;
  // ...

  // create a stream under GPU 2 and run the cudaflow on that GPU
  tf::cudaStream stream;
  cudaflow.run(stream);
  stream.synchronize();
}
@endcode

tf::cudaScopedDevice is an RAII-styled wrapper that performs a @em scoped switch
to the given GPU context.
When the scope is destroyed, it switches back to the original context.

@attention
tf::cudaScopedDevice allows you to place a %cudaFlow on a particular GPU device,
but it is your responsibility to ensure correct memory access.
For example, you may not allocate a memory block on GPU @c 2 while
accessing it from a kernel on GPU @c 0.
An easy practice for multi-GPU programming is to allocate <i>unified shared memory</i> using @c cudaMallocManaged
and let the CUDA runtime perform automatic memory migration between GPUs.
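
As a concrete illustration, the sketch below allocates one managed buffer
and touches it from two different GPU contexts;
it assumes the system has at least two CUDA devices, and error checking is omitted:

@code{.cpp}
const unsigned N = 1<<20;
float* data{nullptr};

// managed memory is visible to the CPU and to every GPU
cudaMallocManaged(&data, N*sizeof(float));

{
  tf::cudaScopedDevice context(0);
  // ... run a cudaFlow on GPU 0 that writes to data
}

{
  tf::cudaScopedDevice context(1);
  // ... run a cudaFlow on GPU 1 that reads data;
  // the CUDA runtime migrates pages between GPUs on demand
}

cudaFree(data);
@endcode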

@section GPUMemoryOperations Create Memory Operation Tasks

%cudaFlow provides a set of methods for users to manipulate device memory.
These methods fall into two categories: @em raw data and @em typed data operations.
Raw data operations are methods with prefix @c mem, such as @c memcpy and @c memset,
that operate in @em bytes.
Typed data operations, such as @c copy, @c fill, and @c zero,
take a <i>logical count</i> of elements.
For instance, the following three methods have the same result of zeroing
<tt>sizeof(int)*count</tt> bytes of the device memory area pointed to by @c target.

@code{.cpp}
int* target;
cudaMalloc(&target, count*sizeof(int));

tf::cudaFlow cudaflow;
tf::cudaTask memset_target       = cudaflow.memset(target, 0, sizeof(int) * count);
tf::cudaTask same_as_above       = cudaflow.fill(target, 0, count);
tf::cudaTask same_as_above_again = cudaflow.zero(target, count);
@endcode

The method tf::cudaFlow::fill is a more powerful variant of tf::cudaFlow::memset.
It can fill a memory area with any value of type @c T,
given that <tt>sizeof(T)</tt> is 1, 2, or 4 bytes.
The following example creates a GPU task to fill @c count elements
in the array @c target with value @c 1234.

@code{.cpp}
cudaflow.fill(target, 1234, count);
@endcode

A similar concept applies to tf::cudaFlow::memcpy and tf::cudaFlow::copy as well.
The following two methods are equivalent to each other.

@code{.cpp}
cudaflow.memcpy(target, source, sizeof(int) * count);
cudaflow.copy(target, source, count);
@endcode

@section OffloadAcudaFlow Offload a cudaFlow

To offload a %cudaFlow to a GPU, you need to use tf::cudaFlow::run
and pass a tf::cudaStream created on that GPU.
The run method is asynchronous and can be explicitly synchronized
through the given stream.

@code{.cpp}
tf::cudaStream stream;
// launch a cudaflow asynchronously through a stream
cudaflow.run(stream);
// wait for the cudaflow to finish
stream.synchronize();
@endcode

When you offload a %cudaFlow using tf::cudaFlow::run,
the runtime transforms that %cudaFlow (i.e., the application GPU task graph)
into a native executable instance and submits it to the CUDA runtime for execution.
There is always a one-to-one mapping between
a %cudaFlow and its native CUDA graph representation (except those constructed
using tf::cudaFlowCapturer).

@section UpdateAcudaFlow Update a cudaFlow

Many GPU applications require you to launch a %cudaFlow multiple times
and update node parameters (e.g., kernel parameters and memory addresses)
between iterations.
%cudaFlow allows you to update the parameters of created tasks
and run the updated %cudaFlow with the new parameters.
Every task-creation method in tf::cudaFlow has an overload
to update the parameters of a task created by that method.

@code{.cpp}
tf::cudaStream stream;
tf::cudaFlow cf;

// create a kernel task
tf::cudaTask task = cf.kernel(grid1, block1, shm1, kernel, kernel_args_1);
cf.run(stream);
stream.synchronize();

// update the created kernel task with different parameters
cf.kernel(task, grid2, block2, shm2, kernel, kernel_args_2);
cf.run(stream);
stream.synchronize();
@endcode

Between successive offloads (i.e., iterative executions of a %cudaFlow),
you can @em ONLY update task parameters,
such as changing the kernel execution parameters and memory operation parameters.
However, you must @em NOT change the topology of the %cudaFlow,
such as adding a new task or adding a new dependency.
This is a limitation of CUDA %Graph.

@attention
There are a few restrictions on updating task parameters in a %cudaFlow.
Notably, you must @em NOT change the topology of an offloaded graph.
In addition, update methods have the following limitations:
+ Kernel task
  + The kernel function is not allowed to change. This restriction applies to all algorithm tasks that are created using a lambda.
+ Memset and memcpy tasks
  + The CUDA device(s) to which the operand(s) was allocated/mapped cannot change.
  + The source/destination memory must be allocated from the same
    contexts as the original source/destination memory.
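
Within these rules, you can, for example, update a memset task to clear a
different number of bytes of the same device buffer between runs.
A minimal sketch, assuming the update overload of tf::cudaFlow::memset
mirrors its creation overload with a leading tf::cudaTask argument:

@code{.cpp}
tf::cudaStream stream;
tf::cudaFlow cf;

int* target{nullptr};
cudaMalloc(&target, 1024*sizeof(int));

// create a memset task that zeroes the first half of the buffer
tf::cudaTask task = cf.memset(target, 0, 512*sizeof(int));
cf.run(stream);
stream.synchronize();

// update the same task to zero the whole buffer; the graph topology is unchanged
cf.memset(task, target, 0, 1024*sizeof(int));
cf.run(stream);
stream.synchronize();
@endcode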

@section IntegrateCudaFlowIntoTaskflow Integrate a cudaFlow into Taskflow

You can create a task to enclose a %cudaFlow and run it from a worker thread.
The usage of the %cudaFlow remains the same, except that the %cudaFlow is run by a worker thread
from a taskflow task.
The following example runs a %cudaFlow from a static task:

@code{.cpp}
tf::Executor executor;
tf::Taskflow taskflow;

taskflow.emplace([](){
  // create a cudaFlow inside a static task
  tf::cudaFlow cudaflow;

  // ... create a kernel task
  cudaflow.kernel(...);

  // run the cudaflow through a stream
  tf::cudaStream stream;
  cudaflow.run(stream);
  stream.synchronize();
});
@endcode
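
In a real application, such a GPU task usually sits between CPU tasks.
The sketch below wires the %cudaFlow task into a small CPU-GPU pipeline;
the task names are hypothetical:

@code{.cpp}
tf::Executor executor;
tf::Taskflow taskflow;

tf::Task preprocess = taskflow.emplace([](){
  // CPU work that prepares the host data
});

tf::Task gpu_work = taskflow.emplace([](){
  tf::cudaFlow cudaflow;
  // ... create copy and kernel tasks
  tf::cudaStream stream;
  cudaflow.run(stream);
  stream.synchronize();
});

tf::Task postprocess = taskflow.emplace([](){
  // CPU work that consumes the results
});

preprocess.precede(gpu_work);
gpu_work.precede(postprocess);

executor.run(taskflow).wait();
@endcode
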
*/

}