# Graph Processing Pipeline

We study a graph processing pipeline that propagates a sequence of linearly dependent tasks over a dependency graph. In this particular workload, we will learn how to transform task graph parallelism into pipeline parallelism.

## Formulate the Graph Processing Pipeline Problem

We are given a directed acyclic graph (DAG) in which each node encapsulates a sequence of linearly dependent tasks, called stage tasks, and each edge represents a dependency between two tasks at the same stage of adjacent nodes. For example, assuming fi(u) represents the i-th stage task of node u, a dependency from u to v requires fi(u) to run before fi(v). The following figure shows an example of three stage tasks in a DAG of three nodes (A, B, and C) and two dependencies (A->B and A->C):

[figure: three stage tasks over a DAG of nodes A, B, and C with edges A->B and A->C]

While we can directly create a taskflow for the DAG (i.e., each task in the taskflow runs f1, f2, and f3 sequentially), we can instead describe the parallelism as a three-stage pipeline that propagates a topological order of the DAG through the three stage tasks. Consider a valid topological order of this DAG, A, B, C; its pipeline parallelism can be illustrated in the following figure:

[figure: pipeline parallelism over the topological order A, B, C]

At the beginning, f1(A) runs first. When f1(A) completes, it moves on to f2(A), and meanwhile f1(B) can start to run together with f2(A), and so forth. In the figure, a straight line connects two parallel tasks that can overlap in time in the pipeline. For example, f3(A), f2(B), and f1(C) can run simultaneously. The following figure shows the task dependency graph of this pipeline workload:

[figure: task dependency graph of the three-stage pipeline]

As we can see, tasks along the diagonals (lower-left to upper-right) can run in parallel. This type of parallelism is also referred to as wavefront parallelism, which sweeps parallel elements in a diagonal direction.

Depending on the graph size and the number of stage tasks, task graph parallelism and pipeline parallelism can bring very different performance results. For example, a small graph with a long chain of stage tasks may perform better with pipeline parallelism than task graph parallelism, and vice versa.

## Create a Graph Processing Pipeline

Using the example from the previous section, we create a three-stage pipeline that encapsulates the three stage tasks (f1, f2, f3) in three pipes. By finding a topological order of the graph, we can transform the node dependencies into a sequence of linearly dependent data tokens to feed into the pipeline.
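The implementation below hard-codes the topological order A, B, C. For a general DAG, one standard way to compute a valid order is Kahn's algorithm; the following is a minimal sketch, assuming the graph arrives as a node list and an edge list (the function name `topological_order` and the data layout are illustrative, not part of this example or the Taskflow API):

```cpp
#include <queue>
#include <string>
#include <unordered_map>
#include <vector>

// Kahn's algorithm: repeatedly emit a node whose in-degree is zero,
// then decrement the in-degree of each of its successors
std::vector<std::string> topological_order(
  const std::vector<std::string>& nodes,
  const std::vector<std::pair<std::string, std::string>>& edges
) {
  std::unordered_map<std::string, std::vector<std::string>> successors;
  std::unordered_map<std::string, size_t> in_degree;
  for(const auto& n : nodes) {
    in_degree[n] = 0;
  }
  for(const auto& [u, v] : edges) {
    successors[u].push_back(v);
    ++in_degree[v];
  }
  // seed the queue with all nodes that have no predecessors
  std::queue<std::string> ready;
  for(const auto& n : nodes) {
    if(in_degree[n] == 0) {
      ready.push(n);
    }
  }
  std::vector<std::string> order;
  while(!ready.empty()) {
    auto u = ready.front();
    ready.pop();
    order.push_back(u);
    for(const auto& v : successors[u]) {
      if(--in_degree[v] == 0) {
        ready.push(v);
      }
    }
  }
  // for a DAG, order.size() == nodes.size(); anything shorter implies a cycle
  return order;
}
```

For the example graph, `topological_order({"A", "B", "C"}, {{"A", "B"}, {"A", "C"}})` yields A, B, C (or A, C, B, which is equally valid).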
The overall implementation is shown below:

```cpp
#include <taskflow/taskflow.hpp>
#include <taskflow/algorithm/pipeline.hpp>

// 1st-stage function
void f1(const std::string& node) {
  printf("f1(%s)\n", node.c_str());
}

// 2nd-stage function
void f2(const std::string& node) {
  printf("f2(%s)\n", node.c_str());
}

// 3rd-stage function
void f3(const std::string& node) {
  printf("f3(%s)\n", node.c_str());
}

int main() {

  tf::Taskflow taskflow("graph processing pipeline");
  tf::Executor executor;

  const size_t num_lines = 2;

  // a topological order of the graph
  //    |-> B
  // A--|
  //    |-> C
  const std::vector<std::string> nodes = {"A", "B", "C"};

  // the pipeline consists of three serial pipes
  // and up to two concurrent scheduling tokens
  tf::Pipeline pl(num_lines,

    // first pipe calls f1
    tf::Pipe{tf::PipeType::SERIAL, [&](tf::Pipeflow& pf) {
      if(pf.token() == nodes.size()) {
        pf.stop();
      }
      else {
        f1(nodes[pf.token()]);
      }
    }},

    // second pipe calls f2
    tf::Pipe{tf::PipeType::SERIAL, [&](tf::Pipeflow& pf) {
      f2(nodes[pf.token()]);
    }},

    // third pipe calls f3
    tf::Pipe{tf::PipeType::SERIAL, [&](tf::Pipeflow& pf) {
      f3(nodes[pf.token()]);
    }}
  );

  // build the pipeline graph using composition
  tf::Task init = taskflow.emplace([](){ std::cout << "ready\n"; })
                          .name("starting pipeline");
  tf::Task task = taskflow.composed_of(pl)
                          .name("pipeline");
  tf::Task stop = taskflow.emplace([](){ std::cout << "stopped\n"; })
                          .name("pipeline stopped");

  // create task dependency
  init.precede(task);
  task.precede(stop);

  // dump the pipeline graph structure (with composition)
  taskflow.dump(std::cout);

  // run the pipeline
  executor.run(taskflow).wait();

  return 0;
}
```

## Find a Topological Order of the Graph

The first step is to find a valid topological order of the graph, so that we can transform the graph dependencies into a linear sequence. In this example, we simply hard-code it (the Kahn-style sketch in the previous section shows how such an order could be computed programmatically):

```cpp
const std::vector<std::string> nodes = {"A", "B", "C"};
```

## Define the Stage Function

This particular workload does not propagate data directly through the pipeline. In most situations, data is stored directly in a custom graph data structure, and the stage function just needs to know which node to process. For demo's sake, we simply output a message to show which stage function is processing which node:

```cpp
// 1st-stage function
void f1(const std::string& node) {
  printf("f1(%s)\n", node.c_str());
}

// 2nd-stage function
void f2(const std::string& node) {
  printf("f2(%s)\n", node.c_str());
}

// 3rd-stage function
void f3(const std::string& node) {
  printf("f3(%s)\n", node.c_str());
}
```

A key advantage of Taskflow's pipeline programming model is that it does not impose any data abstraction but gives users full control over data management, which is typically application-dependent. In an application like this graph processing pipeline, data is managed in a global custom graph data structure, and any data abstraction provided by the library can become an unnecessary overhead; the sketch below illustrates this application-managed style of data storage.
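To make the idea concrete, the following hypothetical sketch keeps per-node data in an application-owned map keyed by node name, so a stage function only needs the node identifier (`NodeData`, `graph_data`, and the stage semantics here are assumptions for illustration, not part of this example):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// hypothetical per-node storage: each stage reads the output of the
// previous stage and writes its own result
struct NodeData {
  std::vector<double> inputs;    // consumed by f1
  std::vector<double> features;  // produced by f1, consumed by f2
  double score = 0.0;            // produced by f2, consumed by f3
};

// application-owned global graph data, fully populated before the
// pipeline runs so that concurrent lookups never mutate the map
std::unordered_map<std::string, NodeData> graph_data;

// a stage function locates its data by node name alone
void f2(const std::string& node) {
  auto& d = graph_data.at(node);
  d.score = 0.0;
  for(double x : d.features) {
    d.score += x;
  }
}
```

Because the pipes are serial and each scheduling token corresponds to a distinct node, concurrently running stages touch disjoint NodeData entries; as long as graph_data is populated up front, these lookups and updates do not race.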
## Define the Pipes

The pipe structure is straightforward. Each pipe encapsulates the corresponding stage function and passes the node as the function argument. The first pipe stops the pipeline scheduling when it has processed all nodes. To identify which node is being processed at a running pipe, we use tf::Pipeflow::token to find the index:

```cpp
// first pipe calls f1
tf::Pipe{tf::PipeType::SERIAL, [&](tf::Pipeflow& pf) {
  if(pf.token() == nodes.size()) {
    pf.stop();
  }
  else {
    f1(nodes[pf.token()]);
  }
}},

// second pipe calls f2
tf::Pipe{tf::PipeType::SERIAL, [&](tf::Pipeflow& pf) {
  f2(nodes[pf.token()]);
}},

// third pipe calls f3
tf::Pipe{tf::PipeType::SERIAL, [&](tf::Pipeflow& pf) {
  f3(nodes[pf.token()]);
}}
```

## Define the Task Graph

To build up the taskflow for the pipeline, we create a module task with the defined pipeline structure and connect it with two tasks that output helper messages before and after the pipeline:

```cpp
tf::Task init = taskflow.emplace([](){ std::cout << "ready\n"; })
                        .name("starting pipeline");
tf::Task task = taskflow.composed_of(pl)
                        .name("pipeline");
tf::Task stop = taskflow.emplace([](){ std::cout << "stopped\n"; })
                        .name("pipeline stopped");

init.precede(task);
task.precede(stop);
```

## Submit the Task Graph

Finally, we submit the taskflow to the executor and run it once:

```cpp
executor.run(taskflow).wait();
```

Three possible outputs are shown below:

```
# possible output 1
ready
f1(A)
f2(A)
f1(B)
f2(B)
f3(A)
f1(C)
f2(C)
f3(B)
f3(C)
stopped

# possible output 2
ready
f1(A)
f2(A)
f3(A)
f1(B)
f2(B)
f3(B)
f1(C)
f2(C)
f3(C)
stopped

# possible output 3
ready
f1(A)
f2(A)
f3(A)
f1(B)
f2(B)
f1(C)
f2(C)
f3(B)
f3(C)
stopped
```

## Reference

We have applied the graph processing pipeline technique to speed up a circuit analysis problem. Details can be found in our publication below:

- Cheng-Hsiang Chiu and Tsung-Wei Huang, "Efficient Timing Propagation with Simultaneous Structural and Pipeline Parallelisms," ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, 2022.