Text Processing Pipeline

We study a text processing pipeline that finds the most frequent character of each string from an input source. Parallelism is expressed as a three-stage pipeline that transforms each input string into a final pair type.

Formulate the Text Processing Pipeline Problem

Given an input vector of strings, we want to compute the most frequent character for each string using a series of transform operations. For example:

# input strings
abade ddddf eefge xyzzd ijjjj jiiii kkijk

# output
a:2 d:4 e:3 z:2 j:4 i:4 k:3

We decompose the algorithm into three stages:

1. read a std::string from the input vector
2. generate a std::unordered_map<char, size_t> frequency map from the string
3. reduce the frequency map to a std::pair<char, size_t> of the most frequent character and its count

The first and the third stages process inputs and generate results serially, while the second stage can run in parallel. The algorithm is a perfect fit for pipeline parallelism, as different stages can overlap with each other in time across parallel lines.

Create a Text Processing Pipeline

We create a pipeline of three pipes (stages) and two parallel lines to solve the problem. The number of parallel lines is a tunable parameter; in most cases, we can simply use std::thread::hardware_concurrency as the line count. The first pipe reads an input string from the vector in order, the second pipe transforms the input string from the first pipe into a frequency map in parallel, and the third pipe reduces the frequency map to find the most frequent character.
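To make the per-string work of the second and third pipes concrete, the following sequential sketch computes the frequency map and the most frequent character for a single string. The helper names to_frequency_map and most_frequent are illustrative only and not part of Taskflow:

#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>

// build a frequency map from a string (what the second pipe does per token)
std::unordered_map<char, size_t> to_frequency_map(const std::string& s) {
  std::unordered_map<char, size_t> map;
  for(char c : s) {
    map[c]++;
  }
  return map;
}

// reduce the map to the most frequent character (what the third pipe does per token)
std::pair<char, size_t> most_frequent(const std::unordered_map<char, size_t>& map) {
  return *std::max_element(map.begin(), map.end(), [](auto& a, auto& b) {
    return a.second < b.second;
  });
}

The pipeline below distributes this per-string work across parallel lines while keeping the input and output order intact.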
The overall implementation is shown below:

#include <taskflow/taskflow.hpp>
#include <taskflow/algorithm/pipeline.hpp>

// Function: format the map
std::string format_map(const std::unordered_map<char, size_t>& map) {
  std::ostringstream oss;
  for(const auto& [i, j] : map) {
    oss << i << ':' << j << ' ';
  }
  return oss.str();
}

int main() {

  tf::Taskflow taskflow("text-filter pipeline");
  tf::Executor executor;

  const size_t num_lines = 2;

  // input data
  std::vector<std::string> input = {
    "abade",
    "ddddf",
    "eefge",
    "xyzzd",
    "ijjjj",
    "jiiii",
    "kkijk"
  };

  // custom data storage
  using data_type = std::variant<
    std::string, std::unordered_map<char, size_t>, std::pair<char, size_t>
  >;
  std::array<data_type, num_lines> mybuffer;

  // the pipeline consists of three pipes (serial-parallel-serial)
  // and up to two concurrent scheduling tokens
  tf::Pipeline pl(num_lines,
    // first pipe processes the input data
    tf::Pipe{tf::PipeType::SERIAL, [&](tf::Pipeflow& pf) {
      if(pf.token() == input.size()) {
        pf.stop();
      }
      else {
        printf("stage 1: input token = %s\n", input[pf.token()].c_str());
        mybuffer[pf.line()] = input[pf.token()];
      }
    }},

    // second pipe counts the frequency of each character
    tf::Pipe{tf::PipeType::PARALLEL, [&](tf::Pipeflow& pf) {
      std::unordered_map<char, size_t> map;
      for(auto c : std::get<std::string>(mybuffer[pf.line()])) {
        map[c]++;
      }
      printf("stage 2: map = %s\n", format_map(map).c_str());
      mybuffer[pf.line()] = map;
    }},

    // third pipe reduces the most frequent character
    tf::Pipe{tf::PipeType::SERIAL, [&mybuffer](tf::Pipeflow& pf) {
      auto& map = std::get<std::unordered_map<char, size_t>>(mybuffer[pf.line()]);
      auto sol = std::max_element(map.begin(), map.end(), [](auto& a, auto& b) {
        return a.second < b.second;
      });
      printf("stage 3: %c:%zu\n", sol->first, sol->second);
      // not necessary to store the last-stage data, just for demo purpose
      mybuffer[pf.line()] = *sol;
    }}
  );

  // build the pipeline graph using composition
  tf::Task init = taskflow.emplace([](){ std::cout << "ready\n"; })
                          .name("starting pipeline");
  tf::Task task = taskflow.composed_of(pl)
                          .name("pipeline");
  tf::Task stop = taskflow.emplace([](){ std::cout << "stopped\n"; })
                          .name("pipeline stopped");

  // create task dependency
  init.precede(task);
  task.precede(stop);

  // dump the pipeline graph structure (with composition)
  taskflow.dump(std::cout);

  // run the pipeline
  executor.run(taskflow).wait();

  return 0;
}

Define the Data Buffer

Taskflow does not provide any data abstraction to perform pipeline scheduling, but gives users full control over data management in their applications. In this example, we create a one-dimensional buffer of a std::variant data type to store the output of each pipe in a uniform storage:

using data_type = std::variant<
  std::string, std::unordered_map<char, size_t>, std::pair<char, size_t>
>;
std::array<data_type, num_lines> mybuffer;

A one-dimensional buffer is sufficient because Taskflow enables only one scheduling token per line at a time.

Define the Pipes

The first pipe reads one string and puts it in the corresponding entry of the buffer, mybuffer[pf.line()]. Since we read each string in order, we declare the pipe as a serial type:

tf::Pipe{tf::PipeType::SERIAL, [&](tf::Pipeflow& pf) {
  if(pf.token() == input.size()) {
    pf.stop();
  }
  else {
    mybuffer[pf.line()] = input[pf.token()];
    printf("stage 1: input token = %s\n", input[pf.token()].c_str());
  }
}},

The second pipe gets the input string from the previous pipe and transforms it into a frequency map that records the occurrence of each character in the string.
As multiple transforms can operate simultaneously, we declare the pipe as a parallel type:

tf::Pipe{tf::PipeType::PARALLEL, [&](tf::Pipeflow& pf) {
  std::unordered_map<char, size_t> map;
  for(auto c : std::get<std::string>(mybuffer[pf.line()])) {
    map[c]++;
  }
  mybuffer[pf.line()] = map;
  printf("stage 2: map = %s\n", format_map(map).c_str());
}}

Similarly, the third pipe gets the input frequency map from the previous pipe and reduces it to find the most frequent character. We may not need to store the result in the buffer; it can go to other places defined by the application (e.g., an output file). As we want to output the results in the same order as the input, we declare the pipe as a serial type:

tf::Pipe{tf::PipeType::SERIAL, [&mybuffer](tf::Pipeflow& pf) {
  auto& map = std::get<std::unordered_map<char, size_t>>(mybuffer[pf.line()]);
  auto sol = std::max_element(map.begin(), map.end(), [](auto& a, auto& b) {
    return a.second < b.second;
  });
  printf("stage 3: %c:%zu\n", sol->first, sol->second);
}}

Define the Task Graph

To build up the taskflow graph for the pipeline, we create a module task out of the pipeline structure and connect it with two tasks that output messages before and after the pipeline:

tf::Task init = taskflow.emplace([](){ std::cout << "ready\n"; })
                        .name("starting pipeline");
tf::Task task = taskflow.composed_of(pl)
                        .name("pipeline");
tf::Task stop = taskflow.emplace([](){ std::cout << "stopped\n"; })
                        .name("pipeline stopped");

init.precede(task);
task.precede(stop);

Submit the Task Graph

Finally, we submit the taskflow to the executor and run it once:

executor.run(taskflow).wait();

As the second stage is a parallel pipe, the output may interleave. One possible result is shown below:

ready
stage 1: input token = abade
stage 1: input token = ddddf
stage 2: map = f:1 d:4
stage 2: map = e:1 d:1 a:2 b:1
stage 3: a:2
stage 1: input token = eefge
stage 2: map = g:1 e:3 f:1
stage 3: d:4
stage 1: input token = xyzzd
stage 3: e:3
stage 1: input token = ijjjj
stage 2: map = z:2 x:1 d:1 y:1
stage 3: z:2
stage 1: input token = jiiii
stage 2: map = j:4 i:1
stage 3: j:4
stage 2: map = i:4 j:1
stage 1: input token = kkijk
stage 3: i:4
stage 2: map = j:1 k:3 i:1
stage 3: k:3
stopped

We can see seven outputs at the third stage that show the most frequent character for each of the seven strings in order (a:2, d:4, e:3, z:2, j:4, i:4, k:3). The taskflow graph of this pipeline workload is shown below: