Release 3.6.0 (2023/05/07)

Taskflow 3.6.0 is the 7th release in the 3.x line! This release includes several new changes, such as dynamic task graph parallelism, improved parallel algorithms, a modified GPU tasking interface, documentation, examples, and unit tests.

Download

Taskflow 3.6.0 can be downloaded from here.

System Requirements

To use Taskflow v3.6.0, you need a compiler that supports C++17:

- GNU C++ Compiler at least v8.4 with -std=c++17
- Clang C++ Compiler at least v6.0 with -std=c++17
- Microsoft Visual Studio at least v19.27 with /std:c++17
- AppleClang Xcode Version at least v12.0 with -std=c++17
- Nvidia CUDA Toolkit and Compiler (nvcc) at least v11.1 with -std=c++17
- Intel C++ Compiler at least v19.0.1 with -std=c++17
- Intel DPC++ Clang Compiler at least v13.0.0 with -std=c++17 and SYCL20

Taskflow works on Linux, Windows, and Mac OS X.

Release Summary

This release contains several changes that greatly enhance the programmability of GPU tasking and standard parallel algorithms. More importantly, we have introduced a new dependent asynchronous tasking model that offers great flexibility for expressing dynamic task graph parallelism.

New Features

Taskflow Core

- Added new async methods to support dynamic task graph creation
  - tf::Executor::dependent_async(F&& func, Tasks&&... tasks)
  - tf::Executor::dependent_async(F&& func, I first, I last)
  - tf::Executor::silent_dependent_async(F&& func, Tasks&&... tasks)
  - tf::Executor::silent_dependent_async(F&& func, I first, I last)
- Added new async and join methods to tf::Runtime
  - tf::Runtime::async
  - tf::Runtime::silent_async
  - tf::Runtime::corun_all
- Added a new partitioner interface to optimize parallel algorithms
  - tf::GuidedPartitioner
  - tf::StaticPartitioner
  - tf::DynamicPartitioner
  - tf::RandomPartitioner
- Added parallel-scan algorithms to Taskflow
  - tf::Taskflow::inclusive_scan(B first, E last, D d_first, BOP bop)
  - tf::Taskflow::inclusive_scan(B first, E last, D d_first, BOP bop, T init)
  - tf::Taskflow::transform_inclusive_scan(B first, E last, D d_first, BOP bop, UOP uop)
  - tf::Taskflow::transform_inclusive_scan(B first, E last, D d_first, BOP bop, UOP uop, T init)
  - tf::Taskflow::exclusive_scan(B first, E last, D d_first, T init, BOP bop)
  - tf::Taskflow::transform_exclusive_scan(B first, E last, D d_first, T init, BOP bop, UOP uop)
- Added parallel-find algorithms to Taskflow
  - tf::Taskflow::find_if(B first, E last, T& result, UOP predicate, P&& part)
  - tf::Taskflow::find_if_not(B first, E last, T& result, UOP predicate, P&& part)
  - tf::Taskflow::min_element(B first, E last, T& result, C comp, P&& part)
  - tf::Taskflow::max_element(B first, E last, T& result, C comp, P&& part)
- Made tf::Subflow a derived class of tf::Runtime
- Extended parallel algorithms to support different partitioning algorithms
  - tf::Taskflow::for_each_index(B first, E last, S step, C callable, P&& part)
  - tf::Taskflow::for_each(B first, E last, C callable, P&& part)
  - tf::Taskflow::transform(B first1, E last1, O d_first, C c, P&& part)
  - tf::Taskflow::transform(B1 first1, E1 last1, B2 first2, O d_first, C c, P&& part)
  - tf::Taskflow::reduce(B first, E last, T& result, O bop, P&& part)
  - tf::Taskflow::transform_reduce(B first, E last, T& result, BOP bop, UOP uop, P&& part)
- Improved the performance of tf::Taskflow::sort for plain-old-data (POD) types
- Extended
task-parallel pipeline to handle token dependencies (see Task-parallel Pipeline with Token Dependencies)

cudaFlow

- Removed algorithms that require a buffer from tf::cudaFlow due to update limitations
- Removed support for a dedicated cudaFlow task in Taskflow
- All usage of tf::cudaFlow and tf::cudaFlowCapturer is now standalone

Utilities

- Added all_same templates to check whether all types in a parameter pack are the same

Taskflow Profiler (TFProf)

- Removed cudaFlow and syclFlow tasks

Bug Fixes

- Fixed the compilation error caused by MAX_PRIORITY clashing with winspool.h (#459)
- Fixed the compilation error caused by tf::TaskView::for_each_successor and tf::TaskView::for_each_dependent
- Fixed the infinite-loop bug when corunning a module task from tf::Runtime

If you encounter any potential bugs, please submit an issue at the issue tracker.

Breaking Changes

- Dropped support for cancelling asynchronous tasks

```cpp
// previous - no longer supported
tf::Future<int> fu = executor.async([](){ return 1; });
fu.cancel();
std::optional<int> res = fu.get();  // res may be std::nullopt or 1

// now - use std::future instead
std::future<int> fu = executor.async([](){ return 1; });
int res = fu.get();
```

- Dropped in-place support for running tf::cudaFlow from a dedicated task

```cpp
// previous - no longer supported
taskflow.emplace([](tf::cudaFlow& cf){
  cf.offload();
});

// now - the user fully controls tf::cudaFlow for maximum flexibility
taskflow.emplace([](){
  tf::cudaFlow cf;

  // offload the cudaFlow asynchronously through a stream
  tf::cudaStream stream;
  cf.run(stream);

  // wait for the cudaFlow to complete
  stream.synchronize();
});
```

- Dropped in-place support for running tf::cudaFlowCapturer from a dedicated task

```cpp
// previous - no longer supported
taskflow.emplace([](tf::cudaFlowCapturer& cf){
  cf.offload();
});

// now - the user fully controls tf::cudaFlowCapturer for maximum flexibility
taskflow.emplace([](){
  tf::cudaFlowCapturer cf;

  // offload the cudaFlowCapturer asynchronously through a stream
  tf::cudaStream stream;
  cf.run(stream);

  // wait for the cudaFlowCapturer to complete
  stream.synchronize();
});
```

- Dropped in-place support for running
tf::syclFlow from a dedicated task; SYCL can be used out of the box together with Taskflow
- Moved all buffer query methods of CUDA standard algorithms inside the execution policy
  - tf::cudaExecutionPolicy<NT, VT>::reduce_bufsz
  - tf::cudaExecutionPolicy<NT, VT>::scan_bufsz
  - tf::cudaExecutionPolicy<NT, VT>::merge_bufsz
  - tf::cudaExecutionPolicy<NT, VT>::min_element_bufsz
  - tf::cudaExecutionPolicy<NT, VT>::max_element_bufsz

```cpp
// previous - no longer supported
tf::cuda_reduce_buffer_size<tf::cudaDefaultExecutionPolicy, int>(N);

// now (and similarly for other parallel algorithms)
tf::cudaDefaultExecutionPolicy policy(stream);
policy.reduce_bufsz<int>(N);
```

- Renamed tf::Executor::run_and_wait to tf::Executor::corun for expressiveness
- Renamed tf::Executor::loop_until to tf::Executor::corun_until for expressiveness
- Renamed tf::Runtime::run_and_wait to tf::Runtime::corun for expressiveness
- Disabled argument support for all asynchronous tasking features; users are responsible for wrapping the arguments into the callable themselves

```cpp
// previous - async allows passing arguments to the callable
executor.async([](int i){ std::cout << i << std::endl; }, 4);

// now - users are responsible for wrapping the arguments into a callable
executor.async([i=4](){ std::cout << i << std::endl; });
```

- Replaced named_async with an overload that takes the name string as the first argument

```cpp
// previous - explicitly calling named_async to assign a name to an async task
executor.named_async("name", [](){});

// now - overload
executor.async("name", [](){});
```

Documentation

- Revised Request Cancellation to remove support for cancelling async tasks
- Revised Asynchronous Tasking to include asynchronous tasking from tf::Runtime
  - Launch Asynchronous Tasks from a Runtime
- Revised Taskflow algorithms to include execution policy
  - Partitioning Algorithm
  - Parallel Iterations
  - Parallel Transforms
  - Parallel Reduction
- Revised CUDA standard algorithms to correct the use of buffer query methods
  - Parallel Reduction
  - Parallel Find
  - Parallel Merge
  - Parallel Scan
- Added Task-parallel Pipeline with Token Dependencies
- Added
Parallel Scan
- Added Asynchronous Tasking with Dependencies

Miscellaneous Items

We have published Taskflow in the following venues:

- Dian-Lun Lin, Yanqing Zhang, Haoxing Ren, Shih-Hsin Wang, Brucek Khailany and Tsung-Wei Huang, "GenFuzz: GPU-accelerated Hardware Fuzzing using Genetic Algorithm with Multiple Inputs," ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, 2023
- Tsung-Wei Huang, "qTask: Task-parallel Quantum Circuit Simulation with Incrementality," IEEE International Parallel and Distributed Processing Symposium (IPDPS), St. Petersburg, Florida, 2023
- Elmir Dzaka, Dian-Lun Lin, and Tsung-Wei Huang, "Parallel And-Inverter Graph Simulation Using a Task-graph Computing System," IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW), St. Petersburg, Florida, 2023

Please do not hesitate to contact Dr. Tsung-Wei Huang if you intend to collaborate with us on using Taskflow in your scientific computing projects.
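The parallel-scan algorithms listed under New Features take their arguments in the same order as the C++17 <numeric> scans: exclusive_scan takes init before bop, while the init overload of inclusive_scan takes it after bop. The following is a sequential reference sketch, not Taskflow code. It assumes the Taskflow versions produce the same values as std::inclusive_scan, std::exclusive_scan, and std::transform_inclusive_scan for the same inputs, which can be useful when checking parallel results in unit tests. The ref_* helpers and check_reference_scans are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Sequential reference results for the scan variants listed under New Features.
// Assumption: tf::Taskflow's scan algorithms compute the same values as these
// C++17 <numeric> calls when given the same arguments. This is NOT Taskflow code.

// Mirrors inclusive_scan(first, last, d_first, bop): out[i] = in[0] + ... + in[i]
std::vector<int> ref_inclusive_scan(const std::vector<int>& in) {
  std::vector<int> out(in.size());
  std::inclusive_scan(in.begin(), in.end(), out.begin(), std::plus<int>{});
  return out;
}

// Mirrors inclusive_scan(first, last, d_first, bop, init): init is folded in first
std::vector<int> ref_inclusive_scan_init(const std::vector<int>& in, int init) {
  std::vector<int> out(in.size());
  std::inclusive_scan(in.begin(), in.end(), out.begin(), std::plus<int>{}, init);
  return out;
}

// Mirrors exclusive_scan(first, last, d_first, init, bop): out[i] excludes in[i]
std::vector<int> ref_exclusive_scan(const std::vector<int>& in, int init) {
  std::vector<int> out(in.size());
  std::exclusive_scan(in.begin(), in.end(), out.begin(), init, std::plus<int>{});
  return out;
}

// Mirrors transform_inclusive_scan(first, last, d_first, bop, uop):
// uop (here: squaring) is applied to each element before it is accumulated
std::vector<int> ref_transform_inclusive_scan(const std::vector<int>& in) {
  std::vector<int> out(in.size());
  std::transform_inclusive_scan(in.begin(), in.end(), out.begin(),
                                std::plus<int>{}, [](int x) { return x * x; });
  return out;
}

// Usage check on {1, 2, 3, 4}; expected values worked out by hand
void check_reference_scans() {
  const std::vector<int> v{1, 2, 3, 4};
  assert((ref_inclusive_scan(v) == std::vector<int>{1, 3, 6, 10}));
  assert((ref_inclusive_scan_init(v, 10) == std::vector<int>{11, 13, 16, 20}));
  assert((ref_exclusive_scan(v, 0) == std::vector<int>{0, 1, 3, 6}));
  assert((ref_transform_inclusive_scan(v) == std::vector<int>{1, 5, 14, 30}));
}
```

In a Taskflow program, the corresponding task would be created with, for example, taskflow.inclusive_scan(first, last, d_first, bop) and run through an executor. Its output range could then be compared against these sequential references.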