mesytec-mnode/external/taskflow-3.8.0/docs/xml/opentimer.xml

<?xml version='1.0' encoding='UTF-8' standalone='no'?>
<doxygen xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="compound.xsd" version="1.9.1" xml:lang="en-US">
  <compounddef id="opentimer" kind="page">
    <compoundname>opentimer</compoundname>
    <title>Static Timing Analysis</title>
    <tableofcontents>
      <tocsect>
        <name>OpenTimer: A High-performance Timing Analysis Tool</name>
        <reference>opentimer_1UseCasesOpenTimer</reference>
    </tocsect>
      <tocsect>
        <name>Programming Effort</name>
        <reference>opentimer_1UseCaseOpenTimerProgrammingEffort</reference>
    </tocsect>
      <tocsect>
        <name>Performance Improvement</name>
        <reference>opentimer_1UseCaseOpenTimerPerformanceImprovement</reference>
    </tocsect>
      <tocsect>
        <name>Conclusion</name>
        <reference>opentimer_1UseCaseOpenTimerConclusion</reference>
    </tocsect>
      <tocsect>
        <name>References</name>
        <reference>opentimer_1UseCaseOpenTimerReferences</reference>
    </tocsect>
    </tableofcontents>
    <briefdescription>
    </briefdescription>
    <detaileddescription>
<para>We have applied Taskflow to solve a real-world VLSI static timing analysis problem that incorporates hundreds of millions of tasks and dependencies. The goal is to analyze the timing behavior of a design.</para>
<sect1 id="opentimer_1UseCasesOpenTimer">
<title>OpenTimer: A High-performance Timing Analysis Tool</title>
<para>Static timing analysis (STA) is an important step in the overall chip design flow. It verifies the static behavior of a circuit design and ensure its correct functionality under the given clock speed. However, efficient parallel timing analysis is extremely challenging to design and implement, due to large irregularity and graph-oriented computing. The following figure shows an extracted timing graph from an industrial design.</para>
<para><image type="html" name="opentimer_1.png"></image>
</para>
<para>We consider our research project <ulink url="https://github.com/OpenTimer/OpenTimer">OpenTimer</ulink>, an open-source static timing analyzer that has been used in many industrial and academic projects. The first release v1 in 2015 implemented the <emphasis>pipeline-based levelization</emphasis> algorithm using the OpenMP 4.5 task dependency clause. To overcome the performance bottleneck caused by pipeline, we rewrote the core incremental timing engine using Taskflow in the second release v2.</para>
</sect1>
<sect1 id="opentimer_1UseCaseOpenTimerProgrammingEffort">
<title>Programming Effort</title>
<para>The table below measures the software costs of two OpenTimer versions using the Linux tool <ulink url="https://dwheeler.com/sloccount/">SLOCCount</ulink>. In OpenTimer v2, a large amount of exhaustive OpenMP dependency clauses that were used to carry out task dependencies are now replaced with only a few lines of flexible Taskflow code (9123 vs 4482). The maximum cyclomatic complexity in a single function is reduced from 58 to 20, due to Taskflow&apos;s programmability.</para>
<para> <table rows="3" cols="5"><row>
<entry thead="yes" align='center'><para>Tool   </para>
</entry><entry thead="yes" align='center'><para>Task Model   </para>
</entry><entry thead="yes" align='center'><para>Lines of Code   </para>
</entry><entry thead="yes" align='center'><para>Cyclomatic Complexity   </para>
</entry><entry thead="yes" align='center'><para>Cost    </para>
</entry></row>
<row>
<entry thead="no" align='center'><para>OpenTimer v1   </para>
</entry><entry thead="no" align='center'><para>OpenMP 4.5   </para>
</entry><entry thead="no" align='center'><para>9123   </para>
</entry><entry thead="no" align='center'><para>58   </para>
</entry><entry thead="no" align='center'><para>$275,287    </para>
</entry></row>
<row>
<entry thead="no" align='center'><para>OpenTimer v2   </para>
</entry><entry thead="no" align='center'><para>Taskflow   </para>
</entry><entry thead="no" align='center'><para>4482   </para>
</entry><entry thead="no" align='center'><para>20   </para>
</entry><entry thead="no" align='center'><para>$130,523   </para>
</entry></row>
</table>
</para>
<para>OpenTimer v1 relied on a pipeline data structure to adtop loop parallelism with OpenMP. We found it very difficult to go beyond this paradigm because of the insufficient support for dynamic dependencies in OpenMP. With Taskflow in place, we can break this bottleneck and easily model both static and dynamic task dependencies at programming time and runtime. The task dependency graph flows computations naturally with the timing graph, providing improved asynchrony and performance. The following figure shows a task graph to carry one iteration of timing update.</para>
<para><dotfile name="/home/thuang295/Code/taskflow/doxygen/images/opentimer_4.dot"></dotfile>
</para>
</sect1>
<sect1 id="opentimer_1UseCaseOpenTimerPerformanceImprovement">
<title>Performance Improvement</title>
<para>We compare the performance between OpenTimer v1 and v2. We evaluated the runtime versus incremental iterations under 16 CPUs on two industrial circuit designs tv80 (5.3K gates and 5.3K nets) and vga_lcd (139.5K gates and 139.6K nets) with 45nm NanGate cell libraris. Each incremental iteration refers a design modification followed by a timing query to trigger a timing update. In v1, this includes the time to reconstruct the data structure required by OpenMP to alter the task dependencies. In v2, this includes the time to create and launch a new task dependency graph</para>
<para><image type="html" name="opentimer_2.png"></image>
</para>
<para>The scalability of Taskflow is shown in the Figure below. We used two million-scale designs, netcard (1.4M gates) and leon3mp (1.2M gates) to evaluate the runtime of v1 and v2 across different number of CPUs. There are two important observations. First, v2 is slightly slower than v1 at one CPU (3-4%), where all OpenMP&apos;s constructs are literally disabled. This shows the graph overhead of Taskflow; yet it is negligible. Second, v2 is consistently faster than v1 regardless of CPU numbers except one. This highlights that Taskflow&apos;s programming model largely improves the design of a parallel VLSI timing engine that would otherwise not be possible with OpenMP.</para>
<para><image type="html" name="opentimer_3.png"></image>
</para>
</sect1>
<sect1 id="opentimer_1UseCaseOpenTimerConclusion">
<title>Conclusion</title>
<para>Programming models matter. Different models give different implementations. The parallel code sections may run fast, yet the data structures to support a parallel decomposition strategy may outweigh its parallelism benefits. In OpenTimer v1, loop-based OpenMP code is very fast. But it&apos;s too costly to maintain the pipeline data structure over iterations.</para>
</sect1>
<sect1 id="opentimer_1UseCaseOpenTimerReferences">
<title>References</title>
<para><itemizedlist>
<listitem><para>Tsung-Wei Huang, Guannan Guo, Chun-Xun Lin, and Martin Wong, "<ulink url="https://tsung-wei-huang.github.io/papers/tcad21-ot2.pdf">OpenTimer v2: A New Parallel Incremental Timing Analysis Engine</ulink>," <emphasis>IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD)</emphasis>, vol. 40, no. 4, pp. 776-786, April 2021 </para>
</listitem>
<listitem><para>Tsung-Wei Huang, Chun-Xun Lin, Guannan Guo, and Martin Wong, "<ulink url="ipdps19.pdf">Cpp-Taskflow: Fast Task-based Parallel Programming using Modern C++</ulink>," <emphasis>IEEE International Parallel and Distributed Processing Symposium (IPDPS)</emphasis>, pp. 974-983, Rio de Janeiro, Brazil, 2019. </para>
</listitem>
<listitem><para>Tsung-Wei Huang and Martin Wong, "<ulink url="huang_15_01.pdf">OpenTimer: A High-Performance Timing Analysis Tool</ulink>," <emphasis>IEEE/ACM International Conference on Computer-Aided Design (ICCAD)</emphasis>, pp. 895-902, Austin, TX, 2015. </para>
</listitem>
</itemizedlist>
</para>
</sect1>
    </detaileddescription>
    <location file="doxygen/usecases/opentimer.dox"/>
  </compounddef>
</doxygen>