matrix_multiplication Matrix Multiplication Problem Formulation matrix_multiplication_1MatrixMultiplicationProblem Parallel Patterns matrix_multiplication_1MatrixMultiplicationParallelPattern Benchmarking matrix_multiplication_1MatrixMultiplicationBenchmarking We study the classic problem, 2D matrix multiplication. We will start with a short introduction about the problem and then discuss how to solve it parallel CPUs. Problem Formulation We are multiplying two matrices, A (MxK) and B (KxN). The numbers of columns of A must match the number of rows of B. The output matrix C has the shape of (MxN) where M is the rows of A and N the columns of B. The following example multiplies a 3x3 matrix with a 3x2 matrix to derive a 3x2 matrix. As a general view, for each element of C we iterate a complete row of A and a complete column of B, multiplying each element and summing them. We can implement matrix multiplication using three nested loops. for(intm=0;m<M;m++){ for(intn=0;n<N;n++){ C[m][n]=0; for(intk=0;k<K;k++){ C[m][n]+=A[m][k]*B[k][n]; } } } Parallel Patterns At a fine-grained level, computing each element of C is independent of each other. Similarly, computing each row of C or each column of C is also independent of one another. With task parallelism, we prefer coarse-grained model to have each task perform rather large computation to amortize the overhead of creating and scheduling tasks. In this case, we avoid intensive tasks each working on only a single element. by creating a task per row of C to multiply a row of A by every column of B. //C=A*B //AisaMxKmatrix,BisaKxNmatrix,andCisaMxNmatrix voidmatrix_multiplication(int**A,int**B,int**C,intM,intK,intN){ tf::Taskflowtaskflow; tf::Executorexecutor; for(intm=0;m<M;++m){ taskflow.emplace([m,&](){ for(intn=0;n<N;n++){ for(intk=0;k<K;k++){ C[m][n]+=A[m][k]*B[k][n];//innerproduct } } }); } executor.run(taskflow).wait(); } Instead of creating tasks one-by-one over a loop, you can leverage Taskflow::for_each_index to create a parallel-for task. A parallel-for task spawns a subflow to perform parallel iterations over the given range. //performparalleliterationsontherange[0,M)withthestepsizeof1 tf::Tasktask=taskflow.for_each_index(0,M,1,[&](intm){ for(intn=0;n<N;n++){ for(intk=0;k<K;k++){ C[m][n]+=A[m][k]*B[k][n]; } } }); Please visit Parallel Iterations for more details. Benchmarking Based on the discussion above, we compare the runtime of computing various matrix sizes of A, B, and C between a sequential CPU and parallel CPUs on a machine of 12 Intel i7-8700 CPUs at 3.2 GHz. A B C CPU Sequential CPU Parallel 10x10 10x10 10x10 0.142 ms 0.414 ms 100x100 100x100 100x100 1.641 ms 0.733 ms 1000x1000 1000x1000 1000x1000 1532 ms 504 ms 2000x2000 2000x2000 2000x2000 25688 ms 4387 ms 3000x3000 3000x3000 3000x3000 104838 ms 16170 ms 4000x4000 4000x4000 4000x4000 250133 ms 39646 ms
The speed-up of parallel execution becomes clean as we increase the problem size. For example, at 4000x4000, the parallel runtime is 6.3 times faster than the sequential runtime.