Vpenta is one of the kernels in nasa7, a program in the SPEC92 floating-point benchmark suite. This kernel simultaneously inverts three pentadiagonal matrices. The performance results are shown in Figure 4. The base compiler interchanges the loops in the original code so that the outer loop is parallelizable and the inner loop carries spatial locality. Without this loop interchange, the program would not achieve even the slight speedup obtained with the base compiler.
Figure 4: Vpenta Speedups
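To make the effect of the loop interchange concrete, the following is a minimal sketch in C. It does not reproduce the vpenta source: the array shape, the recurrence, and the names `sweep_original` and `sweep_interchanged` are hypothetical, chosen only to show a loop nest in which one index carries a recurrence and the other is independent. Because C arrays are row-major, the recurrence is placed along the contiguous (second) index.

```c
#include <stddef.h>

enum { ROWS = 4, COLS = 6 };  /* hypothetical sizes, for illustration only */

/* Original order: the outer loop (i) carries the recurrence, so it cannot
 * run in parallel, and the inner loop (j) strides across rows. */
void sweep_original(double x[ROWS][COLS], double c) {
    for (size_t i = 1; i < COLS; i++)
        for (size_t j = 0; j < ROWS; j++)
            x[j][i] += c * x[j][i - 1];
}

/* Interchanged order: the outer loop (j) iterates over independent rows and
 * is parallelizable; the inner loop (i) walks contiguous memory. The
 * dependence distance (1,0) in (i,j) order becomes (0,1) in (j,i) order,
 * which is still lexicographically positive, so the interchange is legal. */
void sweep_interchanged(double x[ROWS][COLS], double c) {
    for (size_t j = 0; j < ROWS; j++)
        for (size_t i = 1; i < COLS; i++)
            x[j][i] += c * x[j][i - 1];
}
```

Both versions perform the same operations on each row in the same order, so they compute identical results; only the traversal order, and with it the available parallelism and locality, differs.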
For this particular program, the base compiler's parallelization scheme is identical to the one produced by the global analysis in our computation decomposition algorithm. However, because the decomposition analysis also establishes that each processor accesses exactly the same partition of the arrays across the loops, the code generator can eliminate the barriers between some of the loops. This accounts for the slight performance improvement of the computation decomposition version over the base compiler.
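The barrier elimination described above can be illustrated with a modern analogue. The sketch below uses OpenMP, which is an assumption made for illustration and is not the code generator discussed here; the function name `scale_then_shift` and the two loop bodies are likewise hypothetical. It relies on the OpenMP rule that two worksharing loops in the same parallel region with the same iteration count and the same `schedule(static)` clause assign the same iterations to the same thread, so the barrier after the first loop can be dropped with `nowait`.

```c
/* Two consecutive parallel loops over the same index range.
 * Compiled without OpenMP support, the pragmas are ignored and the
 * code runs sequentially with the same result. */
void scale_then_shift(int n, const double *src, double *a, double *b) {
    #pragma omp parallel
    {
        /* Each thread writes its own static block of a[]. */
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * src[i];

        /* No barrier in between: with the identical static schedule, each
         * thread reads only the a[i] values it wrote itself. */
        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++)
            b[i] = a[i] + 1.0;
    }
}
```

If the second loop read `a[i + 1]` instead, a thread could need a value owned by its neighbor and the barrier would have to stay; the elimination is safe only because the partitions match exactly.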
This program operates on a set of two-dimensional and three-dimensional arrays. Each processor accesses a block of columns of the two-dimensional arrays, so no data reorganization is necessary for them. However, each plane of the three-dimensional arrays is partitioned into blocks of rows, each of which is accessed by a different processor. This presents an opportunity for our compiler to change the data layout so that the data accessed by each processor is contiguous. With the improved data layout, the program finally runs with a decent speedup. We observe that the performance dips slightly at about 16 processors and drops significantly at 32 processors. This degradation is due to increased cache conflicts among accesses within the same processor. Further data and computation optimizations that focus on operations within a single processor would be useful.
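The layout change for a row-partitioned plane can be sketched as an index mapping. The code below is an illustration under stated assumptions, not the compiler's actual transformation: it assumes a column-major plane of `n` rows by `m` columns, a block size `b` that divides `n` evenly, and the hypothetical function names `offset_column_major` and `offset_blocked`. In the original layout the rows owned by one processor are scattered through every column; in the blocked layout they occupy one contiguous range.

```c
#include <stddef.h>

/* Original column-major layout of an n-row plane: element (i, j). */
size_t offset_column_major(size_t i, size_t j, size_t n) {
    return i + j * n;
}

/* Blocked layout: rows are split into blocks of b rows, one block per
 * processor. Processor p = i / b owns offsets [p*b*m, (p+1)*b*m), and
 * within that range its b-by-m sub-plane is stored column-major.
 * Assumes b divides the number of rows evenly. */
size_t offset_blocked(size_t i, size_t j, size_t m, size_t b) {
    size_t p = i / b;      /* owning processor */
    size_t local = i % b;  /* row index within the processor's block */
    return p * b * m + j * b + local;
}
```

For example, with `n = 8`, `m = 4`, and `b = 2`, the processor that owns rows 2 and 3 touches offsets 2 through 27 in the original layout but exactly offsets 8 through 15 in the blocked layout.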