Erlebacher is a 600-line FORTRAN benchmark from ICASE that performs three-dimensional tridiagonal solves. It includes a number of fully parallel computations, interleaved with multi-dimensional reductions and computational wavefronts in all three dimensions caused by forward and backward substitutions. Partial derivatives are computed in all three dimensions with three-dimensional arrays. Figure 11 shows the resulting speedups for each version of Erlebacher.
Figure 11: Erlebacher Speedups
The base-line version always parallelizes the outermost parallel loop. This strategy yields local accesses in the first two phases of Erlebacher when computing partial derivatives in the X and Y dimensions, but ends up causing non-local accesses in the Z dimension.
The computation decomposition algorithm improves the performance of Erlebacher slightly over the base-line version. It finds a computation decomposition so that no non-local accesses are needed in the Z dimension. The major data structures in the program are the input array and DUX, DUY and DUZ which are used to store the partial derivatives in the X, Y and Z dimensions, respectively. Since it is only written once, the input array is replicated. Each processor accesses a block of columns for arrays DUX and DUY, and a block of rows for array DUZ. Thus in this version of the program, DUZ has poor spatial locality. The data transformation phase of the compiler restructures DUZ so that local references are contiguous in memory. Because two-thirds of the program is perfectly parallel with all local accesses, the optimizations only realize a modest performance improvement.