Tomcatv is a 200-line mesh generation program from the SPEC92 floating-point benchmark suite. Figure 13 shows the resulting speedups for each version of tomcatv.
Tomcatv contains several loop nests that have dependences across the rows of the arrays and other loop nests that have no dependences. Since the base version always parallelizes the outermost parallel loop, each processor accesses a block of array columns in the loop nests with no dependences. However, in the loop nests with row dependences, each processor accesses a block of array rows. As a result, there is little opportunity for data re-use across loop nests. Also, there is poor cache performance in the row-dependent loop nests because the data accessed by each processor is not contiguous in the shared address space.
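The contiguity argument above can be illustrated with a small sketch. It assumes column-major (Fortran-style) array storage, a hypothetical 8 x 8 array, and 4 processors; the helper names are illustrative and are not taken from the compiler.

```python
def column_major_offset(i, j, n_rows):
    """Linear offset of element (i, j) in a column-major array with n_rows rows."""
    return i + j * n_rows


def block_offsets(rows, cols, n_rows):
    """Sorted linear offsets of every element in the given rows and columns."""
    return sorted(column_major_offset(i, j, n_rows) for j in cols for i in rows)


def is_contiguous(sorted_offsets):
    """True if the sorted offsets form one unbroken range of addresses."""
    return all(b - a == 1 for a, b in zip(sorted_offsets, sorted_offsets[1:]))


# Assumed sizes for illustration only.
N = 8           # the array is N x N
P = 4           # number of processors
BLOCK = N // P  # rows (or columns) owned by each processor

proc = 1
owned = range(proc * BLOCK, (proc + 1) * BLOCK)

# Block of columns: what a processor touches in the dependence-free loop nests.
column_block = block_offsets(range(N), owned, N)
# Block of rows: what a processor touches in the row-dependent loop nests.
row_block = block_offsets(owned, range(N), N)

print("column block contiguous:", is_contiguous(column_block))
print("row block contiguous:", is_contiguous(row_block))
```

Under these assumptions a block of columns occupies a single address range, whereas a block of rows is scattered in short runs separated by a stride of N elements.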
The computation decomposition pass of the compiler selects a computation decomposition in which each processor always accesses a block of rows. The row-dependent loop nests still execute completely in parallel. This version of tomcatv exhibits good temporal locality; however, the speedups remain poor because each processor's rows are still not contiguous in memory, which leads to poor cache behavior. Transforming the data to make each processor's rows contiguous improves the cache performance. Whereas the maximum speedup achieved by the base version is 5, the fully optimized tomcatv achieves a speedup of 18.
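One possible form of such a data transformation is sketched below. It assumes column-major storage, a row count divisible by the number of processors, and a remapping that splits the row index into an (owner, local row) pair and stores each owner's rows together; the function names and sizes are hypothetical, and the compiler's actual layout may differ.

```python
def transformed_offset(i, j, n_cols, block):
    """Offset of element (i, j) after grouping each processor's rows together.

    The row index i is split into the owning processor (i // block) and a
    local row (i % block); the owner becomes the slowest-varying dimension.
    """
    owner = i // block
    local_row = i % block
    return local_row + block * j + block * n_cols * owner


def owned_offsets(proc, n_cols, block):
    """Sorted offsets of all elements in the rows owned by processor proc."""
    rows = range(proc * block, (proc + 1) * block)
    return sorted(
        transformed_offset(i, j, n_cols, block)
        for i in rows
        for j in range(n_cols)
    )


# Assumed sizes for illustration only.
N = 8           # the array is N x N
P = 4           # number of processors
BLOCK = N // P  # rows owned by each processor

layout = [owned_offsets(p, N, BLOCK) for p in range(P)]
for p, offsets in enumerate(layout):
    print(f"processor {p}: offsets {offsets[0]}..{offsets[-1]}")
```

With this mapping each processor's block of rows occupies one unbroken address range, and the mapping remains a one-to-one reordering of the original elements.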
Figure 13: Tomcatv Speedups
Summary of Experimental Results