A summary of the results is presented in Table 1. For each program, we compare the speedup on 32 processors obtained with the base compiler against that obtained with all optimizations enabled. We also indicate whether the computation decomposition and data decomposition optimizations are critical to the improved performance. Finally, we list the data decompositions found for the major arrays in each program. Unless otherwise noted, the other arrays in a program were aligned with the listed array of the same dimensionality.
Our experimental results demonstrate the need for memory optimizations on shared address space machines. The programs in our application suite are all highly parallelizable, yet their speedups on a 32-processor machine are rather moderate, ranging from 4 to 20. Our compiler finds many opportunities for improvement; the data and computation decompositions it chooses are often different from the conventional ones or from those obtained via local analysis. Finally, the results show that our algorithm is effective: the same set of programs now achieves 14- to 34-fold speedups on a 32-processor machine.