Even though shared address space machines have hardware support for coherence, getting good performance on these machines requires programmers to pay special attention to the memory hierarchy. Today, expert users manually restructure their codes and change their data structures to improve a program's locality of reference. This paper demonstrates that this optimization process can be automated. We have developed the first compiler system that fully automatically parallelizes sequential programs and changes the original array layouts. Our experimental results show that our algorithm can dramatically improve the performance of parallel programs on shared address space machines.
The concepts described in this paper are useful for purposes other than translating sequential code to shared memory multiprocessors. Our algorithm for determining how to parallelize and distribute the computation and data is also useful for distributed address space machines. Our data transformation framework, consisting of the strip-mining and permuting primitives, is applicable to layout optimization for uniprocessors. Finally, our data transformation algorithm can also be applied to HPF programs. While HPF directives were originally intended for distributed address space machines, our algorithm uses the information to make the data accessed by each processor contiguous in the shared address space. In this way, the compiler achieves locality of reference, while taking advantage of the cache hardware to provide memory management and coherence functions.