This paper presents a fully automatic compiler that translates sequential code into efficient parallel code for shared address space machines. Here we address the memory hierarchy optimization problems that are specific to multiprocessors; the algorithm described in this paper can be followed by techniques that improve locality on uniprocessors [9, 13, 34]. Uniprocessor cache optimization techniques are outside the scope of this paper.
We have developed an integrated compiler algorithm that selects the proper loops to parallelize, assigns the computation to the processors, and changes the array data layout, all with the overall goal of improving memory subsystem performance. While loop transformation is relatively well understood, data transformation is not. In this paper, we show that various well-known data layouts can be derived as combinations of two simple primitives: strip-mining and permutation. Both of these transformations have direct analogs in the theory of loop transformations [6, 35].
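As a concrete illustration (not taken from the paper itself, which defines these primitives formally later), the sketch below shows how a blocked layout of a two-dimensional array can be expressed as a composition of the two primitives: strip-mine each index i into (i/B, i%B), then permute the resulting four dimensions so the block coordinates vary slowest. The array size N and block size B are arbitrary illustrative choices.

```c
#include <assert.h>

#define N 8   /* array dimension (illustrative) */
#define B 4   /* block size; assumed to divide N evenly */

/* Address of element (i, j) in the conventional row-major layout. */
int row_major(int i, int j) { return i * N + j; }

/* Blocked layout derived from the two primitives:
 * 1. strip-mine each index: i -> (i/B, i%B), j -> (j/B, j%B)
 * 2. permute the four resulting dimensions so the block indices
 *    are outermost: (i/B, j/B, i%B, j%B). */
int blocked(int i, int j) {
    int bi = i / B, ii = i % B;   /* strip-mine i */
    int bj = j / B, jj = j % B;   /* strip-mine j */
    /* permutation: block coordinates vary slowest */
    return ((bi * (N / B) + bj) * B + ii) * B + jj;
}
```

Under this layout the elements of each B-by-B block occupy a contiguous region of memory, so a processor assigned one block touches contiguous addresses rather than N-strided rows.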
The techniques described in this paper are implemented in the SUIF compiler system. Our compiler takes sequential C or FORTRAN programs as input and fully automatically generates optimized SPMD (Single Program Multiple Data) C code. We ran the compiler on a set of sequential FORTRAN programs and measured the performance of the parallelized code on the DASH multiprocessor. This paper also includes measurements and a performance analysis of these programs.
The rest of the paper is organized as follows. We first present an overview of our algorithm and the rationale behind its design in Section 2. We then describe the two main steps of our algorithm in Sections 3 and 4. We compare our approach to related work in Section 5. In Section 6 we evaluate the effectiveness of the algorithm by applying the compiler to a number of benchmarks. Finally, we present our conclusions in Section 7.