As discussed in Section 1.1, memory hierarchy considerations permeate many aspects of the compilation process: how to parallelize the code, how to distribute the parallel computation across the processors and how to lay out the arrays in memory. By analyzing this complex problem carefully, we are able to partition the problem into the following two subproblems.
Our first goal is to find a parallelization scheme that incurs minimum synchronization and communication cost, without regard to the original layout of the data structures. This step of our algorithm determines how to assign the parallel computation to processors, and also what data are used by each processor. In this paper, we will refer to the former as computation decomposition and the latter as data decomposition.
We observe that efficient hand-parallelized codes synchronize infrequently and have little true sharing between processors[27]. Thus our algorithm is designed to find a parallelization scheme that requires no communication whenever possible. Our analysis starts with the model that all loop iterations and array elements are distributed. As the analysis proceeds, the algorithm groups together computation that uses the same data. Communication is introduced at the least executed parts of the code only to avoid collapsing all parallelizable computation onto one processor. In this way, the algorithm groups the computation and data into the largest possible number of (mostly) disjoint sets. By assigning one or more such sets of data and computation to processors, the compiler finds a parallelization scheme, a computation decomposition and a data decomposition that minimize communication.
While the data decompositions generated in the above step have traditionally been used to manage data on distributed address space machines, this information is also very useful for shared address space machines. Once the code has been parallelized in the above manner, it is clear that we need only to improve locality among accesses to each set of data assigned to a processor. As discussed in Section 1.1, multiprocessors have especially poor cache performances because the processors often operate on disconnected regions of the address space. We can change this characteristic by making the data accessed by each processor contiguous, using the data decomposition information generated in the first step.
The algorithm we developed for this subproblem is rather straightforward. The algorithm derives the desired data layout by systematically applying two simple data transforms: strip-mining and permutation. The algorithm also applies several novel optimizations to improve the address calculations for transformed arrays.