Previous work on compiler algorithms for optimizing memory hierarchy performance has focused primarily on loop transformations. Unimodular loop transformations, loop fusion and loop nest blocking restructure computation to increase uniprocessor cache re-use[9,13,34]. Copying data into contiguous regions has been studied as a means for reducing cache interference[23,29].
Several researchers have proposed algorithms to transform computation and data layouts to improve memory system performance[10,20]. The same optimizations are intended to change the data access patterns to improve locality on both uniprocessors and shared address space multiprocessors. For the multiprocessor case, they assume that decision of which loops to parallelize has already been determined. In contrast, the algorithm described in this paper concentrates directly on multiprocessor memory system performance. We globally analyze the program and explicitly determine which loops to parallelize so that data are re-used by the same processor as much as possible. Our experimental results show that there is often a choice of parallelization across loop nests, and that this decision significantly impacts the performance of the resulting program.
The scope of the data transformations used in the previous work are array permutations only; they do not consider strip-mining. By using strip-mining in combination with permutation, our compiler is able to optimize spatial locality by making the data used by each processor contiguous in the shared address space. This means, for example, that our compiler can achieve good cache performance by creating cyclic and multi-dimensional blocked distributions.
Previous approaches use search-based algorithms to select a combination of data and computation transformations that result in good cache performance. Instead we partition the problem into two well-defined subproblems. The first step minimizes communication and synchronization without regard to the data layout. The second step then simply makes the data accessed by each processor contiguous. After the compiler performs the optimizations described in this paper, code optimizations for uniprocessor cache performance can then be applied.
Compile-time data transformations have also been used to eliminate false-sharing in explicitly parallel C code. The domain of that work is quite different from ours; we consider both data and computation transformations, and the code is parallelized automatically. Their compiler statically analyzes the program to determine the data accessed by each processor, and then try to group the data together. Two different transformations are used to aggregate the data. First, their compiler turn groups of vectors that are accessed by different processors into an array of structures. Each structure contains the aggregated data accessed by a single processor. Second, their compiler moves shared data into memory that is allocated local to each processor. References to the original data structures are replaced with pointers to the newly allocated per-processor data structures. Lastly, their compiler can also pad data structures that have no locality (e.g. locks) to avoid false-sharing.