Global Optimizations for Parallelism and Locality on Scalable Parallel Machines

Jennifer M. Anderson and Monica S. Lam
Computer Systems Laboratory
Stanford University, Stanford, CA 94305

This research was supported in part by DARPA contract N00039-91-C-0138, an NSF Young Investigator Award and a fellowship from Digital Equipment Corporation's Western Research Lab.

In Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation (PLDI), Albuquerque, New Mexico, June 23--25, 1993


Data locality is critical to achieving high performance on large-scale parallel machines. Non-local data accesses result in communication that can greatly impact performance. Thus the mapping, or decomposition, of the computation and data onto the processors of a scalable parallel machine is a key issue in compiling programs for these architectures.

This paper describes a compiler algorithm that automatically finds computation and data decompositions that optimize both parallelism and locality. This algorithm is designed for use with both distributed and shared address space machines. The scope of our algorithm is dense matrix computations where the array accesses are affine functions of the loop indices. Our algorithm can handle programs with general nestings of parallel and sequential loops.

We present a mathematical framework that enables us to derive the decompositions systematically. Our algorithm can exploit parallelism both in fully parallelizable loops and in loops that require explicit synchronization. The algorithm will trade off extra degrees of parallelism to eliminate communication. If communication is needed, the algorithm will try to introduce the least expensive forms of communication into those parts of the program that are least frequently executed.