Minimizing communication by increasing the locality of data references is an important optimization for achieving high performance on all large-scale parallel machines. The long message-passing overhead of multicomputer architectures, such as the Intel Touchstone, makes minimizing communication essential. Locality is also important to scalable machines that support a shared address space in hardware. For example, local cache accesses on the Stanford DASH shared-memory multiprocessor are two orders of magnitude faster than remote accesses. Improving locality can greatly enhance the performance of such machines.
The mapping of computation onto the processors of a parallel machine is termed the computation decomposition of the program. Similarly, the placement of data into the processors' local memories is called the data decomposition. This paper describes a compiler algorithm that automatically finds the computation and data decompositions that optimize both the parallelism and locality of a program. This algorithm is designed for use with both distributed and shared address space machines. For machines with a distributed address space, the compiler must follow this phase with a pass that maps the decomposition to explicit communication code. While it is not necessary to manage the memory directly for machines with a shared address space, many of the techniques used to manage data on distributed memory machines can be used to improve cache performance.
The choices of data and computation decomposition are inter-related; it is important to examine the opportunities for parallelism and the reuse of data to determine the decompositions. For example, if the only available parallelism in a computation lies in operating on different elements of an array simultaneously, then allocating those elements to the same processor renders the parallelism unusable. The data decomposition dictated by the available parallelism in one loop nest affects the decision of how to parallelize the next loop nest, and how to distribute the computation to minimize communication. It may be advantageous to abandon some parallelism to create larger granularity tasks if the communication cost overwhelms the benefit of parallelization.
A popular approach to this complex optimization problem is to solicit the programmer's help in determining the data decompositions. Projects using this approach include SUPERB, AL, ID Noveau, Kali, Vienna Fortran and Fortran D[14,33]. The current proposal for a High Performance Fortran extension to Fortran 90 also relies upon user-specified data decompositions. While these languages provide significant benefit to the programmer by eliminating the tedious job of managing the distributed memory explicitly, the programmer is still faced with a very difficult programming problem. The tight coupling between the mapping of data and computation means that the programmer must, in effect, also analyze the parallelization of the program when specifying the data decompositions. As the best decomposition may change based on the architecture of the machine, the programmer must fully master the machine details. Furthermore, the data decompositions may need to be modified to make the program run efficiently on a different architecture.
The goal of this research is to automatically derive the data and computation decompositions for the domain of dense matrix code where the loop bounds and array subscripts are affine functions of the loop indices and symbolic constants. Our algorithm finds decompositions for loops in which the number of iterations is much larger than the number of processors. The emphasis of this paper is on finding the first-order, or ``shape'', of the decompositions. We do not address issues such as load balancing, choosing the block size for a block-cyclic decomposition, determining the number of physical processors to lay out in each dimension, and fitting the computation and data to the exact number of physical processors. Even though these issues impact the performance of parallel machines, their effect is secondary and we do not address them in this paper.
We have developed a mathematical framework for expressing and calculating decompositions. This framework is general enough to handle a broad class of array access patterns, including array sections, and is also used to calculate the replication of read-only data. As there are many possible decompositions for a program, a systematic solution must successfully reduce this complex problem into a manageable one. Our model is based on the property that equivalent decompositions have the same data and computation allocated to a single processor. Once this aspect of the decomposition has been determined, we show that an assignment to specific processors can easily be calculated.
The cost of communication is determined by the data movement pattern. If the communication pattern is nearest-neighbor shifts of data, then the amount of data transferred can be significantly reduced by blocking. This form of communication is inexpensive compared to communication patterns that require general movement of the entire data structure (e.g. a transpose). We further differentiate between communication that occurs within a parallel loop with explicit synchronization, or across loops due to mismatches in decompositions. We call communication within a loop nest pipelined communication. Communication due to mismatches in decompositions, and that require moving the entire data structure, is called data reorganization communication. If a single data decomposition can be found for an array such that there is no reorganization communication in the program, then we consider that decomposition to be static (even though there may be some minor nearest-neighbor communication between parallel loop nests).
Section 2 briefly presents the background on optimizing parallelism and locality within a loop nest. We then introduce the issues involved in automatically calculating decompositions, and formulate the problem mathematically. Section 3 describes the components of a decomposition and gives an overview of our approach. To illustrate the basic ideas behind our decomposition model, we first discuss a simplified subproblem in Section 4. We present an algorithm that finds data and computation decompositions that have neither data reorganization nor pipelined communication. We then reapply the concepts to find decompositions with pipelined communication in Section 5. Section 6 uses the algorithms in Section 4 and 5 as building blocks to develop an algorithm that takes into account both pipelined and data reorganization communication. Section 7 presents additional techniques for handling replication, and for minimizing the number of idle processors and the amount of replication. We have implemented the algorithms described in this paper in the SUIF compiler at Stanford. Section 8 describes some experimental results using the compiler. Section 9 discusses related work, and we conclude in Section 10 with a summary of the contributions of this paper.