The first step of the algorithm is to analyze each loop nest individually and restructure the loop via unimodular transformations to expose the largest number of outermost parallelizable loops. This is just a preprocessing step, and we do not make any decision on which loops to parallelize yet.
The second step attempts to find affine mappings of data and computation onto the virtual processor space that maximize parallelism without requiring any major communication. A necessary and sufficient condition for a processor assignment to incur no communication cost is that each processor must access only data that are local to the processor. Let be a function mapping iterations in loop nest j to the processors, and let be a function mapping elements of array x to the processors. Let be a reference to array x in loop j. No communication is necessary iff
The algorithm tries to find decompositions that satisfy this equation. The objective is to maximize the rank of the linear transformations, as the rank corresponds to the degree of parallelism in the program.
Our algorithm tries to satisfy Equation 1 incrementally, starting with the constraints among the more frequently executed loops and finally to the less frequently executed loops. If it is not possible to find non-trivial decompositions that satisfy the equation, we introduce communication. Read-only and seldom-written data can be replicated; otherwise we generate different data decompositions for different parts of the program. This simple, greedy approach tends to place communication in the least executed sections of the code.
The third step is to map the virtual processor space onto the physical processor space. At this point, the algorithm has mapped the computation to an virtual processor space, where d is the degree of parallelism in the code, and n is the number of iterations in the loops (assuming that the number of iterations is the same across the loops). It is well known that parallelizing as many dimensions of loops as possible tends to decrease the communication to computation ratio. Thus, by default, our algorithm partitions the virtual processor space into d-dimensional blocks and assigns a block to each physical processor. If we observe that the computation of an iteration in a parallelized loop either decreases or increases with the iteration number, we choose a cyclic distribution scheme to enhance the load balance. We choose a block-cyclic scheme only when pipelining is used in parallelizing a loop and load balance is an issue.
Finally, we find the mapping of computation and data to the physical processors by composing the affine functions with the virtual processor folding function. Internally, the mappings are represented as a set of linear inequalities, even though we use the HPF notation here for expository purposes. The data decomposition is used by the data transformation phase, and the computation decomposition is used to generate the SPMD code.
The decomposition analysis must be performed across the entire program. This is a difficult problem since the compiler must map the decompositions across the procedure boundaries. The inter-procedural analysis must be able to handle array reshapes and array sections passed as parameters. Previously, our implementation was limited by procedure boundaries. We have now implemented a prototype of an inter-procedural version of this algorithm that was used for the experiments described in Section 6.