The first step of the algorithm is to analyze each loop nest individually and restructure the loop via unimodular transformations to expose the largest number of outermost parallelizable loops[35]. This is just a preprocessing step, and we do not make any decision on which loops to parallelize yet.
The second step attempts to find affine mappings of data and
computation onto the virtual processor space that maximize parallelism
without requiring any major communication. A necessary and sufficient
condition for a processor assignment to incur no communication cost is
that each processor must access only data that are local to the
processor. Let
be a function mapping iterations in loop nest
j to the processors, and let
be a function mapping elements of
array x to the processors. Let
be a reference to array x
in loop j. No communication is necessary iff
The algorithm tries to find decompositions that satisfy this equation. The objective is to maximize the rank of the linear transformations, as the rank corresponds to the degree of parallelism in the program.
Our algorithm tries to satisfy Equation 1 incrementally, starting with the constraints among the more frequently executed loops and finally to the less frequently executed loops. If it is not possible to find non-trivial decompositions that satisfy the equation, we introduce communication. Read-only and seldom-written data can be replicated; otherwise we generate different data decompositions for different parts of the program. This simple, greedy approach tends to place communication in the least executed sections of the code.
The third step is to map the virtual processor space onto the physical
processor space. At this point, the algorithm has mapped the
computation to an
virtual processor space, where d is the
degree of parallelism in the code, and n is the number of iterations
in the loops (assuming that the number of iterations is the same
across the loops). It is well known that parallelizing as many
dimensions of loops as possible tends to decrease the communication to
computation ratio. Thus, by default, our algorithm partitions the
virtual processor space into d-dimensional blocks and assigns a
block to each physical processor. If we observe that the computation
of an iteration in a parallelized loop either decreases or increases
with the iteration number, we choose a cyclic distribution scheme to enhance
the load balance. We choose a block-cyclic scheme only when pipelining
is used in parallelizing a loop and load balance is an issue.
Finally, we find the mapping of computation and data to the physical processors by composing the affine functions with the virtual processor folding function. Internally, the mappings are represented as a set of linear inequalities, even though we use the HPF notation here for expository purposes. The data decomposition is used by the data transformation phase, and the computation decomposition is used to generate the SPMD code.
The decomposition analysis must be performed across the entire program. This is a difficult problem since the compiler must map the decompositions across the procedure boundaries. The inter-procedural analysis must be able to handle array reshapes and array sections passed as parameters. Previously, our implementation was limited by procedure boundaries[4]. We have now implemented a prototype of an inter-procedural version of this algorithm that was used for the experiments described in Section 6.