We now present an algorithm to find data and computation partitions that may have pipelined communication. Our algorithm first tries to apply the partition algorithm as specified in section 4.3, considering only the parallelism available in the outermost forall loops. This will try to find a solution such that every parallelizable loop has parallelism, and there is neither reorganization nor pipelined communication. If such a solution cannot be found, the compiler then tries to exploit doacross parallelism.
Recall that the local phase of our compiler transforms each loop nest such that the largest possible fully permutable loop nests are outermost. Also within each fully permutable nest, any forall loops are positioned outermost. A loop nest that is fully permutable can also be fully tiled[18,38]. If the dependence vectors in the fully permutable loop nest are all distance vectors, then the pipelined communication is inexpensive because only the data elements at the block boundaries need to move. Otherwise, the cost of the communication within the loop must be weighed against the cost of reorganization communication between the loops.
Figure 4: Algorithm for calculating partitions with blocks.
An overview of the algorithm is shown in Figure 4.
Once the algorithm determines that a solution with neither
reorganization nor pipelined communication (and
with at least one degree of parallelism) cannot be found,
it recalculates
and
.
The partition algorithm in Figure 2 is reapplied --
the only change is in the single loop constraint used to initialize the
computation partitions.
Any dimensions that can be tiled are not considered in the
initial computation partitions.
Thus the initial computation partition for a loop nest of depth l
is again the span of a set of l-dimensional elementary basis vectors.
If a loop at nesting level k is sequential and
cannot be tiled, then
is included in the initial computation partition.
The multiple array constraints are used as before to initialize
the data partition
.
The iterative partition algorithm is then run to find the
data and computation partitions.
The final
and
represent the computation and data
that must be allocated to the same processor.
We now need to find
and
to determine the
iterations and array elements that are either
completely local to a processor or blocked.
Note that these vector spaces have already been calculated.
When the partition algorithm was called to find a solution with no
pipelined communication, the resulting
is
exactly the vector space
for each loop nest.
Similarly, the resulting
is exactly the vector space
for each array.
Once the partitions have been found, then an orientation and displacement
are calculated as discussed in
sections 4.4 and 4.5.
When the iteration and data spaces are the blocked, the orientations and
displacements are found for entire blocks.
In the ADI example above, both loop nests are fully permutable and can
be completely tiled. When the compiler discovers that the forall
parallelism cannot be exploited without communication, it tries to
exploit the doacross parallelism in these loops.
The initial computation partitions are
,
and the initial data partition is
.
Running the iterative partition algorithm does not change the partitions.
Since
and
, the spaces are completely tiled.
Note that the algorithm yields a solution that allows the entire iteration and data space to be tiled, and does not over-constrain the partitions unnecessarily. This solution can have idle processors, as was shown in Figure 3(b) above. The optimizations our compiler uses to reduce idle processors are described in Section 7.1.
When we exploit doacross parallelism, it is possible that different processors will write to the same array location within a loop nest. Thus, there is no single static data decomposition for the array at that loop nest. In these cases, however, the amount of data that must be communicated is small with respect to the amount of computation. A code generator for a shared address space machine does not need to know exactly which processor has the current value of the data. Our code generator for distributed address space machines uses data-flow analysis on individual array accesses to find efficient communication when the data moves within a loop nest[2].