We now present an algorithm to find data and
computation partitions that may have pipelined communication.
Our algorithm first tries to apply the partition algorithm
as specified in Section 4.3, considering only the
parallelism available in the outermost **forall** loops.
That is, it tries to find a solution in which every parallelizable
loop has parallelism, and there is
neither reorganization nor pipelined communication.
If no such solution can be found,
the compiler then tries to exploit **doacross** parallelism.
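This two-phase strategy can be sketched as follows; this is a minimal illustrative driver, and the function names (`partition_without_pipelining`, `partition_with_doacross`) and the per-nest flags are hypothetical stand-ins, not the paper's actual procedures:

```python
# Hypothetical sketch of the two-phase partitioning strategy described above.
# Each loop nest is modeled only by whether its outermost forall parallelism
# can be exploited without communication; the real algorithm works on the
# computation and data partition spaces of Section 4.3.

def partition_without_pipelining(loop_nests):
    """Stand-in for the Section 4.3 algorithm restricted to forall loops.
    Succeeds only if every nest has exploitable outer forall parallelism."""
    if all(nest["has_outer_forall"] for nest in loop_nests):
        return {"pipelined": False, "nests": loop_nests}
    return None  # no reorganization-free, pipelining-free solution exists

def partition_with_doacross(loop_nests):
    """Stand-in for the reapplied algorithm that tiles fully permutable
    nests and exploits doacross (pipelined) parallelism."""
    return {"pipelined": True, "nests": loop_nests}

def find_partitions(loop_nests):
    # Phase 1: forall parallelism only, no pipelined communication.
    solution = partition_without_pipelining(loop_nests)
    if solution is not None:
        return solution
    # Phase 2: fall back to doacross parallelism.
    return partition_with_doacross(loop_nests)

# An ADI-like case: neither nest has communication-free forall parallelism,
# so the driver falls back to doacross parallelism.
adi = [{"has_outer_forall": False}, {"has_outer_forall": False}]
print(find_partitions(adi)["pipelined"])  # True
```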

Recall that the local phase of our compiler transforms each loop nest
such that the largest possible fully permutable loop nests are outermost.
Also, within each fully permutable nest, any **forall**
loops are positioned outermost.
A loop nest that is fully permutable can also be fully
tiled [18, 38].
If the dependence vectors in the fully permutable loop nest
are all distance vectors, then the pipelined communication is
inexpensive because only the data elements
at the block boundaries need to move.
Otherwise,
the cost of the communication within the
loop must be weighed against the cost of reorganization communication
between the loops.
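The distinction above hinges on whether every dependence is a constant distance. A minimal sketch of that test, assuming dependence vectors are represented as tuples whose components are either integer distances or direction strings (such as `'+'` for an unknown positive distance); this representation is illustrative, not the compiler's:

```python
# Pipelined communication across tile boundaries is cheap only when every
# component of every dependence vector is a constant (integer) distance;
# a direction component means the dependence may span arbitrarily many
# iterations, so more than the boundary elements could move.

def is_distance_vector(dep):
    """True if every component of the dependence vector is a constant."""
    return all(isinstance(c, int) for c in dep)

def pipelining_is_cheap(dependences):
    """Only block-boundary elements move when all deps are distances."""
    return all(is_distance_vector(d) for d in dependences)

print(pipelining_is_cheap([(1, 0), (0, 1)]))  # True: constant distances
print(pipelining_is_cheap([(1, '+')]))        # False: direction component
```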

**Figure 4:** Algorithm for calculating partitions with blocks.

An overview of the algorithm is shown in Figure 4.
Once the algorithm determines that a solution with neither
reorganization nor pipelined communication (and
with at least one degree of parallelism) cannot be found,
it recalculates the computation and data partitions.
The partition algorithm in Figure 2 is reapplied --
the only change is in the single loop constraint used to initialize the
computation partitions.
Any dimensions that can be tiled are not considered in the
initial computation partitions.
Thus the initial computation partition for a loop nest of depth **l**
is again the span of a set of **l**-dimensional elementary basis vectors.
If a loop at nesting level **k** is sequential *and*
cannot be tiled, then the corresponding elementary basis vector is included
in the initial computation partition.
The multiple array constraints are used as before to initialize
the data partition.
The iterative partition algorithm is then run to find the
data and computation partitions.
The final partitions represent the computation and data
that must be allocated to the same processor.
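The changed single-loop constraint can be sketched as follows. This is a minimal illustration under an assumed representation (basis vectors as 0/1 tuples, per-level `sequential` and `tilable` flags); the helper names are hypothetical, not the paper's:

```python
# Sketch of the modified initialization: dimensions that can be tiled are
# left out of the initial computation partition. A loop at nesting level k
# contributes the elementary basis vector e_k only if it is sequential AND
# cannot be tiled.

def elementary_basis_vector(k, depth):
    """e_k in an l-dimensional iteration space (depth = l)."""
    return tuple(1 if i == k else 0 for i in range(depth))

def initial_computation_partition(loops):
    """loops: one dict per nesting level, outermost first, with
    'sequential' and 'tilable' flags."""
    depth = len(loops)
    return [elementary_basis_vector(k, depth)
            for k, loop in enumerate(loops)
            if loop["sequential"] and not loop["tilable"]]

# A depth-3 nest: level 0 is sequential and untilable, level 1 is a forall,
# level 2 is sequential but tilable -- only e_0 enters the partition.
nest = [{"sequential": True,  "tilable": False},
        {"sequential": False, "tilable": True},
        {"sequential": True,  "tilable": True}]
print(initial_computation_partition(nest))  # [(1, 0, 0)]
```

Note that a fully tileable nest (every level tilable) yields an empty initial partition, leaving the iterative algorithm free to tile every dimension.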

We now need to determine which iterations and array elements are either completely local to a processor or blocked. The vector spaces describing them have already been calculated: when the partition algorithm was called to find a solution with no pipelined communication, the resulting computation partition is exactly this vector space for each loop nest, and the resulting data partition is exactly this vector space for each array. Once the partitions have been found, an orientation and a displacement are calculated as discussed in Sections 4.4 and 4.5. When the iteration and data spaces are blocked, the orientations and displacements are found for entire blocks.

In the ADI example above, both loop nests are fully permutable and can
be completely tiled. When the compiler discovers that the **forall**
parallelism cannot be exploited without communication, it tries to
exploit the **doacross** parallelism in these loops.
Since every dimension of these nests can be tiled, the initial computation
partitions and the initial data partition contain no basis vectors.
Running the iterative partition algorithm does not change the partitions.
Since the final partitions constrain no dimension, the iteration and data
spaces are completely tiled.

Note that the algorithm yields a solution that allows the entire iteration and data space to be tiled, without over-constraining the partitions. This solution can have idle processors, as was shown in Figure 3(b) above. The optimizations our compiler uses to reduce idle processors are described in Section 7.1.

When we exploit **doacross** parallelism,
it is possible that
different processors will write to the same array location within a loop nest.
Thus,
there is no single static data decomposition for the array at
that loop nest.
In these cases, however, the amount of data that must be communicated
is small with respect to the amount of computation.
A code generator for a shared address space machine does not need
to know exactly which processor has the current value of the data.
Our code generator for distributed address space machines uses
data-flow analysis on individual array accesses
to find efficient communication when the data moves
within a loop nest [2].

Fri Apr 7 14:39:58 PDT 1995