Next: Dynamic Decompositions Up: Blocked Decompositions Previous: Blocked Decomposition Model

## Calculating Blocked Decompositions

We now present an algorithm to find data and computation partitions that may have pipelined communication. Our algorithm first tries to apply the partition algorithm as specified in section 4.3, considering only the parallelism available in the outermost forall loops. This will try to find a solution such that every parallelizable loop has parallelism, and there is neither reorganization nor pipelined communication. If such a solution cannot be found, the compiler then tries to exploit doacross parallelism.

Recall that the local phase of our compiler transforms each loop nest such that the largest possible fully permutable loop nests are outermost. Also within each fully permutable nest, any forall loops are positioned outermost. A loop nest that is fully permutable can also be fully tiled[18,38]. If the dependence vectors in the fully permutable loop nest are all distance vectors, then the pipelined communication is inexpensive because only the data elements at the block boundaries need to move. Otherwise, the cost of the communication within the loop must be weighed against the cost of reorganization communication between the loops.

Figure 4: Algorithm for calculating partitions with blocks.

An overview of the algorithm is shown in Figure 4. Once the algorithm determines that a solution with neither reorganization nor pipelined communication (and with at least one degree of parallelism) cannot be found, it recalculates and . The partition algorithm in Figure 2 is reapplied -- the only change is in the single loop constraint used to initialize the computation partitions. Any dimensions that can be tiled are not considered in the initial computation partitions. Thus the initial computation partition for a loop nest of depth l is again the span of a set of l-dimensional elementary basis vectors. If a loop at nesting level k is sequential and cannot be tiled, then is included in the initial computation partition. The multiple array constraints are used as before to initialize the data partition . The iterative partition algorithm is then run to find the data and computation partitions. The final and represent the computation and data that must be allocated to the same processor.

We now need to find and to determine the iterations and array elements that are either completely local to a processor or blocked. Note that these vector spaces have already been calculated. When the partition algorithm was called to find a solution with no pipelined communication, the resulting is exactly the vector space for each loop nest. Similarly, the resulting is exactly the vector space for each array. Once the partitions have been found, then an orientation and displacement are calculated as discussed in sections 4.4 and 4.5. When the iteration and data spaces are the blocked, the orientations and displacements are found for entire blocks.

In the ADI example above, both loop nests are fully permutable and can be completely tiled. When the compiler discovers that the forall parallelism cannot be exploited without communication, it tries to exploit the doacross parallelism in these loops. The initial computation partitions are , and the initial data partition is . Running the iterative partition algorithm does not change the partitions. Since and , the spaces are completely tiled.

Note that the algorithm yields a solution that allows the entire iteration and data space to be tiled, and does not over-constrain the partitions unnecessarily. This solution can have idle processors, as was shown in Figure 3(b) above. The optimizations our compiler uses to reduce idle processors are described in Section 7.1.

When we exploit doacross parallelism, it is possible that different processors will write to the same array location within a loop nest. Thus, there is no single static data decomposition for the array at that loop nest. In these cases, however, the amount of data that must be communicated is small with respect to the amount of computation. A code generator for a shared address space machine does not need to know exactly which processor has the current value of the data. Our code generator for distributed address space machines uses data-flow analysis on individual array accesses to find efficient communication when the data moves within a loop nest[2].

Next: Dynamic Decompositions Up: Blocked Decompositions Previous: Blocked Decomposition Model

Jennifer-Ann M. Anderson
Fri Apr 7 14:39:58 PDT 1995