In this section we discuss the problem of finding data and computation decompositions that require only pipelined communication and no data reorganization communication.
The previous section considered only the parallelism available in forall loops. However, it may not be possible to legally transform the iteration space so that there are outermost forall loops. For example, consider the four-point difference operation:
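The original listing does not survive in this excerpt; a minimal C sketch of such a four-point difference (the function name, array size, and in-place Gauss-Seidel-style update are assumptions) might look like:

```c
#define N 6

/* Four-point difference, updated in place: iteration (i,j) reads
 * X[i-1][j] and X[i][j-1], which were already written earlier in the
 * same sweep, so neither the i loop nor the j loop is a forall.
 * Iterations on the same antidiagonal i + j = const are independent,
 * so parallelism is available only along that wavefront. */
void four_point(double X[N][N]) {
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            X[i][j] = 0.25 * (X[i-1][j] + X[i+1][j] +
                              X[i][j-1] + X[i][j+1]);
}
```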
Here the parallelism is only available along a wavefront, or diagonal, of the original loop nest. Tiling[37,39] (also known as blocking, unroll-and-jam and stripmine-and-interchange) is a well-known transformation that allows both parallelism and locality to be exploited within a loop nest.
The original iteration space is shown in Figure 3(a) and Figure 3(b) demonstrates how this loop can be executed in parallel using doacross parallelism. The iterations in each shaded block are assigned to different processors. The computation proceeds along the wavefront dynamically, by using explicit synchronization to enforce the dependences between the blocks.
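The doacross schedule of Figure 3(b) can be illustrated with a sequential simulation (a sketch, not the paper's code): blocks on the same antidiagonal of the block grid carry no dependences between them, so each wavefront of blocks could run on different processors, with synchronization replacing the sequential outer loop used here.

```c
#define N  8
#define B  2                 /* block (tile) size */
#define NB (N / B)           /* number of blocks per dimension */

/* Update one B x B block of the in-place four-point difference. */
static void do_block(double X[N][N], int bi, int bj) {
    for (int i = bi * B; i < (bi + 1) * B; i++)
        for (int j = bj * B; j < (bj + 1) * B; j++)
            if (i > 0 && i < N - 1 && j > 0 && j < N - 1)
                X[i][j] = 0.25 * (X[i-1][j] + X[i+1][j] +
                                  X[i][j-1] + X[i][j+1]);
}

/* Doacross schedule: block (bi,bj) depends only on blocks with a
 * smaller bi + bj, so all blocks on wavefront w = bi + bj could
 * execute in parallel once wavefront w - 1 has completed.  Here the
 * wavefronts are simulated sequentially. */
void wavefront(double X[N][N]) {
    for (int w = 0; w <= 2 * (NB - 1); w++)
        for (int bi = 0; bi < NB; bi++) {
            int bj = w - bi;
            if (bj >= 0 && bj < NB)
                do_block(X, bi, bj);
        }
}
```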
Figure 3: (a) Original iteration space. (b)--(d) Iteration spaces showing the parallel execution of tiled loops. The arrows represent data dependences.
When all dimensions of the iteration space are blocked, there will be idle processors as only blocks along the diagonal can execute in parallel. We can gain the advantages of tiling without idle processors by assigning entire rows (Figure 3(c)) or columns (Figure 3(d)) to different processors. In these two cases, each processor is assigned a strip of the iteration space, and all processors can start executing in parallel. For example, the tiled code that corresponds to Figure 3(d) is as follows:
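The tiled listing referenced above is missing from this excerpt; the following C sketch (names and tile width are assumptions) shows the column-strip tiling of Figure 3(d) next to the untiled sweep. Each `jj` strip would be owned by one processor; processor p can begin row i of its strip as soon as its left neighbor finishes row i, so all processors start almost immediately and execute in a pipeline.

```c
#define N 8
#define B 2   /* strip width: each processor owns a B-column strip */

/* Untiled four-point difference sweep. */
void sweep(double X[N][N]) {
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            X[i][j] = 0.25 * (X[i-1][j] + X[i+1][j] +
                              X[i][j-1] + X[i][j+1]);
}

/* Tiled version corresponding to Figure 3(d): the j loop is split
 * into an outer jj loop over column strips and an inner j loop
 * within a strip.  Assigning each strip to a different processor
 * gives doacross parallelism pipelined along i across the strips. */
void sweep_tiled(double X[N][N]) {
    for (int jj = 1; jj < N - 1; jj += B)     /* one strip per processor */
        for (int i = 1; i < N - 1; i++)       /* pipelined across strips */
            for (int j = jj; j < jj + B && j < N - 1; j++)
                X[i][j] = 0.25 * (X[i-1][j] + X[i+1][j] +
                                  X[i][j-1] + X[i][j+1]);
}
```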
When this loop nest is tiled, the original loop is split into two: an outer loop that steps between tiles and an inner loop that iterates within a tile. Allocating each shaded strip from Figure 3(d) to a different processor spreads iterations of the outer tiled loop across processors, while iterations of the inner loop reside on the same processor.
Tiling can also be used to reduce communication across loop nests, even when forall parallelism is available in both nests. Consider the following example of an ADI (Alternating Direction Implicit) integration:
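The ADI listing is not reproduced in this excerpt; the sketch below uses a generic first-order recurrence as a stand-in for the actual ADI sweeps (the function names and update formula are assumptions). What matters for the discussion is the dependence pattern: the first nest is sequential in j, the second in i, and each nest's other loop is a forall.

```c
#define N 8

/* Nest 1: the sequential j loop carries the dependence of X[i][j]
 * on X[i][j-1]; each j iteration accesses columns j and j-1 of X.
 * The i loop is a forall. */
void adi_sweep1(double X[N][N], double A[N][N]) {
    for (int j = 1; j < N; j++)        /* sequential */
        for (int i = 0; i < N; i++)    /* forall */
            X[i][j] = X[i][j] + X[i][j-1] * A[i][j];
}

/* Nest 2: the sequential i loop carries the dependence of X[i][j]
 * on X[i-1][j]; each i iteration accesses rows i and i-1 of X.
 * The j loop is a forall. */
void adi_sweep2(double X[N][N], double A[N][N]) {
    for (int i = 1; i < N; i++)        /* sequential */
        for (int j = 0; j < N; j++)    /* forall */
            X[i][j] = X[i][j] + X[i-1][j] * A[i][j];
}
```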
In the first loop nest, the sequential loop accesses columns of X. In the second loop nest, the sequential loop accesses rows of X. For there to be no communication, the data partition must allocate the entire array X on a single processor, and the computation partition must assign every iteration of both loop nests to that same processor; that is, both loops run sequentially. Considering only the parallelism available in the forall loops therefore leaves two options: either run the loops sequentially, or incur data reorganization communication between the two loop nests.
However, if the compiler tiles both loops to extract wavefront parallelism, then the reorganization communication is reduced to inexpensive pipelined communication. A tiled version of the ADI code is shown below.
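Since the tiled listing is also missing from this excerpt, here is a C sketch of a tiled version consistent with the text (names and the stand-in update formula are assumptions, as in the untiled sketch). In both nests the columns of X are blocked into strips of width B, one strip per processor.

```c
#define N 8
#define B 2   /* column-strip width: strip jj lives on processor jj/B */

/* Tiled nest 1: the j loop is split into an outer jj loop over
 * column strips and an inner j loop within the strip.  The
 * dependence on X[i][j-1] crosses a strip boundary only at j == jj,
 * so it becomes pipelined communication between neighbor strips. */
void adi_sweep1_tiled(double X[N][N], double A[N][N]) {
    for (int jj = 0; jj < N; jj += B)
        for (int j = (jj == 0 ? 1 : jj); j < jj + B; j++)
            for (int i = 0; i < N; i++)
                X[i][j] = X[i][j] + X[i][j-1] * A[i][j];
}

/* Tiled nest 2: the dependence on X[i-1][j] stays inside a column
 * strip, so no communication is needed in this nest. */
void adi_sweep2_tiled(double X[N][N], double A[N][N]) {
    for (int jj = 0; jj < N; jj += B)
        for (int i = 1; i < N; i++)
            for (int j = jj; j < jj + B; j++)
                X[i][j] = X[i][j] + X[i-1][j] * A[i][j];
}
```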
In both loop nests, the outer tiled loop is distributed across processors, and the inner loops are executed on the same processor. Therefore, each processor is assigned a block of columns of the array. In the first loop nest, there are dependences across the blocks, so there is pipelined communication within the loop nest. In the second loop nest, the data dependences stay within a block, so no communication is necessary.