ADI integration is a stencil computation used for solving partial differential equations. The computation in ADI has two phases --- the first phase sweeps along the columns of the arrays and the second phase sweeps along the rows. Two representative loops of the code are shown in Figure 9. Figure 10 shows the speedups for each version of ADI Integration on two different data set sizes ( and ).
Figure 9: ADI Integration Code
Figure 10: ADI Integration Speedups
Given that the base compiler analyzes each loop separately, it makes the logical decision to parallelize and distribute first the column sweeps, then the row sweeps across the processors. This base compiler retains the spatial locality in each of the loop nests, and inserts only one barrier at the end of each two-deep loop nest. Unfortunately, this means each processor accesses very different data in different parts of the algorithm. Furthermore, while data accessed by a processor in column sweeps are contiguous, the data accessed in row sweeps are distributed across the address space. As a result of the high miss rates and high memory access costs, the performance of the base version of ADI is rather poor.
By analyzing across the loops in the program, the computation decomposition algorithm finds a static block column-wise distribution. This version of the program exploits doall parallelism in the first phase of ADI, switching to doall/pipeline parallelism the second half of the computation to minimize true-sharing communication[4,18]. Loops enclosed within the doacross loop are tiled to increase the granularity of pipelining, thus reducing synchronization overhead. The optimized version of ADI achieves a speedup of 23 on 32 processors. Since each processor's data are already contiguous, no data transformations are needed for this example.