The code for our next example, a five-point stencil, is shown in Figure 7. Figure 8 shows the resulting speedups for each version of the code. The base compiler simply distributes the outermost parallel loop across the processors, and each processor updates a block of array columns. The values of the boundary elements are exchanged in each time step.
The computation decomposition algorithm assigns two-dimensional blocks to each processor, since this mapping has a better computation-to-communication ratio than a one-dimensional mapping. However, without a corresponding change to the data layout, performance is worse than the base version, because each processor's partition is now non-contiguous in memory. (In Figure 8, the number of processors in each of the two dimensions is shown under the total number of processors.)
After the data transformation is applied, the program has good spatial locality as well as reduced communication, and we achieve a speedup of 29 on 32 processors. Note that performance is quite sensitive to the number of processors: each DASH cluster has 4 processors, and the amount of inter-cluster communication differs significantly across two-dimensional mappings.
Figure 7: Five-Point Stencil Code
Figure 8: Five-Point Stencil Speedups