Consider the code shown at the top of Figure 1. In the figure, the array elements at and are shaded light grey and dark grey, respectively, to identify the position of the arrays. The loop iterations are shaded similarly. The figure assumes that arrays are stored in row-major order. A naive approach that considers each loop nest individually would distribute the outermost loop in the first loop nest, to get the coarsest granularity of parallelism. Each processor then accesses rows of arrays X and Y. In the second loop nest, the parallel loop would be distributed, and each processor accesses columns of array Y and rows of array Z. Communication will occur, since a processor accesses a different section of array Y in each of the two loop nests. A solution that has no communication is to only parallelize the loop in the first loop nest. Each processor would then access columns of X, rows of Z, and columns of Y in both loop nests. The loop nests must also be analyzed to determine the relative positions of the arrays so that no communication is necessary. A complete communication-free decomposition is shown in Figure 1(c). The rest of the figure shows the mathematical representation of the decompositions and will be discussed in later sections.
Figure 1: A simple decomposition example. Squares represent array elements and circles represent iterations. Lines connect the array elements and iterations that are allocated to the same processor.