Given the data decompositions calculated by the computation decomposition phase, there are many equivalent memory layouts that make each processor's data contiguous in the shared address space. Consider the example in Figure 1. Each processor accesses a two-dimensional block of data. There remains the question of whether the elements within the blocks, and the blocks themselves, should be organized in a column-major or row-major order. Our current implementation simply retains the original data layout as much as possible. That is, all the data accessed by the same processor maintain the original relative ordering. We expect this compilation phase to be followed by another algorithm that analyzes the computation executed by each processor and improves the cache performance by reordering data and operations on each processor.
Since general affine decompositions rarely occur in practice, and the corresponding data transformations would result in complex array access functions, our current implementation requires that each processor dimension be mapped from only a single array dimension.
Our algorithm applies the following procedure to each distributed array dimension. First, we strip-mine the dimension according to the type of distribution, splitting it into a processor-identifying dimension and a dimension that enumerates the elements local to each processor.
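As a minimal sketch, the standard strip-mining transformations for BLOCK and CYCLIC distributions can be written as follows (the function names and parameters are illustrative, not from our implementation):

```python
def strip_mine_block(i, b):
    """BLOCK(b): split index i into (processor-identifying dim, offset within block)."""
    return (i // b, i % b)

def strip_mine_cyclic(i, p):
    """CYCLIC over p processors: split i into (processor-identifying dim, cycle number)."""
    return (i % p, i // p)

# An 8-element dimension, BLOCK-distributed over 2 processors (block size b = 4):
print([strip_mine_block(i, 4) for i in range(8)])
# -> [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2), (1, 3)]

# The same dimension, CYCLIC-distributed over 2 processors:
print([strip_mine_cyclic(i, 2) for i in range(8)])
# -> [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2), (0, 3), (1, 3)]
```

Note that strip-mining alone is only a renaming of indices; the layout changes when the processor-identifying dimension is subsequently permuted to the rightmost position.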
Figure 3 shows how our algorithm restructures several two-dimensional arrays according to the specified distribution. Figure 3(a) shows the intermediate array index calculations corresponding to the strip-mining step of the algorithm. Figure 3(b) shows the final array indices obtained by permuting the processor-defining dimension to the rightmost position of the array index function. Figure 3(c) shows an example of the new array layout in memory, and Figure 3(d) shows the dimensions of the restructured array. The figure highlights the set of data belonging to the same processor; the new array indices and the linear addresses in the upper right corner indicate that these data are contiguous in the address space.
Figure 3: Changing data layouts: (a) strip-mined array indices, (b) final array indices, (c) new indices in restructured array and (d) array bounds.
This technique can be repeated for every distributed dimension of a multi-dimensional array. Each of the distributed dimensions contributes one processor-identifying dimension that is moved to the rightmost position. As discussed before, it does not matter how we order these higher dimensions. By not permuting those dimensions that do not identify processors, we retain the original relative ordering among data accessed by the same processor.
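To illustrate, the following sketch applies the two steps to a two-dimensional array with a (BLOCK, BLOCK) distribution: each dimension is strip-mined, and both processor-identifying dimensions are moved to the rightmost (most significant) position, while the local dimensions keep their original relative ordering. The names, the block sizes, and the assumption of an originally row-major layout are ours, for illustration only:

```python
def restructured_address(i, j, b1, b2, procs2):
    """New linear address of element (i, j) after restructuring.

    b1, b2 -- block sizes of the two BLOCK-distributed dimensions
    procs2 -- number of processors along the second dimension
    """
    p1, o1 = i // b1, i % b1        # strip-mine dimension 0
    p2, o2 = j // b2, j % b2        # strip-mine dimension 1
    proc = p1 * procs2 + p2          # processor-identifying dims, rightmost
    local = o1 * b2 + o2             # original (row-major) ordering preserved
    return proc * (b1 * b2) + local

# A 4x4 array, 2x2 blocks, 2x2 processors: each processor's block
# occupies a contiguous range of 4 addresses.
for i in range(4):
    print([restructured_address(i, j, 2, 2, 2) for j in range(4)])
```

Each processor's addresses form a contiguous range (0-3, 4-7, 8-11, 12-15), which is exactly the contiguity property the transformation is designed to achieve.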
Finally, we make one minor local optimization. If the highest dimension of the array is distributed as BLOCK, no permutation is necessary, since the processor dimension is already in the rightmost position; nor is strip-mining necessary, since, as discussed above, strip-mining on its own does not change the data layout.
Note that HPF statements can also be used as input to the data transformation algorithm. If an array is aligned to a template which is then distributed, we must find the equivalent distribution on the array directly. We use the alignment function to map from the distributed template dimensions back to the corresponding array dimensions. Any offsets in the alignment statement are ignored.