Given the data decompositions calculated by the computation
decomposition phase, there are many equivalent memory layouts that
make each processor's data contiguous in the shared address space.
Consider the example in Figure 1. Each processor
accesses a two-dimensional block of data. There remains the question
of whether the elements within the blocks, and the blocks themselves,
should be organized in a column-major or row-major order. Our current
implementation simply retains the original data layout as much as
possible. That is, all the data accessed by the same processor
maintain the original relative ordering. We expect this compilation
phase to be followed by another algorithm that analyzes the
computation executed by *each* processor and improves the cache
performance by reordering data and operations on each processor.

Since general affine decompositions rarely occur in practice, and the corresponding data transformations would result in complex array access functions, our current implementation requires that only a single array dimension be mapped to each processor dimension.

Our algorithm applies the following procedure to each distributed array dimension. First, we apply strip-mining according to the type of distribution:

- **BLOCK.** Strip-mine the dimension with strip size ⌈**d**/**P**⌉, where **d** is the size of the dimension and **P** is the number of processors. The identifier of the processor owning the data is specified by the second of the strip-mined dimensions.
- **CYCLIC.** Strip-mine the dimension with strip size **P**, where **P** is the number of processors. The identifier of the processor owning the data is specified by the first of the strip-mined dimensions.
- **BLOCK-CYCLIC.** First strip-mine the dimension with the given block size **b** as the strip size; then further strip-mine the second of the strip-mined dimensions with strip size **P**, where **P** is the number of processors. The identifier of the processor owning the data is specified by the middle of the strip-mined dimensions.
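The three strip-mining rules above can be written as simple index computations. The sketch below assumes the first strip-mined dimension holds the remainder (fastest-varying) and the second holds the quotient, which matches the owner assignments stated for each distribution; the function names are ours.

```python
import math

def block_indices(i, d, P):
    """BLOCK: strip size b = ceil(d/P); i -> (i mod b, i div b).
    The second (quotient) index identifies the owning processor."""
    b = math.ceil(d / P)
    return (i % b, i // b)

def cyclic_indices(i, P):
    """CYCLIC: strip size P; i -> (i mod P, i div P).
    The first (remainder) index identifies the owning processor."""
    return (i % P, i // P)

def block_cyclic_indices(i, b, P):
    """BLOCK-CYCLIC with block size b: strip-mine with b, then
    strip-mine the quotient with strip size P.
    The middle index identifies the owning processor."""
    return (i % b, (i // b) % P, (i // b) // P)
```

For example, with 8 elements over 2 processors, BLOCK maps index 5 to within-block offset 1 on processor 1, while CYCLIC maps it to processor 1 as the 3rd element of that processor's strip.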

Figure 3 shows how our algorithm restructures several two-dimensional arrays according to the specified distribution. Figure 3(a) shows the intermediate array index calculations corresponding to the strip-mining step of the algorithm. Figure 3(b) shows the final array indices obtained by permuting the processor-defining dimension to the rightmost position of the array index function. Figure 3(c) shows an example of the new array layout in memory and Figure 3(d) shows the new dimensions of the restructured array. The figure highlights the set of data belonging to the same processor; the new array indices and the linear addresses at the upper right corner indicate that they are contiguous in the address space.

**Figure 3:** Changing data layouts: (a) strip-mined array indices, (b) final array indices, (c) new indices in restructured array, and (d) array bounds.

This technique can be repeated for every distributed dimension of a multi-dimensional array. Each of the distributed dimensions contributes one processor-identifying dimension that is moved to the rightmost position. As discussed before, it does not matter how we order these higher dimensions. By not permuting those dimensions that do not identify processors, we retain the original relative ordering among data accessed by the same processor.
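As a concrete illustration of the multi-dimensional case, the sketch below computes the new linear address for a two-dimensional array with a (BLOCK, BLOCK) distribution, as in Figure 1. It assumes column-major storage, so placing the two processor-identifying dimensions in the slowest-varying (rightmost) positions makes each processor's data contiguous; the function name and the particular ordering of the two processor dimensions are ours.

```python
import math

def restructured_address(i, j, d1, d2, P1, P2):
    """Map original element (i, j) of a d1-by-d2 array, distributed
    (BLOCK, BLOCK) over a P1-by-P2 processor grid, to its linear
    address in the restructured (column-major) layout."""
    b1 = math.ceil(d1 / P1)          # block size in dimension 1
    b2 = math.ceil(d2 / P2)          # block size in dimension 2
    p1, p2 = i // b1, j // b2        # owning processor coordinates
    # Within-block indices vary fastest, preserving the original
    # relative order; the processor dimensions vary slowest.
    return (i % b1) + b1 * ((j % b2) + b2 * (p1 + P1 * p2))
```

With a 4x4 array on a 2x2 processor grid, each processor's four elements occupy one contiguous range of four addresses, and the sixteen addresses together form a permutation of 0..15.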

Finally, we make one minor local optimization. If the highest dimension of the array is distributed as BLOCK, no permutation is necessary since the processor dimension is already in the rightmost position; thus no strip-mining is necessary either since, as discussed above, strip-mining on its own does not change the data layout.
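This per-dimension decision can be summarized in a small predicate. The representation below (one distribution tag per array dimension, with the last entry being the highest, slowest-varying dimension) is a hypothetical encoding of ours, not the implementation's.

```python
def dimension_needs_transform(dim_index, n_dims, distribution):
    """Return whether a distributed array dimension requires the
    strip-mine-and-permute transformation. A BLOCK distribution on
    the highest dimension already has its processor index in the
    rightmost position, so it is left untouched."""
    if distribution is None:
        return False                  # dimension is not distributed
    if dim_index == n_dims - 1 and distribution == 'BLOCK':
        return False                  # the minor local optimization
    return True
```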

Note that HPF statements can also be used as input to the data transformation algorithm. If an array is aligned to a template which is then distributed, we must find the equivalent distribution on the array directly. We use the alignment function to map from the distributed template dimensions back to the corresponding array dimensions. Any offsets in the alignment statement are ignored.
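The alignment-to-distribution mapping can be sketched as a dictionary lookup. In this hypothetical representation, `align` maps each array dimension to the template dimension it is aligned with (alignment offsets already dropped, per the text), and `template_dist` gives the distribution of each template dimension.

```python
def array_distribution(align, template_dist, n_array_dims):
    """Derive the equivalent per-array-dimension distribution from an
    HPF-style alignment to a distributed template. Array dimensions
    aligned to an undistributed template dimension stay undistributed."""
    dist = [None] * n_array_dims
    for array_dim, template_dim in align.items():
        dist[array_dim] = template_dist.get(template_dim)
    return dist
```

For instance, if `A(i, j)` is aligned with `T(j, i)` and `T` is distributed `(BLOCK, CYCLIC)`, the equivalent direct distribution of `A` is `(CYCLIC, BLOCK)`.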
