A number of researchers have addressed problems related to the decomposition problem. Sarkar and Gao have developed an algorithm that uses collective loop transformations to perform array contraction; they use loop interchange and reversal transformations to orient the computation. Ju and Dietz use a search-based algorithm to find combinations of data layout and loop restructuring that reduce cache coherence overhead on shared-memory machines. Hwang and Hu describe a method for finding the computation mapping of two systolic array stages that share a single array. Their algorithm works by first calculating the projection vector of the computation mapping, which is similar to what we call the partition.
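To illustrate the array contraction that such loop transformations enable, the following sketch (in illustrative Python, not Sarkar and Gao's formulation) shows a producer loop and a consumer loop whose fusion allows the intermediate array to be contracted to a scalar:

```python
def before(a, n):
    """Producer/consumer pair: t is a full temporary array."""
    t = [0.0] * n
    b = [0.0] * n
    for i in range(n):
        t[i] = 2.0 * a[i]       # producer loop writes all of t
    for i in range(n):
        b[i] = t[i] + 1.0       # consumer loop reads all of t
    return b

def after(a, n):
    """After fusing the loops, each t[i] is consumed immediately,
    so the array t contracts to a single scalar."""
    b = [0.0] * n
    for i in range(n):
        t = 2.0 * a[i]          # t contracted: scalar, not an array
        b[i] = t + 1.0
    return b
```

Loop interchange and reversal matter here because they can reorient the computation so that the producer and consumer iterate in a compatible order, making the fusion legal.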
Many projects have examined the problem of finding array alignments (what we call data orientations and displacements) for data parallel programs [8, 11, 21, 30, 35]. These approaches focus on element-wise array operations and attempt to eliminate communication between consecutive loops.
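The role of a displacement in such alignments can be seen in a small sketch (illustrative Python; the block distribution and offset convention here are assumptions, not taken from any of the cited systems). For a loop computing B[i] = A[i+1], aligning the two arrays identically forces communication at block boundaries, while a displacement of one makes every reference local:

```python
def owner(index, n, P, offset=0):
    """Processor owning element `index` when the array is aligned with
    displacement `offset` onto a size-n template, block-distributed over P."""
    block = -(-n // P)            # ceiling division: block size per processor
    return (index + offset) // block

n, P = 8, 2
# With identical alignment, references A[i+1] cross a block boundary:
boundary = [i for i in range(n - 1) if owner(i, n, P) != owner(i + 1, n, P)]
# Aligning A with displacement -1 relative to B makes each reference local:
local = all(owner(i, n, P) == owner(i + 1, n, P, offset=-1)
            for i in range(n - 1))
```

Here `boundary` lists the iterations that would need communication under the identity alignment, and `local` confirms that the shifted alignment removes them all.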
Li and Chen prove that the problem of finding optimal orientations is NP-complete, and they have developed a heuristic solution, which is used to implement their functional language Crystal on message-passing machines. In contrast to these approaches, our model supports loop nests containing both parallel and sequential loops as well as general affine array index functions. These approaches all optimize for a fixed degree of parallelism, whereas we make explicit decisions about which loops are run in parallel.
Several researchers have developed data decomposition algorithms based on searching through a fixed set of possible decompositions. Gupta and Banerjee have developed an algorithm for automatically finding a static data decomposition. Their approach is based on an exhaustive search through the possible decompositions, guided by a system of cost estimates. Carle et al. have developed an interactive tool, as part of the Fortran D project, that finds data decompositions within and across phases of a procedure; data can be remapped dynamically between phases. Their approach uses a static performance estimator to select the best decompositions among a fixed set of choices. In comparison, our algorithm avoids expensive searches by systematically calculating the decompositions. As a result of our mathematical model, we are able to derive decompositions that take into account pipeline communication within loop nests and data reorganization communication across loop nests.
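The general shape of such a search-based strategy can be sketched as follows (illustrative Python; the candidate set and the cost model are toy placeholders, not Gupta and Banerjee's actual estimates):

```python
from itertools import product

# Hypothetical fixed set of candidate decompositions per array.
CANDIDATES = ["block-rows", "block-cols", "cyclic"]

def estimated_cost(assignment):
    """Toy cost model: charge one unit for each pair of arrays mapped
    with different decompositions, a stand-in for redistribution cost."""
    decomps = list(assignment.values())
    return sum(decomps[i] != decomps[j]
               for i in range(len(decomps))
               for j in range(i + 1, len(decomps)))

def search(arrays):
    """Exhaustively enumerate decomposition choices for all arrays and
    return the assignment with the minimum estimated cost."""
    best = None
    for choice in product(CANDIDATES, repeat=len(arrays)):
        assignment = dict(zip(arrays, choice))
        cost = estimated_cost(assignment)
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best[1]
```

The search space grows as |CANDIDATES| raised to the number of arrays, which is why the cost-estimate-driven search is expensive compared with calculating the decompositions directly.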