Given that the dynamic decomposition problem is NP-hard, our compiler uses heuristics to find dynamic decompositions. The algorithm takes a greedy approach, eliminating the largest amounts of communication first: it joins the loop nodes with the greatest edge costs into the same component, thus eliminating the possibility of data reorganization between those two loop nodes. Only loop nests that have some degree of parallelism are considered when joining components; purely sequential loops are treated as components by themselves. An overview of the algorithm is shown in Figure 6. The routine Single_Level handles the base case of a single nesting level. The rest of the algorithm deals with the multiple-level case and is described in the next section.
Figure 6: Algorithm for calculating dynamic decompositions.
At each nesting level, the algorithm operates on a communication graph. The algorithm initializes the components such that each node in the communication graph is its own component, and then calculates the edge weights. The edge weights are a worst-case approximation of the actual communication cost. The worst case occurs when none of the decompositions match and all the data must be reorganized between each loop nest.
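The initialization step might be sketched as follows. This is a hypothetical Python rendering, not the paper's implementation; the `CommGraph` representation, the node set, and the cost model are all invented for illustration:

```python
# Hypothetical sketch of the initialization step: every loop node starts
# in its own component, and each edge carries a worst-case weight, i.e.
# the cost of reorganizing all data shared between the two loop nests.

class CommGraph:
    def __init__(self, nodes, edges):
        # edges: {(u, v): volume of data shared between loop nests u and v}
        self.nodes = list(nodes)
        self.edges = dict(edges)
        # Each node begins as a singleton component.
        self.component = {n: {n} for n in nodes}

    def edge_weight(self, u, v, reorg_cost_per_elem=1):
        # Worst case: no decompositions match, so all shared data must
        # be reorganized between the two loop nests.
        return self.edges[(u, v)] * reorg_cost_per_elem

# Invented example graph with four loop nodes.
g = CommGraph([1, 2, 3, 4], {(1, 4): 100, (1, 2): 60, (2, 4): 60, (2, 3): 10})
```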
The edges in E are examined in decreasing order of their weights. For each edge (u, v), the algorithm tries to join the current component of u and the current component of v into a single component. An interference graph is created from the loop nodes (and arrays referenced in the loops) in the new, joined component. The partition algorithm from section 5 is called with the interference graph to find the new partitions.
In forming the new component, the algorithm eliminates the data reorganization cost of the edge. However, the union operation may cause some (or all) of the loop nodes to execute sequentially, or it may generate pipeline communication within loop nodes (as a result of tiling). The algorithm therefore computes the value of the graph before and after the new partitions have been calculated. If the value of the graph is greater after the join, the new component is saved, and the algorithm records the new partitions of all loops and arrays within the new component. Otherwise, there is communication along the edge (u, v), and the new component is discarded.
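The greedy accept/reject loop of Single_Level might be sketched as follows. This is a hypothetical Python rendering, not the paper's implementation; `repartition` and `graph_value` stand in for the partition algorithm of Section 5 and the graph-value computation. The toy value model used at the bottom counts only eliminated reorganization cost, so it accepts every join; the real graph value would also penalize lost parallelism and pipelining:

```python
def single_level(nodes, edges, graph_value, repartition):
    """Greedy sketch: visit edges in decreasing weight order; keep a
    join only if the graph value improves with the new partitions."""
    comp_of = {n: frozenset([n]) for n in nodes}      # node -> component
    partitions = {c: repartition(c) for c in comp_of.values()}

    for (u, v), _w in sorted(edges.items(), key=lambda e: -e[1]):
        cu, cv = comp_of[u], comp_of[v]
        if cu == cv:
            continue                                  # already joined
        joined = cu | cv
        # Re-run the (stand-in) partition algorithm on the joined component.
        trial = {c: p for c, p in partitions.items() if c not in (cu, cv)}
        trial[joined] = repartition(joined)
        if graph_value(trial) > graph_value(partitions):
            partitions = trial       # join kept: reorganization on (u, v) gone
            for n in joined:
                comp_of[n] = joined
        # else: join discarded; communication remains along (u, v)
    return comp_of, partitions

# Invented example: value = total reorganization cost eliminated by
# edges internal to a component (always improves, so every join sticks).
edges = {(1, 4): 100, (1, 2): 60, (2, 4): 60, (2, 3): 10}

def graph_value(partitions):
    return sum(w for (u, v), w in edges.items()
               if any(u in c and v in c for c in partitions))

comp_of, _ = single_level([1, 2, 3, 4], edges, graph_value,
                          repartition=lambda c: None)  # trivial stand-in
```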
Consider the communication graph of Figure 5(a). For this example, we assume that the loop node weights are very large. As none of the dependences in the code are distance vectors, we assume that tiling is not practical. The edge between nodes 1 and 4 is examined first. The partition algorithm determines that there is at least one degree of parallelism without data reorganization communication between the two loops, so the nodes are joined. Next, the algorithm examines either the edge between nodes 1 and 2 or the edge between nodes 2 and 4. In this case the partition algorithm can still find parallelism among the nodes 1, 2 and 4. Next, the algorithm tries to add node 3 into the component. This time the partition algorithm finds that the only way to eliminate reorganization communication is to run all four loops sequentially, so the algorithm decides not to add node 3. Thus, nodes 1, 2 and 4 form one component and node 3 is in another component. Figure 5(b) illustrates the resulting components and Figure 5(c) shows the final decompositions within each component.
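The walk-through above can be replayed with a small self-contained sketch. The edge weights, the value model, and the rule that any component mixing node 3 with the others loses its parallelism are all invented to mimic the Figure 5 scenario; they are not taken from the paper:

```python
# Hypothetical replay of the Figure 5 example: joining node 3 with the
# others would force sequential execution, so the graph value drops and
# that join is rejected. All numbers below are invented.

BIG = 1000  # models the very large loop node weights assumed above
edges_w = {(1, 4): 100, (1, 2): 60, (2, 4): 60, (2, 3): 10}

def parallel(component):
    # Stand-in for the partition algorithm: any component mixing node 3
    # with the others can only run sequentially.
    return not (3 in component and len(component) > 1)

def value(assign):
    # Per-node parallelism bonus plus eliminated reorganization cost.
    comps = set(assign.values())
    bonus = sum(BIG * len(c) for c in comps if parallel(c))
    saved = sum(w for (u, v), w in edges_w.items() if assign[u] == assign[v])
    return bonus + saved

comp_of = {n: frozenset([n]) for n in [1, 2, 3, 4]}
for (u, v), _w in sorted(edges_w.items(), key=lambda e: -e[1]):
    cu, cv = comp_of[u], comp_of[v]
    if cu == cv:
        continue
    joined = cu | cv
    trial = {n: (joined if comp_of[n] in (cu, cv) else comp_of[n])
             for n in comp_of}
    if value(trial) > value(comp_of):
        comp_of = trial  # join accepted
```

Under this invented model, nodes 1, 2 and 4 end up in one component while node 3 remains alone, matching the outcome described above.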