Previous studies on limits of parallelism did not remove all of the dependences on induction variables. A question often raised is whether these induction variable dependences significantly affect the results of such studies. We therefore performed two experiments, one with and one without removing the induction variable dependences. Table 4 shows the percent change in parallelism relative to the case in which perfect loop unrolling is not performed; that is, a positive percent change means that removing the induction variable dependences improves parallelism. We found that removing these data dependences and the associated control dependences has mixed effects.
Although our simulation of perfect loop unrolling always decreases the program execution times, this does not necessarily imply that the parallelism increases. In fact, perfect unrolling has two competing effects on parallelism. Removing index variable dependences and loop branches exposes more parallelism. At the same time, however, removing the loop overhead instructions decreases the opportunities for overlapping those instructions with the rest of the computation in the loop, thereby decreasing the parallelism. Either effect may dominate, depending on the benchmark and the machine model.
For most of the non-numeric programs, unrolling has little effect; ccom and latex, for example, show almost no change at all. These programs primarily contain loops with many control and data dependences, so the dependences removed by unrolling are relatively insignificant. The loops in these programs also tend to iterate only a few times.
The CD-MF machine is most sensitive to perfect unrolling. Removing induction variable dependences allows multiple iterations with arbitrary control flow to execute in parallel. This can improve parallelism, as in the case of espresso. However, overlapping the loop overhead instructions accounts for much of the parallelism found in some loops; removing those instructions therefore decreases parallelism, as in the case of eqntott.
Perfect unrolling has the biggest impact on matrix300 and, to a lesser extent, tomcatv. These programs primarily execute simple loops where index variable dependences limit the parallelism. For these programs, the SP machine benefits the most from perfect unrolling. For nested loops, each iteration of an outer loop is separated by a mispredicted branch from the end of the inner loop. This prevents the outer loop iterations from executing in parallel. Perfect unrolling removes the loop branches, essentially coalescing the loops, so that these serializing mispredictions do not occur.
In general, the effects of loop index and induction variable dependences on parallelism vary depending on the application program and the machine model. As expected, the matrix-oriented numeric programs benefit significantly from perfect loop unrolling. For programs with complex control flow, unrolling often makes no significant difference.