As shown in Figure 5(B3)-(D3), the advanced array analyses significantly improve the parallelism coverage of bdna and qcd. For bdna, the additional parallel loops provide enough granularity to yield speedup. Granularity also increases for spec77 and trfd, with trfd achieving speedup. Although spec77 shows little parallel speedup, its improvement over the baseline system confirms the validity of our preference for outer-loop parallelism. Overall, SUIF doubles the number of programs that achieve speedup, from two to four.
The overall parallelization of the Perfect benchmarks was less successful than for the other two suites. As Figure 5 indicates, there are two basic problems. Half of the programs have coverage below 80%. Furthermore, the parallelism found is rather fine-grained: most of the parallelizable loops take less than 100 μs on a uniprocessor. In fact, had the run-time system not suppressed the parallelization of fine-grained loops in Perfect, the results would have been much worse. Thus, not only is the coverage low, but the system can exploit only a fraction of the parallelism extracted.
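The suppression of fine-grained loops can be sketched as a simple run-time cutoff: execute a loop in parallel only when its estimated work is large enough to amortize the fork/join overhead. The threshold value and the cost model below are illustrative assumptions, not the actual parameters of our run-time system.

```c
#include <assert.h>

/* Hypothetical work cutoff; the real system's threshold differs. */
#define GRAIN_THRESHOLD 10000L

/* Parallel execution pays off only when the loop's estimated work
   exceeds the fork/join overhead. */
static int should_parallelize(long iterations, long work_per_iter) {
    return iterations * work_per_iter >= GRAIN_THRESHOLD;
}

double array_sum(const double *a, long n) {
    double s = 0.0;
    if (should_parallelize(n, 1)) {
        /* coarse-grained: worth running in parallel */
        #pragma omp parallel for reduction(+:s)
        for (long i = 0; i < n; i++) s += a[i];
    } else {
        /* fine-grained: sequential execution avoids the overhead */
        for (long i = 0; i < n; i++) s += a[i];
    }
    return s;
}
```

Without OpenMP the pragma is ignored and both branches run sequentially, so the test costs little where it does not help.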
We now examine the difficulties in parallelizing Perfect, both to assess the feasibility of automatic parallelization and to identify future research directions. We found that some of these programs are simply not parallelizable as implemented. Some perform substantial input and output (e.g. mg3d and spec77), so their speedup depends on successfully parallelizing I/O. Further, ``dusty deck'' features of these programs, such as the use of equivalence constructs in ocean, hide information from the analysis. In contrast, most of the SPEC92FP and NAS programs are cleanly implemented and are thus more amenable to automatic parallelization.
For many of these programs, particularly ocean, adm, and mdg, key computational loops are safe to parallelize but lie beyond the scope of the techniques implemented in SUIF. The programs ocean and adm contain non-linear array subscripts involving multiplicative induction variables, which are beyond the scope of the higher-order induction variable recognition. There will always be extensions that improve an automatic parallelization system's effectiveness on some programs; nonetheless, static parallelization has a fundamental limitation: some programs cannot be parallelized with compile-time information alone. For example, the main loop in adm is parallelizable only if the problem size, which is unknown at compile time, is even. A promising solution is to have the program check at run time, using dynamic information, whether the loop is parallelizable. Interprocedural analysis and optimization can play an important part in such an approach by improving the efficiency of the run-time tests: it can derive highly optimized tests and hoist them to less frequently executed portions of the program, possibly even across procedure boundaries. The interprocedural analysis in our system provides an excellent starting point for work in this area.
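The run-time test idea can be sketched as follows. Suppose (as a hypothetical stand-in for adm's main loop, not its actual code) that dependence analysis has shown a loop's iterations to be independent exactly when the problem size n is even; a single test hoisted outside the loop then selects the parallel or the sequential version.

```c
#include <assert.h>

/* Sequential fallback: the loop body, run in order. */
static void step_seq(double *a, int n) {
    for (int i = 0; i < n; i++)
        a[i] += 1.0;
}

/* Hypothetical kernel: assume analysis proved the iterations are
   independent exactly when n is even.  The parity check is evaluated
   once, at run time, outside the loop. */
void run_step(double *a, int n) {
    if (n % 2 == 0) {
        /* run-time test passed: parallel version is safe */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] += 1.0;
    } else {
        step_seq(a, n);  /* test failed: sequential version */
    }
}
```

Both branches compute the same result; only the schedule differs. Interprocedural hoisting would move such a test even further out, e.g. to the call site of the enclosing procedure, so it executes once per run rather than once per call.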
The advanced analyses can also form the basis for a useful interactive parallelization system. Even when the analyses are not strong enough to determine that a loop is parallelizable, their results can isolate the problematic code and focus the user's attention on it. For example, in the program qcd our compiler finds a 617-line interprocedural loop that would be parallelizable if not for a single small procedure. Examination of that procedure reveals that it is a random number generator, which a user can potentially modify to run in parallel. With very little help from the user, the compiler can then parallelize the loop, performing all the tedious privatization and reduction transformations automatically.