Symmetric shared-memory multiprocessors, built out of the latest microprocessors, are now a widely available class of powerful machines. As hardware technology advances make pervasive parallel computing a possibility, compilers which can extract parallelism from sequential codes become important tools to simplify parallel programming. Unfortunately, today's commercially available parallelizing compilers are not effective at getting good performance on multiprocessors [3,19]. These compilers tend to be successful in parallelizing only innermost loops. Parallelizing just inner loops is not adequate for multiprocessors for two reasons. First, inner loops may not make up a significant portion of the sequential computation, thus limiting the parallel speedup by limiting the amount of parallelism. Second, synchronizing processors at the end of inner loops leaves little computation occurring in parallel between synchronization points. The cost of frequent synchronization and its associated load imbalance can potentially overwhelm the benefits of parallelization.
If compilers are to successfully locate outer, coarse-grain parallel loops, two improvements are needed. First, parallelizing compilers must incorporate advanced array analyses, generalizing techniques currently only applied to scalar variables. For example, the compiler must recognize opportunities for array privatization, whereby storage-related dependences on array variables are eliminated by making a private copy of the array for each processor. As another example, the compiler must recognize opportunities to parallelize array reductions, such as computations of a sum, product, or maximum over array elements.
A second essential requirement for recognizing coarse-grain parallel loops is that procedures must not pose a barrier to analysis. One way to eliminate procedure boundaries is to perform inline substitution---replacing each procedure call by a copy of the called procedure---and perform analysis in the usual way. This is not a practical solution for large programs, as it is inefficient in both time and space. Interprocedural analysis, which applies data-flow analysis techniques across procedure boundaries, can be much more efficient as it analyzes only a single copy of each procedure. However, progress in interprocedural analysis has been inhibited by the complexity of interprocedural systems and the inherent tradeoff between performing analysis efficiently and obtaining precise results.
We have developed an automatic parallelization system that is fully interprocedural, and incorporates all standard analyses included in today's parallelizers, such as data dependence analysis, analyses of scalar values such as induction variable recognition, and scalar dependence and reduction recognition. In addition, the system employs analyses for array privatization and array reduction recognition. This system has allowed extensive empirical evaluation of automatic parallelization of three standard benchmark suites, demonstrating significant improvements over previous interprocedural parallelization systems and the technology available in commercial systems.
This paper describes the components of this system, and the interprocedural analysis framework in which they were developed. The key distinguishing features of this system are as follows. First, the interprocedural analysis is designed to be practical while providing nearly the same quality of analysis as if the program were fully inlined. Second, the array analysis incorporates a mathematical formulation of array reshapes at procedure boundaries, supporting changes in dimension between actual and corresponding formal parameters. Third, the system recognizes interprocedural array reductions. Finally, because the system has been used in an extensive empirical evaluation, the implementations of all the analysis techniques extend previous work to meet the demands of parallelizing real programs.
The remainder of the paper is organized into seven sections. Section 2 compares our work with other automatic parallelization systems. In Section 3, we present the interprocedural analysis framework and algorithm. Sections 4 and 5, describe the analysis of scalar variables and array variables, presented as instantiations of the analysis framework from Section 3. Section 6 describes how the interprocedural array analysis is extended to recognize array reductions. The final two sections discuss experiences with this system and conclude.