Symmetric shared-memory multiprocessors, built out of the latest microprocessors, are now a widely available class of computationally powerful machines. As hardware technology advances make pervasive parallel computing a possibility, it is ever more important that tools be developed to simplify parallel programming. A parallelizing compiler that automatically locates parallel computations in sequential programs is a particularly attractive programming tool, as it frees programmers from the difficult task of explicitly managing parallelism in their programs.
Unfortunately, today's commercially available parallelizing compilers are not effective at getting good performance on multiprocessors [3,23]. As these parallelizers were developed from vectorizing compiler technology, they tend to be successful in parallelizing only innermost loops. Parallelizing just inner loops is not adequate for multiprocessors for two reasons. First, inner loops may not make up a significant portion of the sequential computation, thus limiting the parallel speedup by limiting the amount of parallelism. Second, synchronizing processors at the end of the inner loops leaves little computation occurring in parallel between synchronization points. The cost of frequent synchronization and load imbalance can potentially overwhelm the benefits of parallelization.
Multiprocessors are more powerful than vector machines in that they can execute different threads of control simultaneously, and can thus exploit a coarser granularity of parallelism. For a parallelizing compiler to target a multiprocessor effectively, it must therefore identify parallelizable outer loops to extract coarse-grain parallelism. This requires two major improvements over standard parallelization techniques: advanced array analyses, such as array privatization and array reduction recognition, and interprocedural analysis, which extends the compiler's scope across procedure boundaries.
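The contrast between inner- and outer-loop parallelization can be made concrete with a small sketch (the loop nest and function name here are our own illustration, not taken from the benchmarks evaluated later):

```c
#define N 4
#define M 4

/* Each outer iteration writes a distinct sum[i] and reads a distinct
 * row a[i][*], so the outer loop carries no dependence.  Parallelizing
 * the outer loop runs all N*M operations between two synchronization
 * points; parallelizing only the M-iteration inner loop would instead
 * synchronize N times and give each processor far less work per
 * parallel region. */
void row_sums(double a[N][M], double sum[N]) {
    for (int i = 0; i < N; i++) {        /* coarse-grain: parallelizable */
        sum[i] = 0.0;
        for (int j = 0; j < M; j++)      /* fine-grain: only M iterations */
            sum[i] += a[i][j];
    }
}
```

A vectorizing-style parallelizer that considers only the innermost loop sees just the short reduction over j; recognizing that the i loop is itself parallel is what yields coarse-grain work.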
Interprocedural analysis, which applies data-flow analysis techniques across procedure boundaries, can be much more efficient than full inlining because it analyzes only a single copy of each procedure. Although much research has been devoted to interprocedural analysis for parallelization [8,12,13,15,18], it has not been adopted in practice. The primary obstacle to progress in this area has been that effective interprocedural compilers are substantially harder to build than their intraprocedural counterparts. Moreover, there is an inherent tradeoff between performing analysis efficiently and obtaining precise results. To be successful, an interprocedural compiler must manage the complexity of the compilation process while remaining efficient and without sacrificing too much precision.
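A minimal sketch of why such analysis matters (the procedure names are hypothetical, chosen only for illustration):

```c
/* update() writes only the single element it is passed.  Without
 * knowledge of the callee, a compiler must conservatively assume the
 * call may touch any element of a and leave the loop sequential.  An
 * interprocedural compiler analyzes update() once, summarizes its side
 * effects ("writes a[i] only"), and reuses that summary at every call
 * site, avoiding the code growth and repeated analysis of inlining. */
void update(double *a, int i) {
    a[i] = 2.0 * a[i];
}

void scale_all(double *a, int n) {
    for (int i = 0; i < n; i++)   /* parallel only if update()'s effects are known */
        update(a, i);
}
```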
We have developed an automatic parallelization system that is fully interprocedural. The system incorporates all the standard analyses found in today's automatic parallelizers: data dependence analysis and scalar analyses, including constant propagation, value numbering, induction-variable recognition, and scalar dependence and reduction recognition. In addition, the system employs analyses for array privatization and array reduction recognition. The implementation of these techniques extends previous work to meet the demands of parallelizing real programs. The interprocedural analysis is designed to be practical while providing nearly the same quality of analysis as if the program were fully inlined.
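The two array analyses can be illustrated in one sketch (again our own example, with hypothetical names, assuming a loop of the shape these analyses commonly target):

```c
#define NROW 2
#define NCOL 3
#define NBINS 4

/* t[] is written in every iteration before it is read, so each
 * processor can be given a private copy (array privatization) and the
 * apparent loop-carried dependence on t disappears.  hist[] is only
 * ever updated by commutative increments, so each processor can
 * accumulate a private partial histogram that is merged after the
 * loop (array reduction recognition). */
void smooth_and_count(double a[NROW][NCOL], double out[NROW], int hist[NBINS]) {
    double t[NCOL];                          /* privatizable work array */
    for (int i = 0; i < NROW; i++) {
        for (int j = 0; j < NCOL; j++)
            t[j] = 0.5 * a[i][j];            /* t defined before any use */
        out[i] = t[0] + t[NCOL - 1];
        hist[(int)a[i][0] % NBINS] += 1;     /* reduction update on hist */
    }
}
```

Without both analyses, the writes to t and hist would appear as dependences across iterations of the i loop and force sequential execution.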
This paper presents a comprehensive evaluation of the effectiveness of this system at locating coarse-grain parallel computations in a large collection of programs from the SPEC92FP, NAS and PERFECT benchmark suites. We demonstrate how the techniques in our system significantly improve the performance of automatically parallelized codes both by increasing the portion of the program that is executed in parallel and by reducing the synchronization overhead as a result of parallelizing outer loops rather than inner ones.
The remainder of the paper is organized as follows. Section 2 presents the types of advanced analyses required to parallelize full applications, and Section 3 discusses the requirements for precise and efficient interprocedural analysis. Section 4 overviews the parallelization analysis in our system. Section 5 compares our work with other automatic parallelization systems. Section 6 is devoted to the empirical evaluation of the system, followed by a conclusion.