
Introduction

Symmetric shared-memory multiprocessors, built out of the latest microprocessors, are now a widely available class of computationally powerful machines. As hardware technology advances make pervasive parallel computing a possibility, it is ever more important that tools be developed to simplify parallel programming. A parallelizing compiler that automatically locates parallel computations in sequential programs is a particularly attractive programming tool, as it frees programmers from the difficult task of explicitly managing parallelism in their programs.

Unfortunately, today's commercially available parallelizing compilers are not effective at delivering good performance on multiprocessors [3,23]. Because these parallelizers were developed from vectorizing compiler technology, they tend to succeed in parallelizing only innermost loops. Parallelizing just inner loops is not adequate for multiprocessors for two reasons. First, inner loops may not make up a significant portion of the sequential computation, limiting the parallel speedup by limiting the amount of parallelism. Second, synchronizing processors at the end of each inner loop leaves little computation occurring in parallel between synchronization points, and the cost of frequent synchronization and load imbalance can overwhelm the benefits of parallelization.
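To make the granularity argument concrete, the following is a minimal sketch, written in C with OpenMP-style pragmas purely for illustration (the pragmas, loop bounds, and names are invented and are not taken from the paper or its benchmarks), contrasting inner-loop and outer-loop parallelization of the same loop nest.

    #define N 1024

    /* Fine-grain version: only the inner loop is parallel.  Processors are
     * forked, joined, and synchronized at the end of every j loop, i.e. once
     * per iteration of i, and each parallel task contains only O(N) work. */
    void smooth_inner(double a[N][N], double b[N][N])
    {
        for (int i = 1; i < N - 1; i++) {
            #pragma omp parallel for        /* implicit barrier after each i */
            for (int j = 1; j < N - 1; j++)
                b[i][j] = 0.5 * (a[i - 1][j] + a[i + 1][j]);
        }
    }

    /* Coarse-grain version: the outer loop is parallel, so the whole nest runs
     * between a single fork and a single barrier, and each processor receives
     * roughly N*N/P iterations of work, amortizing the synchronization cost. */
    void smooth_outer(double a[N][N], double b[N][N])
    {
        #pragma omp parallel for            /* one barrier for the entire nest */
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                b[i][j] = 0.5 * (a[i - 1][j] + a[i + 1][j]);
    }

Both versions execute the same iterations; the difference is only in how often the processors must synchronize and how much work lies between synchronization points.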

Multiprocessors are more powerful than vector machines in that they can execute different threads of control simultaneously, and can therefore exploit a coarser granularity of parallelism. Thus, for a parallelizing compiler to target a multiprocessor effectively, it must identify outer parallelizable loops to extract coarse-grain parallelism. This requires two major improvements over standard parallelization techniques:

Advanced Array Analyses.
A loop is often not parallelizable unless the compiler modifies the data structures it accesses. For example, it is very common for each iteration of a loop to define and then use the same variable. The compiler must give each processor a private copy of the variable for the loop to be parallelizable. As another example, a compiler can parallelize a reduction (e.g., the computation of a sum, product, or maximum over data elements) by having each processor compute a partial reduction locally and update the global result at the end. Compilers have traditionally performed privatization and reduction transformations only on scalar variables. To find outer parallel loops, the compiler must be able to perform these transformations on array variables as well [3,23].
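The following sketch, in C with invented names, shows the two situations just described: a work array that must be privatized because every iteration defines it before using it, and an array reduction (here a histogram update) that can be parallelized by accumulating private partial results.

    #define N 1024
    #define M 64

    /* Array privatization: every iteration of the i loop writes work[] before
     * reading it, so the apparent dependence on work[] across iterations is
     * spurious.  Giving each processor a private copy of work[] makes the
     * i loop parallelizable. */
    void scale_rows(double a[N][M], double out[N])
    {
        double work[M];                        /* candidate for privatization */
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < M; j++)
                work[j] = 2.0 * a[i][j];       /* defined in this iteration ... */
            out[i] = work[0] + work[M - 1];    /* ... and used in the same one  */
        }
    }

    /* Array reduction: the only access to hist[] is a commutative, associative
     * update, so each processor can accumulate into a private copy of hist[]
     * and the partial results can be summed into hist[] after the loop.
     * (Assumes 0 <= bin[i] < M.) */
    void histogram(const int bin[N], const double val[N], double hist[M])
    {
        for (int i = 0; i < N; i++)
            hist[bin[i]] += val[i];            /* reduction on the array hist[] */
    }
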

Interprocedural Analysis.
If programs are written in a modular style, it is natural that coarse-grain parallel loops will span multiple procedures. For this reason, procedure boundaries must not pose a barrier to analysis [3]. One way to eliminate procedure boundaries is to perform inline substitution---replacing each procedure call by a copy of the called procedure---and perform program analysis in the usual way. This is not a practical solution for large programs, as it is inefficient in both time and space.

Interprocedural analysis, which applies data-flow analysis techniques across procedure boundaries, can be much more efficient as it analyzes only a single copy of each procedure. Although much research has been devoted to interprocedural analysis for parallelization [8,12,13,15,18], it has not been adopted in practice. The primary obstacle to progress in this area has been that effective interprocedural compilers are substantially harder to build than their intraprocedural counterparts. Moreover, there is an inherent tradeoff between performing analysis efficiently and obtaining precise results. To be successful, an interprocedural compiler must tackle the complexity of the compilation process, while maintaining reasonable efficiency without sacrificing too much precision.
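As a hypothetical illustration (again in C, with invented names), the fragment below shows a coarse-grain parallel loop whose body lies almost entirely inside a called procedure. To parallelize the i loop, the compiler must either inline process_row() or summarize its side effects across the call boundary and prove that each call touches only row i of a[] and element i of x[].

    #define N 1024
    #define M 1024

    /* The callee reads and writes only the row it is passed and the single
     * result location, but that fact is invisible to a purely intraprocedural
     * analysis of process_all(). */
    static void process_row(double row[M], double *result)
    {
        double sum = 0.0;
        for (int j = 0; j < M; j++) {
            row[j] *= row[j];
            sum += row[j];
        }
        *result = sum;
    }

    /* The i loop is coarse-grain parallel, but only an analysis that sees
     * across the call (inlining or interprocedural summaries) can prove that
     * distinct iterations access disjoint data. */
    void process_all(double a[N][M], double x[N])
    {
        for (int i = 0; i < N; i++)
            process_row(a[i], &x[i]);
    }
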

We have developed an automatic parallelization system that is fully interprocedural. The system incorporates all the standard analyses included in today's automatic parallelizers: data dependence analysis and analyses of scalar variables, including constant propagation, value numbering, induction variable recognition, and scalar dependence and reduction recognition. In addition, the system employs analyses for array privatization and array reduction recognition. The implementation of these techniques extends previous work to meet the demands of parallelizing real programs. The interprocedural analysis is designed to be practical while providing nearly the same quality of analysis as if the program were fully inlined.

This paper presents a comprehensive evaluation of the effectiveness of this system at locating coarse-grain parallel computations in a large collection of programs from the SPEC92FP, NAS and PERFECT benchmark suites. We demonstrate how the techniques in our system significantly improve the performance of automatically parallelized codes both by increasing the portion of the program that is executed in parallel and by reducing the synchronization overhead as a result of parallelizing outer loops rather than inner ones.

The remainder of the paper is organized as follows. Section 2 presents the types of advanced analyses required to parallelize full applications, and Section 3 outlines the requirements for precise and efficient interprocedural analysis. Section 4 gives an overview of the parallelization analyses in our system. Section 5 compares our work with other automatic parallelization systems. Section 6 is devoted to the empirical evaluation of the system, followed by a conclusion.


