next up previous
Next: Evaluation Up: Experimental Results Previous: Experimental Results

Experimental Setup

The inputs to the SUIF compiler are sequential FORTRAN or C programs. The output is a parallelized C program that contains calls to a portable run-time library. The C code is then compiled on the parallel machine using the native C compiler.

Our target machine is the DASH multiprocessor. DASH has a cache-coherent NUMA architecture. The machine we used for our experiments consists of 32 processors, organized into 8 clusters of 4 processors each. Each processor is a 33MHz MIPS R3000, and has a 64KB first-level cache and a 256KB second-level cache. Both the first- and second-level caches are direct-mapped and have 16B lines. Each cluster has 28MB of main memory. A directory-based protocol is used to maintain cache coherence across clusters. It takes a processor 1 cycle to retrieve data from its first-level cache, about 10 cycles from its second-level cache, 30 cycles from its local memory and 100-130 cycles from a remote memory. The DASH operating system allocates memory to clusters at the page level. The page size is 4KB and pages are allocated to the first cluster that touches the page. We compiled the C programs produced by SUIF using gcc version 2.5.8 at optimization level -O3.

To focus on the memory hierarchy issues, our benchmark suite includes only those programs that have a significant amount of parallelism. Several of these programs were identified as having memory performance problems in a simulation study[31]. We compiled each program under each of the methods described below, and plot the speed up of the parallelized code on DASH. All speedups are calculated over the best sequential version of each program.

We compiled the programs with the basic parallelizer in the original SUIF system. This parallelizer has capabilities similar to traditional shared-memory compilers such as KAP[22]. It has a loop optimizer that applies unimodular transformations to one loop at a time to expose outermost loop parallelism and to improve data locality among the accesses within the loop[34,35].

We first applied the basic parallelizer to analyze the individual loops, then applied the algorithm in Section 3 to find computation decompositions (and the corresponding data decompositions) that minimize communication across processors. These computation decompositions are passed to a code generator which schedules the parallel loops and inserts calls to the run-time library. The code generator also takes advantage of the information to optimize the synchronization in the program[32]. The data layouts are left unchanged and are stored according to the FORTRAN convention.

Here, we used the optimizations in the base compiler as well as all the techniques described in this paper. Given the data decompositions calculated during the computation decomposition phase, the compiler reorganizes the arrays in the parallelized code to improve spatial locality, as described in section 4.

next up previous
Next: Evaluation Up: Experimental Results Previous: Experimental Results

Saman Amarasinghe
Fri Apr 7 11:22:17 PDT 1995