The inputs to the SUIF compiler are sequential FORTRAN or C programs. The output is a parallelized C program that contains calls to a portable run-time library. The C code is then compiled on the parallel machine using the native C compiler.
Our target machine is the DASH multiprocessor. DASH has a cache-coherent NUMA architecture. The machine we used for our experiments consists of 32 processors, organized into 8 clusters of 4 processors each. Each processor is a 33MHz MIPS R3000, and has a 64KB first-level cache and a 256KB second-level cache. Both the first- and second-level caches are direct-mapped and have 16B lines. Each cluster has 28MB of main memory. A directory-based protocol is used to maintain cache coherence across clusters. It takes a processor 1 cycle to retrieve data from its first-level cache, about 10 cycles from its second-level cache, 30 cycles from its local memory and 100-130 cycles from a remote memory. The DASH operating system allocates memory to clusters at the page level. The page size is 4KB and pages are allocated to the first cluster that touches the page. We compiled the C programs produced by SUIF using gcc version 2.5.8 at optimization level -O3.
To focus on the memory hierarchy issues, our benchmark suite includes only those programs that have a significant amount of parallelism. Several of these programs were identified as having memory performance problems in a simulation study. We compiled each program under each of the methods described below, and plot the speed up of the parallelized code on DASH. All speedups are calculated over the best sequential version of each program.