We have implemented the algorithms described in this paper in the SUIF compiler at Stanford. The experiments described in this section were performed on the Stanford DASH shared-memory multiprocessor. Since we do not yet have a code generator for DASH, we hand-coded parallel SPMD programs using the decompositions generated by our compiler. All programs were compiled with the SGI f77 compiler at the -O2 optimization level.
The DASH multiprocessor is made up of a number of physically distributed clusters. Each cluster is based on a Silicon Graphics POWER Station 4D/340, consisting of four MIPS R3000/R3010 processors.
A directory-based protocol is used to maintain cache coherence across clusters. It takes a processor 1 cycle to retrieve data from its cache, 29 cycles from its local memory, and 100-130 cycles from a remote memory. The DASH operating system allocates memory to clusters at the page level; if a page is not assigned to a specific cluster, it is allocated to the first cluster that touches it.
We compare the decomposition our algorithm finds with the decomposition the SGI Power Fortran Accelerator (version 4.0.5) parallelizing compiler used. We also compare our results with other possible decompositions. We ran our programs on an 8-cluster DASH multiprocessor, with 28MB of main memory per cluster.
We looked at the heat conduction phase of the application SIMPLE, a two-dimensional Lagrangian hydrodynamics code from Lawrence Livermore National Laboratory. The heat conduction routine conduct is 165 lines long and has about 20 loop nests. Within this routine is a set of loops that performs an ADI integration, where the parallelism is first across the rows of the arrays and then across the columns. In all cases, we used a blocked distribution scheme.
Figure 7 shows the speedups (over the best sequential version) of four different decompositions of this routine, for a problem size of 1K × 1K using double precision.
Figure 7: Speedup over sequential execution time for conduct. The problem size is 1K × 1K, double precision.
The total amount of data used by this routine is on the order of 128MB. When the amount of memory needed by a cluster exceeds the memory available on that cluster, the DASH operating system allocates the memory on the next available cluster. Thus, when executing on four or fewer clusters, some of the data used by the application may actually be allocated on other clusters.
The first curve, labeled "no optimization", shows the results of the SGI Power Fortran compiler. We allowed the DASH operating system to allocate each page to the first cluster that accessed it. Since Fortran arrays are allocated in column-major order, this resulted in blocks of columns being allocated to the clusters. When the only available parallelism is across rows, the processors perform remote reads and writes. The second curve, labeled "static", shows the performance when a single data decomposition is used for each array; in this case, blocks of rows were made contiguous in the shared address space. This represents the best possible static decomposition if only forall parallelism is exploited. The third curve, labeled "dynamic, no pipelining", reallocates the data when the dimension of parallelism changes; in this case, the program incurs the cost of reorganization communication each time the data is reallocated. This curve represents the best possible overall decomposition with only forall parallelism. The fourth curve, labeled "dynamic and pipelining", shows the results of allocating blocks of rows contiguously and using explicit synchronization between processors when the parallelism is across columns. In this version, the processors synchronize only between blocks of columns (we used a block size of 4). This is the decomposition our compiler finds when it considers both pipeline and reorganization communication.