Our next program is LU decomposition without pivoting.
The code is shown in Figure 5 and the speedups for
each version of LU decomposition are displayed in
Figure 6 for two different
data set sizes (
and
).
Figure 5: LU Decomposition Code
The base compiler identifies the second loop as the outermost parallelizable loop nest, and distributes its iterations uniformly across processors in a block fashion. As the number of iterations in this parallel loop varies with the index of the outer sequential loop, each processor accesses different data each time through the outer loop. A barrier is placed after the distributed loop and is used to synchronize between iterations of the outer sequential loop.
The computation decomposition algorithm minimizes true-sharing by
assigning all operations on the same column of data to the same
processor. For load balance, the columns and operations on the
columns are distributed across the processor in a cyclic manner.
By fixing the assignment of computation to processors, the compiler
replaces the barriers that followed each execution of the parallel
loop by locks. Even though this version has good load balance, good
data re-use and inexpensive synchronization, the local data accessed
by each processor are scattered in the shared address space,
increasing chances of interference in the cache between columns of the
array. The interference is highly sensitive to the array size and the
number of processors; the effect of the latter can be seen in
Figure 6. This interference effect can be
especially pronounced if the array size and the number of processors
are both powers of 2. For example, for the
matrix,
every 8th column maps to the same location in DASH's direct-mapped 64K
cache. The speedup for 31 processors is 5 times better than for 32
processors.
The data transformation algorithm restructures the columns of the array so that each processor's cyclic columns are made into a contiguous region. After restructuring, the performance stabilizes and is consistently high. In this case the compiler is able to take advantage of inexpensive synchronization and data re-use without incurring the cost of poor cache behavior. Speedups become super-linear in some cases due to the fact that once the data are partitioned among enough processors, each processor's working set will fit into local memory.
Figure 6: LU Decomposition Speedups