Our next program is LU decomposition without pivoting. The code is shown in Figure 5 and the speedups for each version of LU decomposition are displayed in Figure 6 for two different data set sizes ( and ).
Figure 5: LU Decomposition Code
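Figure 5 itself is not reproduced here. As a rough sketch (our own naming, not the paper's code), the kernel is the standard right-looking LU factorization without pivoting, written so that all updates to a given column occur together; the second loop, over columns `j`, is the one the compiler parallelizes:

```python
def lu_nopivot(A):
    """In-place LU factorization without pivoting (illustrative sketch).

    Overwrites A with L (unit lower triangle, unit diagonal implicit)
    and U (upper triangle), so that L * U equals the original A.
    """
    n = len(A)
    for k in range(n - 1):              # outer sequential loop
        for i in range(k + 1, n):       # scale column k of L
            A[i][k] /= A[k][k]
        for j in range(k + 1, n):       # parallelizable loop over columns
            for i in range(k + 1, n):   # update column j of the trailing matrix
                A[i][j] -= A[i][k] * A[k][j]
    return A
```

For example, `lu_nopivot([[4.0, 3.0], [6.0, 3.0]])` leaves `1.5` (the multiplier) below the diagonal and `[[4, 3], [0, -1.5]]` as U. Note that the parallel `j` loop runs over `k+1 .. n-1`, so its trip count shrinks as the outer loop advances.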
The base compiler identifies the second loop as the outermost parallelizable loop in the nest and distributes its iterations uniformly across the processors in a block fashion. Because the number of iterations in this parallel loop varies with the index of the outer sequential loop, each processor accesses different data on each pass through the outer loop. A barrier placed after the distributed loop synchronizes successive iterations of the outer sequential loop.
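A minimal sketch of such a block distribution (function and variable names are ours, not SUIF's) makes the data-reuse problem visible: because the parallel loop's bounds depend on the outer index `k`, the block of iterations owned by a given processor shifts every time through the outer loop.

```python
def block_range(lo, hi, p, P):
    """Iterations [lo, hi) assigned to processor p of P under a
    uniform block distribution (illustrative, not the actual compiler's code)."""
    n = hi - lo
    chunk = (n + P - 1) // P          # ceiling division: block size
    start = lo + p * chunk
    return start, min(start + chunk, hi)
```

With `n = 8` and `P = 4`, at outer iteration `k = 0` processor 0 owns columns `[1, 3)`, but by `k = 3` it owns `[4, 5)`: the columns it touched earlier now belong to other processors, defeating cache reuse.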
The computation decomposition algorithm minimizes true sharing by assigning all operations on the same column of data to the same processor. For load balance, the columns, and the operations on them, are distributed across the processors in a cyclic manner. Because the assignment of computation to processors is now fixed, the compiler can replace the barriers that followed each execution of the parallel loop with locks. Even though this version has good load balance, good data re-use and inexpensive synchronization, the data accessed by each processor are scattered through the shared address space, increasing the chance of interference in the cache between columns of the array. The interference is highly sensitive to the array size and the number of processors; the effect of the latter can be seen in Figure 6. It is especially pronounced when the array size and the number of processors are both powers of 2. For example, for the matrix, every 8th column maps to the same location in DASH's direct-mapped 64K cache. The speedup on 31 processors is 5 times better than on 32 processors.
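The conflict pattern follows from simple address arithmetic. As an illustration only (the column length below is a hypothetical size, not taken from the paper), assume contiguously stored columns of 1024 doubles, i.e. 8192 bytes each:

```python
CACHE_BYTES = 64 * 1024        # DASH's direct-mapped 64K cache
COL_DOUBLES = 1024             # hypothetical column length, for illustration
COL_BYTES = COL_DOUBLES * 8    # bytes per column of 8-byte doubles

def conflict_stride(col_bytes, cache_bytes):
    """Smallest positive column distance d such that columns j and j + d
    start at the same index of a direct-mapped cache (contiguous columns)."""
    d = 1
    while (d * col_bytes) % cache_bytes != 0:
        d += 1
    return d
```

Under these assumptions `conflict_stride(COL_BYTES, CACHE_BYTES)` is 8, matching the every-8th-column conflict described above. With a cyclic assignment, processor `p` owns columns `p, p + P, p + 2P, ...`; whenever `P` is a multiple of the conflict stride, every column a processor owns maps to the same cache location, which is why power-of-2 processor counts behave so badly.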
The data transformation algorithm restructures the columns of the array so that each processor's cyclic columns form a contiguous region. After restructuring, the performance stabilizes and is consistently high. In this case the compiler is able to take advantage of inexpensive synchronization and data re-use without incurring the cost of poor cache behavior. Speedups become super-linear in some cases because, once the data are partitioned among enough processors, each processor's working set fits into local memory.
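The restructuring amounts to a permutation of columns: the columns cyclically assigned to each processor are gathered into one contiguous block. A minimal sketch of the layout change (our own naming, abstracting away the compiler's address-translation details):

```python
def make_cyclic_contiguous(cols, P):
    """Permute a list of columns so that processor p's cyclically assigned
    columns (p, p + P, p + 2P, ...) occupy one contiguous region."""
    return [cols[j] for p in range(P) for j in range(p, len(cols), P)]
```

For 8 columns and `P = 2`, the new order is `[0, 2, 4, 6, 1, 3, 5, 7]`: processor 0's columns occupy the first half of the array and processor 1's the second, so each processor's data are dense in the address space and no longer conflict in the cache.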
Figure 6: LU Decomposition Speedups