We first illustrate the issues involved in optimizing memory system performance on multiprocessors, and define the terms that are used in this paper. True sharing cache misses occur whenever two processors access the same data word. True sharing requires the processors involved to explicitly synchronize with each other to ensure program correctness. A computation is said to have temporal locality if it re-uses much of the data it has been accessing; programs with high temporal locality tend to have less true sharing. The amount of true sharing in the program is a critical factor for performance on multiprocessors; high levels of true sharing and synchronization can easily overwhelm the advantage of parallelism.
It is important to take synchronization and sharing into consideration when deciding on how to parallelize a loop nest and how to assign the iterations to processors. Consider the code shown in Figure 1(a). While all the iterations in the first two-deep loop nest can run in parallel, only the inner loop of the second loop nest is parallelizable. To minimize synchronization and sharing, we should also parallelize only the inner loop in the first loop nest. By assigning the ith iteration in each of the inner loops to the same processor, each processor always accesses the same rows of the arrays throughout the entire computation. Figure 1(b) shows the data accessed by each processor in the case where each processor is assigned a block of rows. In this way, no interprocessor communication or synchronization is necessary.
Figure 1: A simple example: (a) sample code, (b) original data mapping and (c) optimized data mapping. The light grey arrows show the memory layout order.
Due to characteristics found in typical data caches, it is not sufficient to just minimize sharing between processors. First, data are transferred in fixed-size units known as cache lines, which are typically 4 to 128 bytes long. A computation is said to have spatial locality if it uses multiple words in a cache line before the line is displaced from the cache. While spatial locality is a consideration for both uni- and multiprocessors, false sharing is unique to multiprocessors. False sharing results when different processors use different data that happen to be co-located on the same cache line. Even if a processor re-uses a data item, the item may no longer be in the cache due to an intervening access by another processor to another word in the same cache line.
Assuming the FORTRAN convention that arrays are allocated in column-major order, there is a significant amount of false sharing in our example, as shown in Figure 1(b). If the number of rows accessed by each processor is smaller than the number of words in a cache line, every cache line is shared by at least two processors. Each time one of these lines is accessed, unwanted data are brought into the cache. Also, when one processor writes part of the cache line, that line is invalidated in the other processor's cache. This particular combination of computation mapping and data layout will result in poor cache performance.
Another problematic characteristic of data caches is that they typically have a small set-associativity; that is, each memory location can only be cached in a small number of cache locations. Conflict misses occur whenever different memory locations contend for the same cache location. Since each processor only operates on a subset of the data, the addresses accessed by each processor may be distributed throughout the shared address space.
Consider what happens to the example in Figure 1(b) if the arrays are of size and the target machine has a direct-mapped cache of size 64KB. Assuming that REALs are 4B long, the elements in every 16th column will map to the same cache location and cause conflict misses. This problem exists even if the caches are set-associative, given that existing caches usually only have a small degree of associativity.
As shown above, the cache performance of multiprocessor code depends on how the computation is distributed as well as how the data are laid out. Instead of simply obeying the data layout convention used by the input language (e.g. column-major in FORTRAN and row-major in C), we can improve the cache performance by customizing the data layout for the specific program. We observe that multiprocessor cache performance problems can be minimized by making the data accessed by each processor contiguous in the shared address space, an example of which is shown in Figure 1(c). Such a layout enhances spatial locality, minimizes false sharing and also minimizes conflict misses.
The importance of optimizing memory subsystem performance for multiprocessors has also been confirmed by several studies of hand optimizations on real applications. Singh et al. explored performance issues on scalable shared address space architectures; they improved cache behavior by transforming two-dimensional arrays into four-dimensional arrays so that each processor's local data are contiguous in memory. Torrellas et al. and Eggers et al.[11,12] also showed that improving spatial locality and reducing false sharing resulted in significant speedups for a set of programs on shared-memory machines. In summary, not only must we minimize sharing to achieve efficient parallelization, it is also important to optimize for the multi-word cache line and the small set associativity. The cache behavior depends on both the computation mapping and the data layout. Thus, besides choosing a good parallelization scheme and a good computation mapping, we may also wish to change the data structures in the program.