In the last decade, microprocessor speeds have improved steadily at a rate of 50% to 100% per year, while memory access times have improved at only about 7% per year. A common technique for bridging this gap between processor and memory speeds is to employ one or more levels of caches. However, it has been notoriously difficult to use caches effectively for numeric applications. In fact, several past machines designed for scientific computation, such as the Cray C90, the Cydrome Cydra-5, and the Multiflow Trace, were built without caches. Given that the processor-memory gap continues to widen, exploiting the memory hierarchy is critical to achieving high performance on modern architectures.
Recent work has shown that code transformations to improve cache performance can significantly improve uniprocessor performance [9,34]. Making effective use of the memory hierarchy on multiprocessors is even more important to performance, but also more difficult. This is true for bus-based shared address space machines [11,12], and even more so for scalable shared address space machines such as the Stanford DASH multiprocessor, the MIT ALEWIFE machine, Kendall Square's KSR-1, and the Convex Exemplar. In these architectures, the memory on remote processors constitutes yet another level of the memory hierarchy, and the differences in access times between cache, local memory, and remote memory can be very large. For example, on the DASH multiprocessor, the ratio of access times between the first-level cache, second-level cache, local memory, and remote memory is roughly 1:10:30:100. It is thus important to minimize the number of accesses to all the slower levels of the memory hierarchy.