Various hardware and software approaches to improve the memory performance have been proposed recently. A promising technique to mitigate the impact of long cache miss penalties is software-controlled prefetching. Software-controlled prefetching requires support from both hardware and software. The processor must provide a special ``prefetch'' instruction. The software uses this instruction to inform the hardware of its intent to use a particular data item; if the data is not currently in the cache, the data is fetched in from memory. The cache must be lockup-free; that is, the cache must allow multiple outstanding misses. While the memory services the data miss, the program can continue to execute as long as it does not need the requested data. While prefetching does not reduce the latency of the memory access, it hides the memory latency by overlapping the access with computation and other accesses. Prefetches on a scalar machine are analogous to vector memory accesses on a vector machine. In both cases, memory accesses are overlapped with computation and other accesses. Furthermore, similar to vector registers, prefetching allows caches in scalar machines to be managed by software. A major difference is that while vector machines can only operate on vectors in a pipelined manner, scalar machines can execute arbitrary sets of scalar operations well.
Another useful memory hierarchy optimization is to improve data locality by reordering the execution of iterations. One important example of such a transform is blocking. Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices or blocks, so that data loaded into the faster levels of the memory hierarchy are reused. Other useful transformations include unimodular loop transforms such as interchange, skewing and reversal. Since these optimizations improve the code's data locality, they not only reduce the effective memory access time but also reduce the memory bandwidth requirement. Memory hierarchy optimizations such as prefetching and blocking are crucial to turn high-performance microprocessors into effective scientific engines.