This paper focuses on translating the concept of prefetching into real performance. Software-controlled prefetching not only incurs an instruction overhead, but can also increase the load on the memory subsystem. It is important to reduce the prefetch overhead by eliminating prefetches for data already in the cache. We have developed an algorithm that identifies those references that are likely to be cache misses, and only issues prefetches for them.
Our experiments show that our algorithm can greatly improve performance, for some programs by as much as a factor of two. We also demonstrate that our algorithm is significantly better than an algorithm that prefetches indiscriminately: it reduces the total number of prefetches issued without significantly reducing the coverage of cache misses. Finally, our experiments show that software prefetching can complement blocking in achieving better overall performance.
Future microprocessors, with their even faster computation rates, must provide support for memory hierarchy optimizations. We advocate that the architecture provide lockup-free caches and prefetch instructions. More complex hardware prefetching appears unnecessary.