Since prefetching hides rather than reduces latency, it can only improve performance if additional memory bandwidth is available. This is because prefetching does not decrease the number of memory accesses; it simply tries to perform them over a shorter period of time. Therefore, if a program is already memory-bandwidth limited, it is impossible for prefetching to increase performance. Locality optimizations such as cache blocking, however, actually decrease the number of accesses to memory, thereby reducing both latency and required bandwidth. Therefore, the best approach for coping with memory latency is to first reduce it as much as possible, and then hide whatever latency remains. Our compiler can do both things automatically by first applying locality optimizations and then inserting prefetches.
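The bandwidth argument can be illustrated with a rough first-order timing model. The sketch below is not taken from the evaluation: the `workload` structure, its field names, and the assumption that ideal prefetching overlaps all miss latency with computation are hypothetical simplifications chosen only to make the reasoning concrete.

```c
#include <assert.h>

/* Hypothetical first-order model; all quantities are in cycles. */
typedef struct {
    double compute_cycles;      /* cycles of useful work */
    double misses;              /* number of accesses that go to memory */
    double miss_latency;        /* stall cycles per miss when not overlapped */
    double cycles_per_transfer; /* minimum memory-system cycles per miss (inverse bandwidth) */
} workload;

static double max2(double x, double y) { return x > y ? x : y; }

/* Lower bound imposed by memory bandwidth alone. */
double bandwidth_bound(workload w) {
    assert(w.misses >= 0.0 && w.cycles_per_transfer >= 0.0);
    return w.misses * w.cycles_per_transfer;
}

/* No prefetching: every miss stalls the processor for its full latency. */
double time_without_prefetch(workload w) {
    return max2(w.compute_cycles + w.misses * w.miss_latency,
                bandwidth_bound(w));
}

/* Idealized prefetching: latency is fully overlapped with computation,
   but the number of transfers, and hence the bandwidth bound, is unchanged. */
double time_with_ideal_prefetch(workload w) {
    return max2(w.compute_cycles, bandwidth_bound(w));
}
```

In this model, a workload whose bandwidth bound already dominates gains nothing from prefetching, whereas reducing `misses` (the effect of a locality optimization) lowers both the stall term and the bandwidth bound.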
We compiled each of the benchmarks with the locality optimizer enabled. In two of the cases (GMTRY and VPENTA), there was a significant improvement in locality. Both of those cases are presented in Figure 8. In the figure, we show the three original performance bars (seen previously in Figure 4) as well as three new cases which include locality optimization by itself and in combination with the two prefetching schemes.
In the case of GMTRY, the locality optimizer is able to block the critical loop nest. With this locality optimization alone, 90% of the original memory stall time is eliminated. Comparing blocking with prefetching, we see that blocking had better overall performance than prefetching in this case. Although prefetching reduces more of the memory stall cycles, blocking has the advantage of not suffering any of the instruction or memory overhead of prefetching. Comparing the prefetching schemes before and after blocking, we see that blocking has improved the performance of both schemes. One reason is that memory overheads associated with prefetching have been eliminated with blocking since less memory bandwidth is consumed. Also, the selective prefetching scheme reduces its instruction overhead by recognizing that blocking has occurred and thereby issuing fewer prefetches. The best performance overall occurs with blocking, both alone and in combination with selective prefetching.
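For readers unfamiliar with blocking, the transformation can be sketched on a generic kernel. The example below is not the GMTRY loop nest; it uses matrix multiplication as a stand-in, and the function names and the `block` parameter are hypothetical. The blocked version reuses a `block` x `block` tile of `b` across all rows of `a` before moving on, so the tile is fetched from memory far less often.

```c
#include <assert.h>
#include <stddef.h>

/* Unblocked reference: for each row i, the whole of b is streamed through the cache. */
void matmul_naive(const double *a, const double *b, double *c, size_t n) {
    for (size_t x = 0; x < n * n; x++) c[x] = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++)
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Blocked version: the k and j loops are tiled so that one tile of b
   stays cache-resident while every row i is processed against it. */
void matmul_blocked(const double *a, const double *b, double *c,
                    size_t n, size_t block) {
    assert(block > 0);
    for (size_t x = 0; x < n * n; x++) c[x] = 0.0;
    for (size_t kk = 0; kk < n; kk += block) {
        size_t k_end = (kk + block < n) ? kk + block : n;
        for (size_t jj = 0; jj < n; jj += block) {
            size_t j_end = (jj + block < n) ? jj + block : n;
            for (size_t i = 0; i < n; i++) {
                for (size_t k = kk; k < k_end; k++) {
                    double aik = a[i * n + k];
                    for (size_t j = jj; j < j_end; j++)
                        c[i * n + j] += aik * b[k * n + j];
                }
            }
        }
    }
}
```

The two functions perform the same arithmetic; only the order in which memory is touched differs. In practice the tile size would be chosen so that a tile fits in the cache, which is an assumption left to the caller here.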
For VPENTA, the locality optimizer introduces spatial locality for every reference in the inner loop by interchanging two of the surrounding loops. So rather than missing on every iteration, the references only miss when they cross cache line boundaries (every fourth iteration). With this locality optimization alone, the performance improves significantly. However, the selective prefetching scheme without this optimization still performs better, since it manages to eliminate almost all memory stall cycles. Comparing the prefetching schemes before and after the loop interchange, we see that the indiscriminate prefetching scheme changes very little while the selective prefetching scheme improves considerably. The selective scheme improves because it recognizes that after loop interchange it only has to issue one-fourth as many prefetches. Consequently, it is able to reduce its instruction overhead accordingly. The best overall performance, by a substantial margin, comes only through the combination of both locality optimization and prefetching.
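The effect of loop interchange on prefetch count can be sketched with a simple two-deep loop nest. This is an illustration only, not the VPENTA code: the names, the matrix size `N`, and the prefetch distance `DIST` are hypothetical, the array is assumed to start on a cache-line boundary, and `__builtin_prefetch` is a GCC/Clang builtin standing in for whatever prefetch instruction the target provides. Each function also counts the prefetches it issues so the two schemes can be compared.

```c
#include <assert.h>
#include <stddef.h>

enum {
    N = 64,          /* hypothetical matrix dimension */
    LINE_ELEMS = 4,  /* elements per cache line, matching "every fourth iteration" */
    DIST = 16        /* hypothetical prefetch distance, in elements */
};

static_assert(N % LINE_ELEMS == 0, "rows must hold a whole number of cache lines");

/* Before interchange: the inner loop strides by N elements, so
   consecutive accesses fall on different cache lines. */
double sum_before_interchange(const double *a) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i * N + j];
    return s;
}

/* After interchange, indiscriminate scheme: one prefetch per reference. */
double sum_indiscriminate(const double *a, size_t *prefetches) {
    double s = 0.0;
    *prefetches = 0;
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
            size_t idx = i * N + j;
            if (idx + DIST < (size_t)N * N) {  /* stay inside the array */
                __builtin_prefetch(&a[idx + DIST]);
                ++*prefetches;
            }
            s += a[idx];
        }
    }
    return s;
}

/* After interchange, selective scheme: the inner loop is strip-mined by
   the line size and only the first element of each line is prefetched. */
double sum_selective(const double *a, size_t *prefetches) {
    double s = 0.0;
    *prefetches = 0;
    for (size_t i = 0; i < N; i++) {
        for (size_t j0 = 0; j0 < N; j0 += LINE_ELEMS) {
            size_t idx = i * N + j0;
            if (idx + DIST < (size_t)N * N) {
                __builtin_prefetch(&a[idx + DIST]);
                ++*prefetches;
            }
            for (size_t j = j0; j < j0 + LINE_ELEMS; j++)
                s += a[i * N + j];
        }
    }
    return s;
}
```

With four elements per line, the selective version issues one prefetch where the indiscriminate version issues four, which is the source of the reduced instruction overhead described above.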
Finally, we would like to note that this is the first time that experimental results have been presented where both locality optimization and prefetch insertion have been handled fully automatically by the compiler. The results have demonstrated the complementary interactions that can occur between locality optimizations and prefetching. Locality optimizations help prefetching by reducing the amount of data that needs to be prefetched, and prefetching helps locality optimizations by hiding any latency that could not be eliminated.