The scope of this compiler algorithm was limited to affine array accesses within scientific applications. By prefetching only the affine accesses, we covered over 90%of the cache misses for roughly half of the benchmarks, while the coverage was closer to 50%of the misses for the remaining benchmarks. In order to increase our coverage, we need to handle more complex access patterns. A good starting point would be sparse matrices, which would account for many of the remaining misses in CG.
Another problem we observed was the large number of conflicts that can occur in direct-mapped caches. These conflicts can have a severe impact on memory performance, regardless of whether prefetching is used. In the case of our algorithm, random conflicts make it even more difficult to predict the contents of the cache. For our experiments, we alleviated this problem by manually changing the alignment of some matrices in four of the applications. However, the burden of fixing such problems should not rest entirely on the programmer.
The use of profiling feedback might also improve the accuracy of our algorithm. As we mentioned earlier in Section 4.2, control-flow profiling information could be used to resolve unknown iteration counts. Also, techniques for profiling memory performance may give the compiler more insight into the caching behavior of the program. At one extreme, the exact miss rates of each reference could be measured-however, this would require simulating the application, which may be prohibitively expense. On the other hand, less expensive profiling techniques may provide the compiler with enough useful information, such as which loops suffer the greatest memory penalty. We are currently evaluating this tradeoff between the cost of collecting profiling information and benefit offered in terms of performance.