The selective prefetching algorithm uses several compile-time parameters to model the behavior of the memory subsystem. Specifically, these parameters include the following: (1) the cache line size, (2) whether unknown loop bounds are assumed to be large or small, (3) the effective cache size, and (4) the prefetch latency. The most concrete of these parameters is the cache line size, which can be set precisely. The other parameters, however, are more heuristic in nature. To evaluate the robustness of our algorithm, we measured the effects of varying these less obvious parameters.
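The four parameters can be pictured as a single compile-time configuration. The following sketch groups them into one record; the field names, types, and defaults are illustrative only (the cache line size shown is a placeholder, not necessarily the evaluation machine's), though the 500-byte effective cache size and 300-cycle latency match the defaults used in our experiments.

```python
from dataclasses import dataclass

# Hypothetical grouping of the compile-time parameters described above.
# Names and the cache_line_size default are illustrative, not the
# compiler's actual interface.
@dataclass
class PrefetchModelParams:
    cache_line_size: int = 32                # bytes; can be set precisely
    assume_unknown_loops_small: bool = True  # policy for unresolved loop bounds
    effective_cache_size: int = 500          # bytes; small fraction of the 8-Kbyte cache
    prefetch_latency: int = 300              # cycles; includes bandwidth-related slack

params = PrefetchModelParams()
print(params.effective_cache_size)  # 500
```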
The SUIF compiler performs aggressive interprocedural constant propagation to statically determine as many loop bounds as possible. When this fails, our current algorithm assumes unknown loop counts to be small, which tends to overestimate what remains in the cache. When the compiler instead assumes unknown iteration counts to be large, it produces identical code for 11 of the 13 benchmarks; the two benchmarks that change are OCEAN and MG. For OCEAN, the difference in performance is negligible. However, MG performs 4% worse with the large-loop policy, as shown in Figure 6(a). In this case, the benefit of the extra prefetches is more than offset by increased instruction overhead. Although assuming small loop counts happens to be better in this case, the opposite could easily be true in programs where unknown iteration counts are actually large. One solution may be to resolve loop counts through profiling feedback. However, the more interesting result of this experiment is how rarely loop bound uncertainty affects performance. We observe that while nearly half the benchmarks contain loops with unknown bounds, in most cases this has no impact on locality, due to the patterns of data reuse within those loops.
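The effect of the two policies can be sketched as follows. This is a simplified model, not the compiler's actual analysis: the placeholder constants for "small" and "large" trip counts are assumptions, and the real locality analysis reasons about reuse per loop nest rather than a single footprint product.

```python
# Illustrative placeholder trip counts; the compiler's actual
# internal constants are not specified here.
ASSUMED_SMALL = 8
ASSUMED_LARGE = 10_000

def estimated_trip_count(known_count, assume_small=True):
    """Resolve an unknown loop bound to a heuristic constant.

    A small assumed count keeps the estimated data footprint small, so
    reuse across iterations is judged to stay in the cache (fewer
    prefetches issued); a large assumed count does the opposite.
    """
    if known_count is not None:
        return known_count
    return ASSUMED_SMALL if assume_small else ASSUMED_LARGE

def loop_fits_in_cache(bytes_per_iter, known_count,
                       effective_cache_size=500, assume_small=True):
    trips = estimated_trip_count(known_count, assume_small)
    return bytes_per_iter * trips <= effective_cache_size

# An unknown-bound loop touching 16 bytes per iteration is judged
# cache-resident under the small-loop policy but not under the large one,
# so the large-loop policy issues extra prefetches for it.
print(loop_fits_in_cache(16, None, assume_small=True))   # True  (16*8 = 128 <= 500)
print(loop_fits_in_cache(16, None, assume_small=False))  # False
```

This makes the MG result concrete: under the large-loop policy, loops like the one above lose their assumed locality, and the resulting extra prefetches add instruction overhead without enough benefit to pay for it.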
For our experiments so far, the effective cache size has been set to 500 bytes, which is only a small fraction of the actual cache size (8 Kbytes). When the effective cache size is set to the full 8 Kbytes, our compiler generates identical code for 7 of the 13 benchmarks. For 5 of the 6 benchmarks that do change, the difference in performance is negligible. The one case that changes significantly is CFFT2D, as shown in Figure 6(b). In this case, fewer prefetches are issued with the larger effective cache size. However, the prefetches that are eliminated happen to be useful, since they fetch data that is replaced due to cache conflicts. As a result, the performance suffers, as we see in Figure 6(b). (Note that this is in contrast with the effect we see in Figure 6(a), where issuing more prefetches hurts performance.) In the case of CFFT2D, many critical loops reference 2 Kbytes of data, and these loops happen to suffer from cache conflicts. An effective cache size of 500 bytes produces the desired result in this case. Overall, however, the results are robust with respect to effective cache size.
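The CFFT2D case can be illustrated numerically with a minimal sketch of the capacity check. The check itself is a simplification of the locality analysis (which is an assumption of this sketch, not a statement of the compiler's exact test), but the 2-Kbyte footprint and the two effective cache sizes come from the text above.

```python
def reuse_survives(footprint_bytes, effective_cache_size):
    """Simplified capacity heuristic: temporal reuse is trusted (and the
    prefetch suppressed) only if the data touched between reuses fits
    within the effective cache size."""
    return footprint_bytes <= effective_cache_size

# A CFFT2D-style critical loop referencing 2 Kbytes between reuses:
footprint = 2 * 1024
print(reuse_survives(footprint, 500))       # False -> reuse not trusted, prefetch issued
print(reuse_survives(footprint, 8 * 1024))  # True  -> prefetch suppressed, but cache
                                            #          conflicts evict the data anyway
```

The 500-byte setting is conservative enough that these conflict-prone loops still get prefetched, which is why the smaller effective cache size wins here.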
Finally, for our experiments in Section 4.1, we set the prefetch latency to 300 cycles. We chose a value greater than 75 cycles to account for bandwidth-related delays. To evaluate whether this was a good choice, we compiled each benchmark again using prefetch latencies of 100 and 1000 cycles. In nearly all the cases, the impact on performance is small. In many cases, the 100-cycle case is slightly worse than the 300-cycle case due to bandwidth-related delays. The most interesting case is CHOLSKY, as shown in Figure 6(c). In this case, prefetched data tends to be replaced from the cache shortly after it arrives, so ideally it should arrive ``just in time''. Therefore, the lowest prefetch latency (100 cycles) offers the best performance, as we see in Figure 6(c). However, in such cases the best approach may be to eliminate the cache conflicts that cause this behavior.
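The latency parameter matters because it sets how many iterations ahead prefetches are scheduled. A common software-pipelining formulation, shown below as a sketch (the 36-cycle loop body is an invented example, not a measured benchmark value), divides the assumed latency by the cycles per loop iteration:

```python
import math

def prefetch_distance(prefetch_latency, cycles_per_iteration):
    """Iterations ahead to prefetch so data arrives in time:
    ceil(latency / cycles per iteration). The latency argument is the
    compile-time parameter varied here (100, 300, or 1000 cycles)."""
    return math.ceil(prefetch_latency / cycles_per_iteration)

# For an illustrative loop body of 36 cycles:
print(prefetch_distance(100, 36))   # 3 iterations ahead
print(prefetch_distance(300, 36))   # 9
print(prefetch_distance(1000, 36))  # 28
```

A larger assumed latency pushes prefetches further ahead of their use, which is exactly why conflict-prone loops like those in CHOLSKY favor the 100-cycle setting: data fetched too early is evicted before it is needed.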
In summary, the performance of our selective algorithm was affected noticeably in only one of the 13 benchmarks for each parameter we varied. Overall, the algorithm appears to be quite robust.