The results of our first set of experiments are shown in Figure 3 and Table 2. Figure 3 shows the overall performance improvement achieved through our selective prefetching algorithm. For each benchmark, the two bars correspond to the cases with no prefetching (N) and with selective prefetching (S). In each bar, the bottom section is the amount of time spent executing instructions (including instruction overhead of prefetching), and the section above that is the memory stall time. For the prefetching cases, there is also a third component: stall time due to memory overheads caused by prefetching. Specifically, the stall time corresponds to two situations: (1) when the processor attempts to issue a prefetch but the prefetch issue buffer is already full, and (2) when the processor attempts to execute a load or store when the cache tags are already busy with a prefetch fill.
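The three-way breakdown described above can be sketched as a simple cycle-accounting model. The numbers below are purely hypothetical and are not taken from Figure 3; they only illustrate how the bars decompose and how a net speedup emerges even when prefetching adds instruction and stall overhead.

```python
def total_time(instr_cycles, mem_stall_cycles, prefetch_stall_cycles=0):
    """Execution time = instruction time (incl. prefetch instruction
    overhead) + memory stall time + stalls from prefetching memory
    overheads (full issue buffer, busy cache tags)."""
    return instr_cycles + mem_stall_cycles + prefetch_stall_cycles

# Hypothetical benchmark, in cycles:
# N bar: no prefetching.
no_pf = total_time(instr_cycles=100, mem_stall_cycles=80)
# S bar: prefetch instructions inflate instruction time, add a small
# prefetch-overhead stall, but remove most of the memory stall.
sel_pf = total_time(instr_cycles=110, mem_stall_cycles=15,
                    prefetch_stall_cycles=5)

speedup = no_pf / sel_pf  # 180 / 130, i.e. ~38% faster
```

The model makes the tradeoff explicit: the overhead terms grow the first and third components, but the net effect is still a large reduction because the memory-stall component shrinks far more.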
As shown in Figure 3, the speedup in overall performance ranges from 5% to 100%, with 6 of the 13 benchmarks improving by over 45%. The memory stall time is significantly reduced in all the cases. Table 2 indicates that this is accomplished by reducing both the primary miss rate and the average primary-miss penalty. The miss penalty is reduced because even if a prefetched line is replaced from the primary cache before it can be referenced, it is still likely to be present in the secondary cache. Also, the miss latency may be partially hidden if the miss occurs while the prefetch access is still in progress. Overall, 50% to 90% of the original memory stall cycles are eliminated.
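Since memory stall time is the product of the miss rate and the average miss penalty, reducing both factors compounds. A small sketch with hypothetical values (not taken from Table 2) shows how modest improvements in each factor can eliminate a large fraction of the stall cycles:

```python
def memory_stall_cycles(refs, miss_rate, avg_miss_penalty):
    """Memory stall time = references x miss rate x average miss penalty."""
    return refs * miss_rate * avg_miss_penalty

refs = 1_000_000  # hypothetical number of primary-cache references

# Without prefetching: 5% primary miss rate, 30-cycle average penalty.
orig_stall = memory_stall_cycles(refs, 0.05, 30)
# With prefetching: lower miss rate, and a smaller penalty because
# "missed" lines often still hit in the secondary cache or are
# already in flight.
new_stall = memory_stall_cycles(refs, 0.02, 12)

fraction_eliminated = 1.0 - new_stall / orig_stall  # 0.84 here
```

Under these assumed numbers 84% of the stall cycles disappear, which falls inside the 50% to 90% range reported above; the point is that neither factor alone needs a dramatic reduction.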
Having established the benefits of prefetching, we now focus on the costs. Figure 3 shows that the instruction overhead of prefetching causes less than a 15% increase in instruction count in over half of the benchmarks. In fact, in two of those cases (MXM and IS) the number of instructions actually decreased due to savings through loop unrolling. In other cases (CHOLSKY, BTRIX, VPENTA, TOMCATV, OCEAN), the number of instructions increased by 25% to 50%. Finally, the stalls due to prefetching memory overhead are typically small, never more than 15% of original execution time. In each case, we observe that the overheads of prefetching are low enough compared to the gains that the net improvement remains large. In the following subsections, we present a detailed evaluation of each aspect of our selective prefetching algorithm.