Next: Loop Splitting Up: Performance of Prefetching Previous: Performance of Prefetching

Locality Analysis

The goal of using locality analysis is to eliminate prefetching overhead by prefetching only those references that cause cache misses. To evaluate the effectiveness of this analysis, we performed the following experiment. We implemented a compiler algorithm which prefetches all data references within our domain (i.e. array references whose indices are affine functions of loop indices). We refer to this algorithm as indiscriminate (as opposed to selective) prefetching. The indiscriminate algorithm uses the same software pipelining technique as selective prefetching to schedule prefetches far enough in advance. However, it has no need for locality analysis or loop splitting.

The results of this experiment are shown in Figure 4 and in Tables 3 and 4. The ideal selective prefetch algorithm would achieve the same level of memory stall reduction while decreasing the overheads associated with issuing unnecessary prefetches.

Figure 4 shows that the speedup offered by prefetching selectively rather than indiscriminately ranges from 1%to 100%. In 6 of the 13 cases, the speedup is greater than 20%. Table 3 shows that most of the benchmarks sacrifice very little in terms of memory stall reduction by prefetching selectively. On the other hand, Figure 4 shows that indiscriminate prefetching suffers more from both increased instruction overhead and stress on the memory subsystem. Overall, selective prefetching is effective. In some cases (CFFT2D and MG), selectiveness even turns prefetching from a performance loss into a performance gain.

We evaluate the selective algorithm in more detail by using the following two concepts. The coverage factor is the fraction of original misses that are prefetched. A prefetch is unnecessary if the line is already in the cache or is currently being fetched. An ideal prefetching scheme would provide a coverage factor of 100%and would generate no unnecessary prefetches.

Table 4 contains several statistics about the effectiveness of the two prefetching algorithms. This includes the percentage of prefetches issued that are unnecessary, and a breakdown of the impact of prefetching on the original cache misses. This breakdown contains three categories: (1) those that are prefetched and subsequently hit in the primary cache (pf-hit), (2) those that are prefetched but remained misses (pf-miss), and (3) those that are not prefetched (nopf-miss). The coverage factor is equal to the sum of the pf-hit and pf-miss categories.

The coverage factor of the indiscriminate case is interesting, since it represents the fraction of cache misses that are within the domain of our analysis in most cases. Looking at Table 4, we notice that in 5 of the 13 cases the coverage factor is well over 90%, but in 5 of the other cases it is less than 50%. In the case of CG, the coverage is only 38%because it is a sparse matrix algorithm and we are only prefetching the dense index arrays. The same is true for IS. MXM (a blocked matrix multiplication kernel) is a surprising case, since all of the important references are obviously affine, yet the coverage factor is only 65%. This is a result of the way we account for prefetches in our experiment; we associate a prefetch only with the very next reference to the same cache line. Suppose the algorithm issues two prefetches for the same line followed by references to two consecutive words in that cache line, we say that the first reference is prefetched but not the second. In the case of MXM, cache conflicts between accesses to different arrays can cause the second access to the same cache line to miss. Similar behavior also occurs in TOMCATV, OCEAN, and MG. Finally, in the cases of EMIT and EP, many of the remaining cache misses occur in library routines rather than the programs themselves.

Table 4 shows that a large fraction of prefetches issued under the indiscriminate scheme are unnecessary. In all but one case, this fraction ranged from 60%to 95%. These unnecessary prefetches can lead to large instruction overheads (MG) or significant delays due to a saturated memory subsystem (CFFT2D and CG).

A selective algorithm is successful if it can maintain a similar coverage while lowering the number of unnecessary prefetches. Table 4 shows that in 11 of the 13 cases, the coverage is reduced by less than 10%. Table 3 also supports this by showing that the miss rates have not increased substantially, and the reduction in memory stall cycles is comparable. In the cases where the coverage did go down, the problem is typically due to the presence of cache conflicts. Comparing the percentages of unnecessary prefetches in Table 4, we see that the improvement from selective prefetching is dramatic in many cases (CFFT2D, CHOLSKY, IS, EP, MG). (Note that these percentages are computed with respect to the number of prefetches issued, which changes between the two cases.)

The advantage of selective prefetching is summarized by the ratio of indiscriminate to selective prefetches in Table 4. Prefetching selectively can reduce the number of prefetches by as much as a factor of 21. At the same time, the coverage factor remains competitive. Overall, this selection process appears to be quite successful.

Next: Loop Splitting Up: Performance of Prefetching Previous: Performance of Prefetching

Robert French