This paper proposes a compiler algorithm to insert prefetch instructions in scientific code. In particular, we focus on those numerical algorithms that operate on dense matrices. Various algorithms have previously been proposed for this problem . In this work, we improve upon previous algorithms and evaluate our algorithm in the context of a full optimizing compiler. We also study the interaction of prefetching with other data locality optimizations such as cache blocking.
There are a few important concepts useful for developing prefetch algorithms. Prefetches are possible only if the memory addresses can be determined ahead of time. Prefetches are unnecessary if the data are already in the cache (or are being fetched into the cache) at the time of prefetch. Even if a necessary prefetch is issued, it may not be effective; it may be too early and the prefetched datum is replaced from the cache before it is used; it may be too late and the datum is used before it has been fetched from memory.
The domain of this work is the set of array accesses whose indices are affine (i.e. linear) functions of the loop indices. A substantial number of data references in scientific code belong to this domain. These references have the property that their addresses can be determined ahead of time, and prefetching these locations is therefore possible.
While these accesses are responsible for many of the cache misses, a significant number of them are also hits, as shown in Table 1. This table suggests that if prefetches were issued for all the affine array accesses, then over 60%of the prefetches would be unnecessary for most of the programs. It is important to minimize unnecessary prefetches. Unnecessary prefetches incur a computation overhead due to the prefetches themselves and instructions needed to calculate the addresses. Prefetching can also increase the demand for memory bandwidth, which can result in delays for normal memory accesses as well as prefetches.
We have developed a compiler algorithm that uses locality analysis to selectively prefetch only those references that are likely to cause cache misses. To schedule the prefetch instructions early enough, we use the concept of software pipelining, where the computation of one iteration overlaps with the prefetches for a future iteration.
We have implemented our prefetch algorithm in the SUIF (Stanford University Intermediate Form) compiler. The SUIF compiler includes many of the standard optimizations and generates code competitive with the MIPS compiler. Using this compiler system, we have been able to generate fully functional and optimized code with prefetching. (For the sake of simulation, prefetch instructions are encoded as loads to R0.) By simulating the code with a detailed architectural model, we can evaluate the effect of prefetching on overall system performance. It is important to focus on the overall performance, because simple characterizations such as the miss rates alone are often misleading. We have also evaluated the interactions between prefetching and locality optimizations such as blocking.
The organization of the rest of the paper is as follows. Section 2 describes our compiler algorithm, Section 3 describes our experimental framework, and Section 4 presents our results. In Section 4, we evaluate the effectiveness of our compiler algorithm on two variations of hardware support, and characterize the robustness of the algorithm with respect to its compile-time parameters. We also evaluate the interaction between prefetching and blocking. Section 5 discusses related work, and includes a comparison between software and hardware prefetching techniques. Section 6 describes future work to improve our algorithm, and finally, we conclude in Section 7.