Porterfield  presented a compiler algorithm for inserting prefetches. He implemented it as a preprocessing pass that inserted prefetching into the source code. His initial algorithm prefetched all array references in inner loops one iteration ahead. He recognized that this scheme was issuing too many unnecessary prefetches, and presented a more intelligent scheme based on dependence vectors and overflow iterations. Since the simulation occurred at a fairly abstract level, the prefetching overhead was estimated rather than presented. Overall performance numbers were not presented. Also, the more sophisticated scheme was not automated (the overflow iterations were calculated by hand) and did not take cache line reuse into account.
Klaiber and Levy  extended the work by Porterfield by recognizing the need to prefetch more than a single iteration ahead. They included several memory system parameters in their equation for how many iterations ahead to prefetch, and inserted prefetches by hand at the assembly-code level. The results were presented in terms of average memory access latency rather than overall performance. Also, in contrast with our study, they proposed prefetching into a separate fetchbuffer rather than directly into the cache. However, our results have demonstrated that prefetching directly into the cache can provide impressive speedups, and without the disadvantage of sacrificing cache size to accommodate a fetchbuffer.
Gornish, Granston and Veidenbaum  presented an algorithm for determining the earliest time when it is safe to prefetch shared data in a multiprocessor with software-controlled cache coherency. This work is targeted for a block prefetch instruction, rather than the single-line prefetches considered in this paper.
Chen et al.  considered prefetching of non-scientific code. They attempt to move address generation back as far as possible before loads to hide a small cache miss latency (10 cycles). The paper also suggested that prefetch buffers would be helpful for reducing the effects of cache pollution. It is considerably more difficult to generate addresses early in non-scientific code where the access patterns can be very irregular. At the same time, these applications tend to make better use of the cache in a uniprocessor.