While software-controlled prefetching schemes require support from both hardware and software, several schemes have been proposed that are strictly hardware-based. Porterfield evaluated several cacheline-based hardware prefetching schemes. In some cases they were quite effective at reducing miss rates, but at the same time they often increased memory traffic substantially. Lee proposed an elaborate lookahead scheme for prefetching in a multiprocessor where all shared data is uncacheable. He found that the effectiveness of the scheme was limited by branch prediction and by synchronization. Baer and Chen proposed a scheme that uses a history buffer to detect strides. In their scheme, a ``look ahead PC'' speculatively walks through the program ahead of the normal PC using branch prediction. When the look ahead PC finds a matching stride entry in the table, it issues a prefetch. They evaluated the scheme in a memory system with a 30 cycle miss latency and reported good results.
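To make the idea of stride detection with a history buffer concrete, the following is a minimal sketch, not a reproduction of Baer and Chen's design. The names `StrideEntry`, `StrideTable`, and `observe` are hypothetical, the table is unbounded rather than a fixed-size hardware structure, and the look ahead PC is omitted: predictions here are made at the normal PC when an access is observed.

```python
from dataclasses import dataclass


@dataclass
class StrideEntry:
    """History for one load instruction (hypothetical layout)."""
    last_addr: int
    stride: int = 0
    confirmed: bool = False


class StrideTable:
    """Toy stride-detection table indexed by the PC of the load."""

    def __init__(self):
        self.entries = {}  # PC -> StrideEntry; unbounded, unlike real hardware

    def observe(self, pc, addr):
        """Record one access; return an address to prefetch, or None."""
        entry = self.entries.get(pc)
        if entry is None:
            # First time this instruction is seen: nothing to predict yet.
            self.entries[pc] = StrideEntry(last_addr=addr)
            return None
        stride = addr - entry.last_addr
        # A stride is trusted only after the same nonzero stride repeats.
        entry.confirmed = stride != 0 and stride == entry.stride
        entry.stride = stride
        entry.last_addr = addr
        return addr + stride if entry.confirmed else None


table = StrideTable()
for address in (100, 108, 116, 200):
    print(hex(0x40), address, table.observe(0x40, address))
```

In this sketch a prefetch is suggested only on the third access of a constant-stride sequence, and the prediction is dropped as soon as the stride changes, which illustrates why such tables handle constant-stride accesses but little else.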
Hardware-based prefetching schemes have two main advantages over software-based schemes: (1) they have better dynamic information, and therefore can recognize things such as unexpected cache conflicts that are difficult to predict in the compiler, and (2) they do not add any instruction overhead to issue prefetches.
However, hardware-based schemes have several important disadvantages. The primary difficulty is detecting the memory access patterns. The only case where the hardware does reasonably well is constant-stride accesses. However, for the types of applications where constant-stride accesses are dominant, the compiler is quite successful at understanding the access patterns, as we have shown. Additionally, in the future our compiler will be able to prefetch complex access patterns such as indirection, which the hardware will not be able to recognize. Hardware-based schemes also suffer from a limited scope. Branch prediction is successful for speculating across a few branches, but when memory latency is on the order of hundreds of cycles, there is little hope of predicting that many branches correctly. Also, hardware-based schemes must be ``hard-wired'' into the processor. For a commercially available microprocessor that is targeted for many different memory systems, this lack of flexibility can be a serious limitation, not only in terms of tuning for different memory latencies, but also because a prefetching scheme that is appropriate for a uniprocessor may be entirely inappropriate for a multiprocessor. Finally, while hardware-based schemes have no software cost, they may have a significant hardware cost, both in terms of chip area and possibly gate delays.
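The limited-scope argument above can be illustrated with a rough back-of-the-envelope model. The sketch below rests on assumptions that are not taken from the text: each branch is predicted correctly with the same independent probability, and a branch occurs once every fixed number of cycles. The function names and the specific numbers (90% accuracy, one branch every 5 cycles) are hypothetical and chosen only for illustration.

```python
def branches_to_cover(latency_cycles, cycles_per_branch):
    """Branches a lookahead must cross to run a full miss latency ahead."""
    # Ceiling division: a partial interval still requires crossing a branch.
    return -(-latency_cycles // cycles_per_branch)


def lookahead_success_probability(accuracy, branches):
    """Chance that every one of `branches` predictions is correct,
    assuming independent predictions with equal accuracy."""
    return accuracy ** branches


# Hypothetical parameters: 90% per-branch accuracy, a branch every 5 cycles.
for latency in (30, 300):
    n = branches_to_cover(latency, 5)
    p = lookahead_success_probability(0.9, n)
    print(f"latency={latency} cycles: {n} branches, P(all correct)={p:.4f}")
```

Under these assumed numbers, looking 30 cycles ahead succeeds roughly half the time, while looking 300 cycles ahead almost never stays on the correct path, which is consistent with the qualitative claim that branch prediction cannot cover latencies of hundreds of cycles.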