To experiment with prefetching, we extend our base R4000 architecture (described previously in Section 1.1) as follows. We augment the instruction set to include a prefetch instruction that uses a base-plus-offset addressing format and is defined to not take any memory exceptions. Both levels of the cache are lockup-free in the sense that multiple prefetches can be outstanding. The primary cache is checked in the cycle the prefetch instruction is executed. If the line is already in the cache, the prefetch is discarded. Otherwise, the prefetch is sent to a prefetch issue buffer, which is a structure that maintains the state of outstanding prefetches. For our study, we assume a rather aggressive design of a prefetch issue buffer that contains 16 entries. If the prefetch issue buffer is already full, the processor is stalled until there is an available entry. (Later, in Section 4.3, we compare this with an architecture where prefetches are simply dropped if the buffer is full.) The secondary cache is also checked before the prefetch goes to memory. We model contention for the memory bus by assuming a maximum pipelining rate of one access every 20 cycles. Once the prefetched line returns, it is placed in both levels of the cache hierarchy. Filling the primary cache requires 4 cycles of exclusive access to the cache tags-during this time, the processor cannot execute any loads or stores.
Since regular cache misses stall the processor, they are given priority over prefetch accesses both for the memory bus and the cache tags. We assume, however, that an ongoing prefetch access cannot be interrupted. As a result, a secondary cache miss may be delayed by as many as 20 cycles (memory pipeline occupancy time) when it tries to access memory. Similarly the processor may be stalled for up to 4 cycles (cache-tag busy time) when it executes a load or store. If a cache miss occurs for a line for which there is an outstanding prefetch waiting in the issue buffer, the miss is given immediate priority and the prefetch request is removed from the buffer. If the prefetch has already been issued to the memory system, any partial latency hiding that might have occurred is taken into account.
As we mentioned before in Section 1, the benchmarks evaluated in this paper are all scientific applications taken from the SPEC, SPLASH and NAS Parallel benchmark suites. For four of the benchmarks (MXM, CFFT2D, VPENTA and TOMCATV), we manually changed the alignment of some of the matrices to reduce the number of cache conflicts.
The prefetching algorithm has a few compile-time parameters, which we consistently set as follows: cache line size = 32 bytes, effective cache size = 500 bytes, and prefetch latency = 300 cycles. As discussed in Section 2.1.2, we choose an effective cache size to be a fraction of the actual size (8 Kbytes) as a first approximation to the effects of cache conflicts (we consider the effects of varying this parameter in Section 4.2). The prefetch latency indicates to the compiler how many cycles in advance it should try to prefetch a reference. The prefetch latency is larger than 75 cycles, the minimum miss-to-memory penalty, to account for bandwidth-related delays.