The basic simulation algorithm is to determine the execution time of each instruction in a trace. The completion time of the last instruction to execute is the total execution time for the trace. The resulting parallelism is the ratio of the sequential execution time to the parallel execution time. Since we want to measure the actual parallelism and not the speedup for a realistic machine, we use one clock cycle latencies for all instructions. With non-unit latencies, some of the parallelism would be used to fill pipeline bubbles. The instructions removed because of our perfect inlining and perfect unrolling do not contribute to the sequential time, so the parallelism does not include any speedup due to removing those instructions.
Our experiment allows instructions arbitrarily far apart in the program trace to execute in parallel. This is necessary to detect parallelism that could be exposed by aggressive compiler transformations, such as loop interchange. Since the simulator cannot record the data dependences in a limited scheduling window, it records the time of the most recent write to each register and memory location. A large hash table is used to record writes to memory. Each bucket in this table may contain entries for several different locations, and additional space is allocated if a bucket overflows.
For each instruction, the simulator first determines when the operands are available. This is straightforward for register operands. For memory operands, we read the actual address from the trace and check the hash table for the time of the last write to that address. Next the simulator determines when the control flow constraints are satisfied. For the BASE machine, the execution time of the most recent branch in the trace is recorded, and all subsequent instructions must wait until that time. The implementations of the control flow constraints for the other machine models are described below. Given the constraints of control flow and operand availability, the instruction execution time is the minimum time satisfying all of these constraints. Finally, the simulator records when the instruction result is written. For store instructions, the actual address of the destination is read from the trace, and the time is entered into the hash table.