The potential for parallelism can be evaluated from two different perspectives. The first is to propose an actual system design; the performance of such a system is a lower bound on the amount of available parallelism. The second is to perform studies on the limits of parallelism for a particular approach by enforcing only the constraints associated with the approach and relaxing all other constraints. If the upper bound of performance is found to be unacceptably low, then the approach is inadequate. On the other hand, if the upper bound is within expectations, the only conclusion is that the approach may be sufficient. It is only from low limits that we can draw interesting conclusions.
A recent study by Wall  contains a surprising result that suggests a severe limitation in current approaches. One of the experiments in that study examines the performance of a processor that uses an aggressive hardware algorithm to predict the outcomes of branches. Instructions along the predicted path of a branch are speculatively executed before the branch is resolved. The processor has perfect memory disambiguation, perfect register renaming, unlimited instruction fetching, and a very large number of functional units. The reported speedups for this processor on a set of non-numeric programs range from 4.1 to 7.4. Given the assumptions of perfect memory disambiguation and a large number of functional units, these speedups are quite small.
Wall's result contrasts sharply with experiments that assume perfect branch prediction . Since a machine with perfect branch prediction requires foreknowledge, we refer to such a machine as an oracle. The effects of control flow on parallelism are essentially eliminated on an oracle machine because all of the branch outcomes are known in advance. Much more parallelism is available on an oracle machine, suggesting that the bottleneck in Wall's experiment is due to control flow.
The ultimate goal of our study is to discover ways to increase parallelism by an order of magnitude beyond current approaches. In this paper, we focus on control flow. Through a study of the limits of parallelism, we hope to establish the inadequacy of current approaches in handling control flow and identify promising new directions.
This paper offers evidence that the limit of parallelism discussed in Wall's paper is likely to apply to many more non-numeric programs than those measured. A processor that uses speculation to only exploit local parallelism found between mispredicted branches is fundamentally limited. A compiler can locate more global parallelism in the code by extracting the true control dependence of the program. However, this increased opportunity for parallelization cannot be fully exploited unless the machine can simultaneously pursue multiple flows of control. The degree of parallelism is fundamentally limited by the von Neumann architecture. Higher performance can be obtained with machines that can simultaneously follow multiple flows of control.
In this paper, we analyze the techniques of speculative execution, control dependence analysis, and following multiple flows of control. We evaluate these techniques empirically by computing the limits of parallelism for a set of programs on machines that employ different combinations of these techniques. So far, research on superscalar and multiprocessor architectures has mostly been conducted independently. This work studies how instruction and processor levels of parallelism interact. Our study also suggests that a useful characteristic for predicting the parallelism in a program is whether the control flow is data dependent.
We first introduce the major concepts of the paper in Section 2. We explain speculative execution, control dependence analysis, and pursuing multiple flows of control. Section 3 presents a set of abstract machine models that use these three techniques in various combinations. In Section 4, we describe the experimental framework used to evaluate the limits of parallelism. Section 5 presents the results of these experiments and analyzes the parallelism for each machine model. Finally, Section 6 presents the conclusions of our study and describes its implications for some specific architectures that have been proposed and implemented.