Procedure calls and loops are two control constructs that are commonly implemented in ways that introduce unnecessarily serializing constraints. These constraints can be eliminated by alternate implementation techniques and simple transformations such as procedure inlining and loop unrolling. Our study tries to model an upper bound of these techniques.
Procedure calls and returns introduce unnecessary control flow and cause problems for branch prediction. Furthermore, a new stack frame is typically allocated and deallocated in every procedure by incrementing and decrementing the stack pointer. Since there is a true data dependence between every increment and decrement, these stack pointer manipulations must be executed sequentially. For programs with many small procedures, this may limit the parallelism. To simulate the optimal case of inlining all procedures, including recursive procedures, we ignore all call and return instructions in a trace, as well as all instructions that manipulate the position of the stack pointer.
Parallelism in many loops is inhibited by dependences on the loop index variables. Loop index variables and induction variables are incremented in every loop iteration, creating true data dependences between iterations. However, these dependences are only a result of the way code is generated for scalar processors and are not an inherent part of the loop semantics. Loop unrolling is a simple technique that compilers use to reduce such effects.
In our experiment, we simulate perfect and complete unrolling. First, we analyze the object code to discover the loops in the program. We then use iterative data flow analysis to identify registers that are incremented by a constant exactly once per loop iteration. We only check the registers and not memory locations because we assume that the compiler has allocated the index variables in registers. If the index variables could not be register allocated, the loop index update is unlikely to be in the critical path of the loop execution. Finally, the analysis marks all instructions that increment loop index and induction variables, comparisons of loop indices with loop invariant values, and branches based on the results of such comparisons. These instructions are ignored when they occur in the trace.
Admittedly, many more transformations could be applied. A sophisticated compiler may be able to translate a program into an equivalent one that is more amenable to parallelization. This experiment does not, by any means, establish the limit of parallelism for the problem solved by a program. Finding such a limit is an undecidable problem. This experiment only establishes the limits of parallelism for the particular code generated by the compiler and subject to all of the assumptions in our experiment. If the resulting limits are considered inadequate, other techniques not considered in this study must be used.