I have shown how to reduce the hardware required from to
PEs, but we can reduce it still further. Suppose this time that
we feed the left half of the inputs in on the first eight clock
cycles, and the right half of the inputs on the second eight clock
cycles, as shown in Figure 2.5. Except for a data
dependency between the nodes on the line dividing the array in half,
the two halves operate independently, and are not in use at the same
time. We can use short term memory to store the results that cross
the dividing line, and then the array can be executed with half the
number of processors as was done earlier. Again the PEs must switch
tasks as necessary. This kind of folding is called widthwise
folding.
The properties of widthwise folding are:
Depthwise folding and widthwise folding can be combined arbitrarily to execute the desired array with as little as a single PE. This brings us to the first major property of an origami array:
The amount of time required to execute an origami array is inversely proportional to the number of processors available, assuming the width and height of the desired array are perfect multiples of the width and height of the available processor array.