
Depthwise Folding

If we feed the inputs at half the full rate, as shown in Figure 2.2, and allow the data to propagate, several observations can be made. After four clock cycles, the bottom half of the array is unused. After eight clock cycles, the top half of the array is unused. These cases are shown in Figure 2.3. Since only half the hardware is in use at any given time, and all the PEs are identical, it should be possible to implement the system using only half the number of processors.

One way to do this is illustrated in Figure 2.4. Only half as many processors are used, and the outputs of each processor are fed back to the inputs of the array through multiplexers. If the input multiplexers are set to "data input" for four clock cycles, and then switched to "feedback input" for four clock cycles, the array will perform exactly as before, but with half the throughput rate and half the number of PEs. The PEs must switch tasks as appropriate to guarantee that the data is processed properly. The latency remains unchanged.
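The multiplexer schedule described above can be sketched as a small cycle-by-cycle simulation. This is only an illustrative model under stated assumptions, not the implementation from the figures: the array is reduced to a linear pipeline of eight rows folded onto four physical rows, each row is represented by one function, and the names `full_array`, `folded_array`, and `stages` are hypothetical.

```python
def full_array(samples, stages):
    """Reference model: every sample passes through all rows in order."""
    results = []
    for x in samples:
        for stage in stages:
            x = stage(x)
        results.append(x)
    return results


def folded_array(samples, stages):
    """Depthwise-folded model: half the rows, each sample makes two passes."""
    depth = len(stages)
    assert depth % 2 == 0, "this sketch folds an even number of rows in half"
    rows = depth // 2
    # Output register of each physical row: (value, pass_index) or None.
    regs = [None] * rows
    pending = list(samples)
    outputs = []
    cycle = 0
    while pending or any(r is not None for r in regs):
        tail = regs[-1]
        # A value that has completed its second pass leaves the array.
        if tail is not None and tail[1] == 1:
            outputs.append(tail[0])
            tail = None
        # Input multiplexer: "data input" for `rows` cycles,
        # then "feedback input" for `rows` cycles, repeating.
        if cycle % depth < rows:
            head = (pending.pop(0), 0) if pending else None
        else:
            head = (tail[0], 1) if tail is not None else None
        # Each row takes its input from the row above (row 0 from the mux).
        # The task a row performs depends on which pass the data is on.
        inputs = [head] + regs[:-1]
        regs = [
            None if item is None
            else (stages[item[1] * rows + r](item[0]), item[1])
            for r, item in enumerate(inputs)
        ]
        cycle += 1
    return outputs


# Hypothetical row operations; order matters, so task switching is exercised.
stages = [lambda x, k=k: 2 * x + k for k in range(8)]
samples = list(range(10))
print(folded_array(samples, stages) == full_array(samples, stages))
```

In this model each sample still spends eight cycles in flight (two passes of four rows), while new samples are accepted during only four of every eight cycles, which matches the constant latency and halved throughput described above.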

This kind of vertical "folding," called depthwise folding, can be performed to make the processor array arbitrarily small vertically. The properties of depthwise folding are:





Robert French