If we feed the inputs at half the full rate, as shown in Figure 2.2,
and allow the data to propagate, several
observations can be made. After four clock cycles, the bottom half of
the array is unused. After eight clock cycles, the top half of the
array is unused. These cases are shown in
Figure 2.3. Thus, since only half the hardware
is in use at any given time, and all the PEs are identical, it should
be possible to implement the system using only half the number of
processors.
One way to do this is illustrated in Figure 2.4. Only half as many processors are used, and the outputs of each processor are fed back to the inputs of the array through multiplexers. If the input multiplexers are set to ``data input'' for four clock cycles and then switched to ``feedback input'' for four clock cycles, the array will perform exactly as before, but with half the throughput rate and half the number of PEs. The PEs must switch tasks at the appropriate times to guarantee that the data is processed correctly. The latency remains unchanged.
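The folding scheme just described can be checked with a small behavioral simulation. The sketch below is illustrative only and rests on several assumptions that are not taken from the text: the array is reduced to a single column of eight PEs, each PE performs one task per clock cycle, the hypothetical function stage_op stands in for the real PE operations, and each folded PE selects its task from a cycle counter. Under these assumptions, run_full models the original column and run_folded models the half-size column with a feedback multiplexer at its input.

```python
from typing import List, Optional

DEPTH = 8           # PEs in the unfolded column (assumed for illustration)
HALF = DEPTH // 2   # PEs remaining after one depthwise fold

Word = Optional[int]  # None models an idle slot (no data present)


def stage_op(task: int, x: int) -> int:
    """Hypothetical PE task; each task index computes something different."""
    return 3 * x + task


def pe_step(task: int, x: Word) -> Word:
    """One PE for one clock cycle; idle slots pass through as None."""
    return None if x is None else stage_op(task, x)


def run_full(inputs: List[Word]) -> List[Word]:
    """Unfolded column: DEPTH PEs, PE k always performs task k."""
    regs: List[Word] = [None] * DEPTH
    outputs: List[Word] = []
    for x in inputs:
        outputs.append(regs[-1])  # value leaving the last PE this cycle
        regs = [pe_step(0, x)] + [pe_step(k, regs[k - 1]) for k in range(1, DEPTH)]
    return outputs


def run_folded(inputs: List[Word]) -> List[Word]:
    """Folded column: HALF PEs whose last output is fed back through a mux."""
    regs: List[Word] = [None] * HALF
    outputs: List[Word] = []
    for cycle, x in enumerate(inputs):
        feedback = cycle % DEPTH >= HALF  # mux setting for this cycle
        if feedback:
            if x is not None:
                raise ValueError("input must be idle during the feedback phase")
            head = regs[-1]           # first-pass result re-enters the array
            outputs.append(None)
        else:
            head = x                  # fresh data enters the array
            outputs.append(regs[-1])  # second-pass result leaves the array

        def task(k: int) -> int:
            # Data now in PE k entered PE 0 at cycle - k; the mux setting
            # at that time tells PE k which of its two tasks to perform.
            second_pass = (cycle - k) % DEPTH >= HALF
            return k + HALF if second_pass else k

        regs = [pe_step(task(0), head)] + [
            pe_step(task(k), regs[k - 1]) for k in range(1, HALF)
        ]
    return outputs


# Half-rate input: four words, then four idle cycles, plus idle cycles to flush.
stream: List[Word] = [1, 2, 3, 4] + [None] * 4 + [5, 6, 7, 8] + [None] * 4 + [None] * DEPTH
assert run_full(stream) == run_folded(stream)
```

In this model both versions produce the same words on the same clock cycles, which matches the claim that the latency is unchanged. The line that selects the task shows why each PE must switch tasks at a different time: PE k changes over k cycles after the multiplexer does.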
This kind of vertical ``folding,'' called depthwise folding, can be applied to make the processor array arbitrarily small in the vertical dimension. The properties of depthwise folding are: