In various examples, the processing engine array 510 uses systolic execution, in which data arrives at each processing engine 511 from different directions at regular intervals. In some examples, input data can flow into the processing engine array 510 from the left and weight values can be loaded at the top. In some examples weights and input data can flow from the left and partial sums can flow from top to bottom. In these and other examples, a multiply-and-accumulate operation moves through the processing engine array 510 as a diagonal wave front, with data moving to the right and down across the array. Control signals can be input at the left at the same time as weights 506, and can flow across and down along with the computation.
In various implementations, the number of columns in the processing engine array 510 determines the computational capacity of the processing engine array 510, and the number of rows determines the required memory bandwidth for achieving maximum utilization of the processing engine array 510. The processing engine array 510 can have, for example, 64 columns and 428 rows, or some other number of columns and rows.