Parallel algorithms / interactive lesson

Cannon's Matrix Multiplication

Multiply two n×n matrices using p = q² processors on a q×q torus mesh. Watch the initial skew align data across the mesh, then follow q rounds of simultaneous local multiply-accumulate interleaved with nearest-neighbour shifts — all with zero global communication.

University Guided lesson

Step 1/110 / Initial layout

A rows rotate leftB columns rotate upactive multiplyC result

0 / Initial layout

Initial data layout

Each processor P(i,j) starts with its own block A[i][j] and B[i][j]. For C[i][j] = Σₖ A[i][k]·B[k][j] each processor needs q different pairs — that is exactly what the initial skew and the rotating shifts will provide.

Steps

Communication this step

No communication yet. Every processor owns A[i][j] and B[i][j] in local memory.

Key insight

Cannon's algorithm works on a q×q torus mesh (p = q² processors). Each processor will perform exactly q local multiply-accumulate operations, one per shift round. Serial work W = n³, parallel work Tp = n³/p.

Complexity (hypercube / torus)

T_p = n³/p + (q+1)*t_s + (q+1)*t_v*n²/p= Θ(n³/p + √p*t_s + n²/√p*t_v)T_init(skew) = 2*(q-1)*(t_s + t_v*w²)T_shift/round = t_s + t_v*w² (q-1 rounds)T_compute = n³/pIsoefficiency: W = Θ(p^1.5)Max processors: p = O(n²)

About the algorithm

Cannon's algorithm (Lynn Elliot Cannon, 1969) multiplies two n×n matrices using p = q² processors arranged in a q×q torus mesh. Each processor P(i,j) computes the single scalar C[i][j] = Σₖ A[i][k]·B[k][j] by performing q local multiply-accumulate rounds. After an initial skew, blocks rotate through the mesh one step at a time — A shifts left, B shifts up — so that every processor always holds the correct pair for the next round. The result is C[i][j] accumulating locally, with no reductions needed.

Torus mesh topology

Processors are connected in a q×q 2-D torus: each processor has four neighbours (left, right, up, down) with wrap-around edges. Communication is strictly nearest-neighbour — each shift moves data exactly one hop. This makes Cannon ideal for physical mesh networks and is why it appears in classic parallel computing textbooks alongside hypercube mappings.

Initial skew and rotating shifts

Before the first multiply, row i of A is cyclically shifted left by i positions, and column j of B is cyclically shifted up by j positions. This one-time skew aligns the data so that the correct A-block and B-block meet at every processor simultaneously. After each multiply round, all A-blocks shift left by one and all B-blocks shift up by one — simple, regular, and contention-free on the torus.

Complexity and isoefficiency (hypercube analysis)

Serial work: W = n³. Parallel time: Tₚ = n³/p + (q+1)·tₛ + (q+1)·tᵥ·(n/q)² = Θ(n³/p + √p·tₛ + n²/√p·tᵥ), where q = √p. The communication overhead grows as Θ(√p) in latency and Θ(n²/√p) in bandwidth. To maintain constant efficiency E while scaling p, the problem size must satisfy W = Θ(p^1.5) — this is the isoefficiency function. Maximum practical parallelism: p = O(n²) processors (one per scalar element).

Comparison with DNS

DNS uses p = n³ processors and achieves Tₚ = Θ(log n), but requires 3-D broadcast and reduce communication. Cannon uses only p = n² processors, performs strictly nearest-neighbour shifts, and achieves Tₚ = Θ(n³/p) = Θ(n) with much lower memory overhead (3·(n/q)² values per processor). Cannon is preferred when bandwidth is the bottleneck; DNS is preferred when minimising time is paramount.

Code companion

Connect each visual decision to an implementation pattern.

// q×q torus mesh, p = q² processors
// P(i,j) executes:

// Phase 1 – Initial skew of A
shift A left cyclically by i steps in row i

// Phase 2 – Initial skew of B
shift B up cyclically by j steps in column j

// Phase 3 – q rounds of multiply-shift
for round = 1 to q:
    C[i][j] += A_local * B_local     // local multiply-accumulate
    shift A_local left by 1           // torus ring shift
    shift B_local up by 1             // torus ring shift

// P(i,j) now holds C[i][j]