Parallel algorithms / interactive lesson

DNS Matrix Multiplication

Multiply two n×n matrices using n³ processors on a 3-D hypercube. Watch how two broadcast phases deliver operands to every processor, a single local multiply produces partial products, and two reduction phases accumulate the final result — all in Θ(log n) parallel time.

University Guided lesson

Step 1/70 / Data layout

3-D hypercube grid · drag to rotate · scroll to zoom

A[i][k]B[k][j]partialC[i][j]k-tower

0 / Data layout

0 messages

Initial data layout: A in plane j=0, B in plane i=0

Each of the 3³ = 27 processors is assigned coordinates (i, k, j). Matrix A is stored in the j=0 face: processor P(i,k,0) owns A[i][k]. Matrix B is stored in the i=0 face: processor P(0,k,j) owns B[k][j]. Processors with both i≠0 and j≠0 hold no data yet.

processors

messages

0/9

C ready

Input/Output matrices

A matrix — stored in j=0 face

B matrix — stored in i=0 face

Track result element C[i][j]

Select a C element to highlight its k-tower of processors in the graph.

Communication this step

Communication: none. This is the initial data mapping.

Key insight

The coordinate assignment is not arbitrary: i maps to the result row, k to the dot-product index, j to the result column. This alignment makes the subsequent broadcasts and reduction a simple sweep along a single axis each time.

Complexity & isoefficiency

T₁ = n³ , T_p = Θ(log n) , speedup S = n³/log nEfficiency: E = 1/log nTopology: 3-D hypercube, p = n³Isoefficiency: W = Θ(p log p)

About the algorithm

The DNS algorithm (Dekel, Nassimi, Sahni, 1981) multiplies two n×n matrices using n³ processors arranged in a 3-D grid. Each processor P(i,k,j) computes one partial product A[i][k]·B[k][j]. Two broadcast phases distribute the data, one local multiply computes the partial products, and two reduction phases accumulate the final results. The entire computation takes only Θ(log n) parallel time.

Why a 3-D grid?

A 2-D grid of n² processors would require each processor to handle an entire dot product of length n. The 3-D grid adds a third dimension k that parallelises the dot product itself, reducing per-processor work to a single multiply. The k-axis then carries the reduction back to the output plane k=0.

Hypercube topology

DNS is analysed on a hypercube with p = n³ processors. Each axis of the 3-D grid corresponds to a set of hypercube dimensions. Broadcasting along one axis and reducing along another each take Θ(log n) steps on the hypercube, giving total communication cost Θ(log n) — logarithmic in matrix size regardless of n.

Isoefficiency & scalability

Efficiency E = 1/log n decreases as n grows. To maintain constant efficiency while scaling processors p = n³, the problem size W = n³ must grow as Θ(p log p^(1/3)). DNS is not perfectly efficient but is highly practical: for realistic matrix sizes the log n overhead is small compared to the n³ serial work.

Code companion

Connect each visual decision to an implementation pattern.

// p = q³ processors, q = n
// P(i,k,j) executes:

// Phase 1 – Broadcast A along j-axis
if j == 0: send A[i][k] to P(i,k,1..q-1)
else:      receive A[i][k] from P(i,k,0)

// Phase 2 – Broadcast B along i-axis
if i == 0: send B[k][j] to P(1..q-1,k,j)
else:      receive B[k][j] from P(0,k,j)

// Phase 3 – Local multiply
partial = A[i][k] * B[k][j]

// Phase 4 – Reduce along k-axis to k=0
reduce_sum(partial) → P(i,0,j)
// P(i,0,j) owns C[i][j]