Multiply two n×n matrices using n³ processors on a 3-D hypercube. Watch how two broadcast phases deliver operands to every processor, a single local multiply produces partial products, and two reduction phases accumulate the final result — all in Θ(log n) parallel time.
University
Guided lesson
Step 1/70 / Data layout
3-D hypercube grid · drag to rotate · scroll to zoom
A[i][k]B[k][j]partialC[i][j]k-tower
0 / Data layout
0 messages
Initial data layout: A in plane j=0, B in plane i=0
Each of the 3³ = 27 processors is assigned coordinates (i, k, j). Matrix A is stored in the j=0 face: processor P(i,k,0) owns A[i][k]. Matrix B is stored in the i=0 face: processor P(0,k,j) owns B[k][j]. Processors with both i≠0 and j≠0 hold no data yet.
About the algorithm
The DNS algorithm (Dekel, Nassimi, Sahni, 1981) multiplies two n×n matrices using n³ processors arranged in a 3-D grid. Each processor P(i,k,j) computes one partial product A[i][k]·B[k][j]. Two broadcast phases distribute the data, one local multiply computes the partial products, and two reduction phases accumulate the final results. The entire computation takes only Θ(log n) parallel time.
Why a 3-D grid?
A 2-D grid of n² processors would require each processor to handle an entire dot product of length n. The 3-D grid adds a third dimension k that parallelises the dot product itself, reducing per-processor work to a single multiply. The k-axis then carries the reduction back to the output plane k=0.
Hypercube topology
DNS is analysed on a hypercube with p = n³ processors. Each axis of the 3-D grid corresponds to a set of hypercube dimensions. Broadcasting along one axis and reducing along another each take Θ(log n) steps on the hypercube, giving total communication cost Θ(log n) — logarithmic in matrix size regardless of n.
Isoefficiency & scalability
Efficiency E = 1/log n decreases as n grows. To maintain constant efficiency while scaling processors p = n³, the problem size W = n³ must grow as Θ(p log p^(1/3)). DNS is not perfectly efficient but is highly practical: for realistic matrix sizes the log n overhead is small compared to the n³ serial work.
Code companion
Connect each visual decision to an implementation pattern.
// p = q³ processors, q = n// P(i,k,j) executes:// Phase 1 – Broadcast A along j-axisifj==0:sendA[i][k]toP(i,k,1..q-1)else:receiveA[i][k]fromP(i,k,0)// Phase 2 – Broadcast B along i-axisifi==0:sendB[k][j]toP(1..q-1,k,j)else:receiveB[k][j]fromP(0,k,j)// Phase 3 – Local multiplypartial=A[i][k]*B[k][j]// Phase 4 – Reduce along k-axis to k=0reduce_sum(partial) → P(i,0,j)// P(i,0,j) owns C[i][j]