CSE613: Parallel Programming

Homework #3

Task 1. [ 120 Points ] Distributed-Memory Matrix Multiplication.

(a) [ 50 Points ] Implement the three distributed-memory algorithms for multiplying two square matrices shown in Figures 1–3. Assume the initial distribution of the input matrices and the final distribution of the output matrix from Lecture 13.
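For intuition about the rotate-rotate scheme of Figure 1 (a Cannon-style algorithm), the communication pattern can be simulated sequentially. The sketch below is not the required MPI implementation: it assumes a p × p process grid with p dividing n, and uses plain Python lists in place of distributed blocks. The helper names (`split_blocks`, `cannon`, etc.) are illustrative, not from the assignment.

```python
def mat_mul(A, B):
    """Naive n x n matrix product on lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_add(A, B):
    n = len(A)
    return [[A[i][j] + B[i][j] for j in range(n)] for i in range(n)]

def split_blocks(M, p):
    """Split an n x n matrix into a p x p grid of (n/p) x (n/p) blocks."""
    n = len(M); b = n // p
    return [[[row[j*b:(j+1)*b] for row in M[i*b:(i+1)*b]] for j in range(p)] for i in range(p)]

def join_blocks(Bl):
    """Inverse of split_blocks."""
    p = len(Bl); b = len(Bl[0][0])
    return [[Bl[i][j][r][c] for j in range(p) for c in range(b)] for i in range(p) for r in range(b)]

def cannon(A, B, p):
    """Sequentially simulate Cannon-style rotate-rotate multiplication on a p x p grid."""
    Ab, Bb = split_blocks(A, p), split_blocks(B, p)
    # Initial skew: shift row i of A's blocks left by i, column j of B's blocks up by j.
    Ab = [[Ab[i][(j + i) % p] for j in range(p)] for i in range(p)]
    Bb = [[Bb[(i + j) % p][j] for j in range(p)] for i in range(p)]
    b = len(Ab[0][0])
    Cb = [[[[0] * b for _ in range(b)] for _ in range(p)] for _ in range(p)]
    for _ in range(p):
        # Local multiply-accumulate at every grid position (done in parallel in MPI).
        for i in range(p):
            for j in range(p):
                Cb[i][j] = mat_add(Cb[i][j], mat_mul(Ab[i][j], Bb[i][j]))
        # Rotate A's blocks left along rows and B's blocks up along columns.
        Ab = [[Ab[i][(j + 1) % p] for j in range(p)] for i in range(p)]
        Bb = [[Bb[(i + 1) % p][j] for j in range(p)] for i in range(p)]
    return join_blocks(Cb)
```

In the MPI version each block lives on one rank, and the two rotations become nearest-neighbor sends/receives on a 2-D process grid (e.g., via MPI_Cart_shift and MPI_Sendrecv_replace).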

(b) [ 10 Points ] Use your implementations from part 1(a) to multiply two 2^k × 2^k matrices (initialized with random integers in [−100, 100]) on 2^l × 2^l compute nodes with 1 process/node, for 10 ≤ k ≤ 14 and 0 ≤ l ≤ 2. Report the running times and explain your findings.

(c) [ 10 Points ] Repeat part 1(b) with t processes/node, where t is the number of cores available on a compute node. Report the running times. Compare with part 1(b) and explain.

(d) [ 20 Points ] Suppose a master node initially holds the input matrices and will hold the final output matrix. Augment your fastest implementation from part 1(a) with efficient routines for the initial distribution and final collection of the matrices. When you measure the running time of this algorithm, please include the time needed for these additional distribution/collection steps.
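One way to reason about the distribution/collection routines is to simulate, sequentially, the packing the master would do before scattering blocks (e.g., with MPI_Scatterv or point-to-point sends) and the unpacking after gathering them back. The row-major rank-to-block mapping and the helper names below are illustrative assumptions, not part of the assignment:

```python
def pack_blocks(M, p):
    """Pack an n x n matrix into one contiguous payload per rank of a p x p grid.

    Assumes p divides n and that rank r = i * p + j owns block (i, j).
    """
    n = len(M); b = n // p
    payloads = []
    for i in range(p):
        for j in range(p):
            payloads.append([M[i*b + r][j*b:(j+1)*b] for r in range(b)])
    return payloads

def unpack_blocks(payloads, p):
    """Inverse of pack_blocks: reassemble the full matrix from per-rank payloads."""
    b = len(payloads[0])
    return [[payloads[i*p + j][r][c] for j in range(p) for c in range(b)]
            for i in range(p) for r in range(b)]
```

Since blocks are not contiguous in the row-major matrix, the MPI version typically either packs them into a send buffer as above or describes them with a derived datatype (e.g., MPI_Type_create_subarray) so the library can do the packing.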

(e) [ 10 Points ] Repeat part 1(b) with the algorithm from part 1(d).

(f) [ 10 Points ] Repeat part 1(c) with the algorithm from part 1(d).

(g) [ 10 Points ] Compare your results from parts 1(b, c) with those from parts 1(e, f). Explain your findings.

Task 2. [ 80 Points ] Distributed-Shared-Memory Matrix Multiplication.

(a) [ 30 Points ] Modify your fastest implementation from part 1(a) and its modified version from part 1(d) to use a shared-memory parallel matrix multiplication algorithm inside each process. Use your fastest shared-memory parallel matrix multiplication routine from HW1.

(b) [ 40 Points ] Repeat part 1(b) with the two implementations from part 2(a). Use 1 process/node, but inside each process use all cores available on that node.

(c) [ 10 Points ] Compare your results from parts 1(b, c, e, f) with those from part 2(b). Explain your findings.


Figure 1: Distributed matrix multiplication using block rotations for both input matrices.

Figure 2: Distributed matrix multiplication using block rotations for one input matrix and block broadcasts for the other.

Figure 3: Distributed matrix multiplication using block broadcasts for both input matrices.
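The broadcast-broadcast pattern of Figure 3 can likewise be simulated sequentially. The sketch below is SUMMA-like and purely for intuition: the p × p grid, the helper names, and the assumption that p divides n are ours, and the real version replaces the inner loops with MPI broadcasts along process rows and columns.

```python
def mat_mul(A, B):
    """Naive n x n matrix product on lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def split_blocks(M, p):
    """Split an n x n matrix into a p x p grid of (n/p) x (n/p) blocks."""
    n = len(M); b = n // p
    return [[[row[j*b:(j+1)*b] for row in M[i*b:(i+1)*b]] for j in range(p)] for i in range(p)]

def join_blocks(Bl):
    """Inverse of split_blocks."""
    p = len(Bl); b = len(Bl[0][0])
    return [[Bl[i][j][r][c] for j in range(p) for c in range(b)] for i in range(p) for r in range(b)]

def summa(A, B, p):
    """Sequentially simulate broadcast-broadcast multiplication on a p x p grid."""
    Ab, Bb = split_blocks(A, p), split_blocks(B, p)
    b = len(Ab[0][0])
    Cb = [[[[0] * b for _ in range(b)] for _ in range(p)] for _ in range(p)]
    for k in range(p):
        # Step k: the owner of A block (i, k) broadcasts it along process row i,
        # and the owner of B block (k, j) broadcasts it along process column j.
        for i in range(p):
            for j in range(p):
                prod = mat_mul(Ab[i][k], Bb[k][j])
                for r in range(b):
                    for c in range(b):
                        Cb[i][j][r][c] += prod[r][c]
    return join_blocks(Cb)
```

A convenient property of this scheme is that no block of C ever moves: each grid position only accumulates A(i,k) · B(k,j) locally across the p broadcast steps.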


APPENDIX 1: What to Turn in

One compressed archive file (e.g., zip, tar.gz) containing the following items.

– Source code, makefiles and job scripts for both tasks.

– A PDF document containing all answers.

APPENDIX 2: Things to Remember

– Please never run anything that takes more than a minute or uses multiple cores on login nodes. Doing so may result in account suspension. All runs must be submitted as jobs to compute nodes (even when you use Cilkview or PAPI).

– Please store all data in your work folder ($WORK), and not in your home folder ($HOME).
