CSE613: Parallel Programming

Homework #1

Task 1. [ 100 Points ] Multiplying Matrices.

(a) [ 10 Points ] Figure 1 shows the standard iterative matrix multiplication algorithm and its 5 variants obtained by permuting the three nested for loops in all possible ways. Assume that all three matrices (X, Y and Z) are given in row-major order. Run each implementation on matrices of size 2^r × 2^r, where r is the largest integer such that none of the implementations takes more than five minutes to perform the multiplication. Tabulate all running times [2].
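For concreteness, a minimal C sketch of one variant (Iter-MM-ikj) wrapped in a wall-clock timer is given below; the size n, the flat row-major allocation, and the use of clock_gettime are assumptions, not requirements, and the other five variants differ only in the nesting order of the three loops. Per footnote 2, compile so that the compiler does not interchange the loops (e.g., GCC's -fno-loop-interchange, where supported).

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Z += X * Y for row-major n x n matrices, ikj loop order.
       The other five variants differ only in the nesting order. */
    void iter_mm_ikj(double *Z, const double *X, const double *Y, int n) {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    Z[i * n + j] += X[i * n + k] * Y[k * n + j];
    }

    int main(void) {
        int n = 1 << 10;                 /* placeholder; use n = 2^r here */
        double *X = malloc((size_t)n * n * sizeof *X);
        double *Y = malloc((size_t)n * n * sizeof *Y);
        double *Z = calloc((size_t)n * n, sizeof *Z);
        for (size_t i = 0; i < (size_t)n * n; i++) { X[i] = Y[i] = 1.0; }

        struct timespec t0, t1;          /* time only the multiplication */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        iter_mm_ikj(Z, X, Y, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("%.3f s\n",
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9);

        free(X); free(Y); free(Z);
        return 0;
    }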

(b) [ 10 Points ] Use PAPI [3] to find the L1/L2/L3 misses incurred by each implementation from part 1(a) when run on matrices of size 2^r × 2^r, where r is as defined in part 1(a). Tabulate your results.
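A minimal sketch of one way to do this with PAPI's preset events (assuming PAPI_L1_TCM, PAPI_L2_TCM and PAPI_L3_TCM are all available on the machine; the papi_avail utility will tell you):

    #include <stdio.h>
    #include <papi.h>

    /* Count total L1/L2/L3 cache misses around a region of code. */
    void count_misses(void (*kernel)(void)) {
        int events[3] = { PAPI_L1_TCM, PAPI_L2_TCM, PAPI_L3_TCM };
        long long counts[3];
        int es = PAPI_NULL;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return;
        PAPI_create_eventset(&es);
        PAPI_add_events(es, events, 3); /* may fail if an event is unsupported */

        PAPI_start(es);
        kernel();                       /* e.g., one of the six Iter-MM variants */
        PAPI_stop(es, counts);

        printf("L1 misses: %lld  L2 misses: %lld  L3 misses: %lld\n",
               counts[0], counts[1], counts[2]);
    }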

(c) [ 5 Points ] Compare and explain your findings from parts 1(a) and 1(b).

(d) [ 10 Points ] Take the two fastest implementations from part 1(a), and try to parallelize each simply by replacing one or more serial for loops with parallel for loops. In how many ways can you correctly parallelize each implementation? Now run each such parallel implementation on all cores [1] by varying n from 2^4 to 2^s (consider only powers of 2), where s is the largest integer such that none of the parallel implementations takes more than a minute to perform the multiplication. Tabulate or plot the running times.
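For illustration, one such parallelization in Cilk (assuming OpenCilk/Cilk Plus syntax): for the ijk order, the i and j loops each carry no race on Z, whereas parallelizing the k loop would let different iterations update the same Z[i, j] concurrently.

    #include <cilk/cilk.h>

    /* ijk order with the outer i loop parallelized: each iteration of i
       writes a disjoint row of Z, so there is no data race. */
    void par_mm_ijk(double *Z, const double *X, const double *Y, int n) {
        cilk_for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    Z[i * n + j] += X[i * n + k] * Y[k * n + j];
    }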

(e) [ 10 Points ] Run each parallel implementation from part 1(d) on a 2^r × 2^r input matrix by varying p from 1 to the maximum number of cores available on the machine [4], where r is as defined in part 1(a). Tabulate or plot the running times.
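With Cilk this sweep over p requires no recompilation; the number of worker threads can be set per run via an environment variable (see footnote 4), for example (program name and argument are placeholders):

    CILK_NWORKERS=8 ./par_mm 4096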

(f) [ 5 Points ] Explain your findings from parts 1(d) and 1(e).

(g) [ 30 Points ] Implement the in-place parallel recursive matrix multiplication algorithm shown in Figure 2 (Par-Rec-MM). But do not recurse down to matrices of size 1 × 1. Instead stop at a base case of size m × m for some m > 1 in order to reduce the overhead of recursion. When you reach a base case of size m × m use one of the fastest serial algorithms from part 1(a) for multiplying the two submatrices. For simplicity let's assume that both n and m are powers of 2. For n = 2^r (where r is as defined in part 1(a)), empirically find the value of m that gives you the smallest running time for Par-Rec-MM. Produce a table or a graph showing how the running time varies as you change m.

[1] We will not use hyperthreading in this assignment.

[2] When you implement and compile these you must make sure that the compiler does not change the order in which the loops are nested in each of them.

[3] PAPI is installed on both Stampede2 and Comet (please check by running "module avail"). You will have to first load PAPI (by running "module load") in order to use it.

[4] Cilk allows you to vary the number of worker threads (available cores) from the command line of your own program.


Iter-MM-ijk( Z, X, Y, n )

(Inputs are three n × n matrices X, Y and Z. For each i, j ∈ [1, n], Z[i, j] is set to Z[i, j] + Σ_{k=1}^n X[i, k] × Y[k, j].)

1. for i ← 1 to n do
2.     for j ← 1 to n do
3.         for k ← 1 to n do
4.             Z[i, j] ← Z[i, j] + X[i, k] × Y[k, j]

Iter-MM-ikj( Z, X, Y, n )

(Inputs are three n × n matrices X, Y and Z. For each i, j ∈ [1, n], Z[i, j] is set to Z[i, j] + Σ_{k=1}^n X[i, k] × Y[k, j].)

1. for i ← 1 to n do
2.     for k ← 1 to n do
3.         for j ← 1 to n do
4.             Z[i, j] ← Z[i, j] + X[i, k] × Y[k, j]

Iter-MM-jik( Z, X, Y, n )

(Inputs are three n × n matrices X, Y and Z. For each i, j ∈ [1, n], Z[i, j] is set to Z[i, j] + Σ_{k=1}^n X[i, k] × Y[k, j].)

1. for j ← 1 to n do
2.     for i ← 1 to n do
3.         for k ← 1 to n do
4.             Z[i, j] ← Z[i, j] + X[i, k] × Y[k, j]

Iter-MM-jki( Z, X, Y, n )

(Inputs are three n × n matrices X, Y and Z. For each i, j ∈ [1, n], Z[i, j] is set to Z[i, j] + Σ_{k=1}^n X[i, k] × Y[k, j].)

1. for j ← 1 to n do
2.     for k ← 1 to n do
3.         for i ← 1 to n do
4.             Z[i, j] ← Z[i, j] + X[i, k] × Y[k, j]

Iter-MM-kij( Z, X, Y, n )

(Inputs are three n × n matrices X, Y and Z. For each i, j ∈ [1, n], Z[i, j] is set to Z[i, j] + Σ_{k=1}^n X[i, k] × Y[k, j].)

1. for k ← 1 to n do
2.     for i ← 1 to n do
3.         for j ← 1 to n do
4.             Z[i, j] ← Z[i, j] + X[i, k] × Y[k, j]

Iter-MM-kji( Z, X, Y, n )

(Inputs are three n × n matrices X, Y and Z. For each i, j ∈ [1, n], Z[i, j] is set to Z[i, j] + Σ_{k=1}^n X[i, k] × Y[k, j].)

1. for k ← 1 to n do
2.     for j ← 1 to n do
3.         for i ← 1 to n do
4.             Z[i, j] ← Z[i, j] + X[i, k] × Y[k, j]

Figure 1: Standard iterative matrix multiplication algorithm and its variants.



(h) [ 10 Points ] Produce tables / plots similar to those in parts 1(d) and 1(e) for the algorithm in part 1(g) (with optimized base case). Compare the results with parts 1(d) and 1(e), and explain what you observe.

(i) [ 10 Points ] Use PAPI to find the total L1/L2/L3 misses incurred (across all cores) by the fastest implementation from part 1(d) as well as the Par-Rec-MM implementation from part 1(g) when run on matrices of size 2^r × 2^r, where r is as defined in part 1(a). Tabulate and explain your results.

Par-Rec-MM( Z, X, Y )

(Inputs are three n × n matrices X, Y and Z. For each i, j ∈ [1, n], Z[i, j] is set to Z[i, j] + Σ_{k=1}^n X[i, k] × Y[k, j]. We assume n to be a power of 2.)

1. if X is a 1 × 1 matrix then
2.     Z ← Z + X × Y
   else
3.     parallel: Par-Rec-MM( Z11, X11, Y11 ), Par-Rec-MM( Z12, X11, Y12 ),
                 Par-Rec-MM( Z21, X21, Y11 ), Par-Rec-MM( Z22, X21, Y12 )
4.     parallel: Par-Rec-MM( Z11, X12, Y21 ), Par-Rec-MM( Z12, X12, Y22 ),
                 Par-Rec-MM( Z21, X22, Y21 ), Par-Rec-MM( Z22, X22, Y22 )

Figure 2: Parallel recursive divide-and-conquer algorithm that multiplies two n × n matrices X and Y and adds the result to another n × n matrix Z, i.e., for each i, j ∈ [1, n], Z[i, j] is set to Z[i, j] + Σ_{k=1}^n X[i, k] × Y[k, j]. All entries of Z are initially set to zero. By Z11, Z12, Z21 and Z22 we denote the top-left, top-right, bottom-left and bottom-right quadrants of Z, respectively. Similarly for X and Y. We assume that n is a power of 2.
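For part 1(g), a minimal sketch of Par-Rec-MM with an m × m base case is given below, assuming Cilk spawn syntax, row-major storage with a fixed row stride ld, a tuning constant M standing in for m, and a serial kernel serial_mm taken from part 1(a); the quadrant pointer arithmetic mirrors Figure 2.

    #include <cilk/cilk.h>

    #define M 64   /* assumed base-case size; to be tuned empirically */

    void serial_mm(double *Z, const double *X, const double *Y,
                   int n, int ld);          /* a fast variant from part 1(a) */

    /* Z += X * Y for n x n quadrants embedded in row-major arrays of
       row stride ld. Mirrors Figure 2, but bottoms out at size M x M. */
    void par_rec_mm(double *Z, const double *X, const double *Y, int n, int ld) {
        if (n <= M) { serial_mm(Z, X, Y, n, ld); return; }
        int h = n / 2;                       /* quadrant size */
        double *Z11 = Z,                    *Z12 = Z + h,
               *Z21 = Z + (size_t)h * ld,   *Z22 = Z + (size_t)h * ld + h;
        const double *X11 = X,                  *X12 = X + h,
                     *X21 = X + (size_t)h * ld, *X22 = X + (size_t)h * ld + h;
        const double *Y11 = Y,                  *Y12 = Y + h,
                     *Y21 = Y + (size_t)h * ld, *Y22 = Y + (size_t)h * ld + h;

        /* Line 3 of Figure 2: four independent recursive calls. */
        cilk_spawn par_rec_mm(Z11, X11, Y11, h, ld);
        cilk_spawn par_rec_mm(Z12, X11, Y12, h, ld);
        cilk_spawn par_rec_mm(Z21, X21, Y11, h, ld);
                   par_rec_mm(Z22, X21, Y12, h, ld);
        cilk_sync;

        /* Line 4 of Figure 2: the remaining four calls. */
        cilk_spawn par_rec_mm(Z11, X12, Y21, h, ld);
        cilk_spawn par_rec_mm(Z12, X12, Y22, h, ld);
        cilk_spawn par_rec_mm(Z21, X22, Y21, h, ld);
                   par_rec_mm(Z22, X22, Y22, h, ld);
        cilk_sync;
    }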

Task 2. [ 100 Points ] Schedulers. This task asks you to implement the following three schedulers and compare their performance by running Par-Rec-MM from Figure 2 under each:

• distributed randomized work-stealing (DR-Steal),
• distributed randomized work-sharing (DR-Share), and
• centralized work-sharing (C-Share).

For the purposes of this task you can simply tailor each scheduler implementation to run Par-Rec-MM only instead of implementing their standalone general-purpose versions. To keep things simple use locks or atomic instructions to control accesses to shared data structures, and maintain, without locks, a global flag empty that keeps track of when the entire system runs out of work. The flag is initially set to False. The thread that completes the second sync of Par-Rec-MM (after Line 4 of Figure 2) sets that variable to True.
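A minimal sketch of that flag in C11 (the two helper function names are illustrative):

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Global "out of work" flag shared by all worker threads. No lock is
       needed: exactly one thread (the one completing the second sync of
       Par-Rec-MM) ever sets it to true; everyone else only reads it. */
    static atomic_bool empty = false;

    /* Called by the thread that completes the second sync. */
    static inline void signal_system_empty(void) { atomic_store(&empty, true); }

    /* Polled by idle threads deciding whether to keep looking for work. */
    static inline bool system_is_empty(void) { return atomic_load(&empty); }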


In case of the DR-Steal scheduler, each thread will have its own task deque. Whenever a thread runs out of work it first checks its own deque for tasks. If the deque is nonempty it extracts the bottommost task from it and starts working on that. If the deque is empty the thread becomes a thief and tries to steal the topmost task from the deque of a thread chosen uniformly at random. The thread will continue with its steal attempts until either one succeeds or it is sure (or reasonably sure) that the entire system has run out of work.

Under the DR-Share scheduler, too, each thread will have its own task deque. Whenever a thread spawns new tasks it puts each of them (except the one it wants to execute itself) at the top of the deque of a thread chosen uniformly at random. Whenever a thread runs out of work it only looks for tasks in its own deque. It keeps checking its own deque until either someone else puts a new task in it or the thread becomes sure (or reasonably sure) that there is no work in the entire system.

The C-Share scheduler maintains a centralized task queue. Whenever a thread spawns new tasks it keeps one of them for immediate execution and puts the rest of them in that centralized queue. Whenever a thread becomes idle it dequeues a task (if one exists) from the centralized queue and starts executing it.
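All three schedulers can sit on top of one lock-protected task container; below is a minimal sketch in C with pthreads, where struct task, the fixed capacity, and the single coarse lock are simplifying assumptions. DR-Steal pops at the bottom and steals at the top; DR-Share would additionally need a push at the top of a victim's deque (a mirror of steal_top, not shown); C-Share is simply one shared instance.

    #include <pthread.h>
    #include <stdbool.h>

    #define DEQUE_CAP 4096          /* assumed capacity; tune as needed */

    struct task;                    /* hypothetical task descriptor */

    typedef struct {
        struct task *buf[DEQUE_CAP];
        int top, bot;               /* tasks live in [top, bot); indices are
                                       not recycled in this simple sketch */
        pthread_mutex_t lock;       /* one coarse lock keeps things simple */
    } deque_t;

    static void deque_init(deque_t *d) {
        d->top = d->bot = 0;
        pthread_mutex_init(&d->lock, NULL);
    }

    /* Owner pushes a newly spawned task at the bottom (DR-Steal). */
    static bool push_bottom(deque_t *d, struct task *t) {
        pthread_mutex_lock(&d->lock);
        bool ok = (d->bot < DEQUE_CAP);
        if (ok) d->buf[d->bot++] = t;
        pthread_mutex_unlock(&d->lock);
        return ok;
    }

    /* Owner pops the bottommost task; NULL if the deque is empty. */
    static struct task *pop_bottom(deque_t *d) {
        pthread_mutex_lock(&d->lock);
        struct task *t = (d->bot > d->top) ? d->buf[--d->bot] : NULL;
        pthread_mutex_unlock(&d->lock);
        return t;
    }

    /* Thief removes the topmost task from a victim's deque (DR-Steal). */
    static struct task *steal_top(deque_t *d) {
        pthread_mutex_lock(&d->lock);
        struct task *t = (d->bot > d->top) ? d->buf[d->top++] : NULL;
        pthread_mutex_unlock(&d->lock);
        return t;
    }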

(a) [ 25 Points ] Run Par-Rec-MM under each of the three schedulers by varying n from 2^10 to 2^s (consider only powers of 2), where s is the largest integer such that none of the three implementations takes more than five minutes to terminate. Use all available cores. For each run compute its rate of execution in GFLOPS [5]. Create a single plot showing the GFLOPS for all three programs. Explain your findings.
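To make footnote 5 concrete, an illustrative calculation (the running time is made up, not measured): for n = 2^11, Par-Rec-MM performs 2n^3 = 2 × (2^11)^3 ≈ 1.72 × 10^10 floating point operations; if that run took 4 seconds, its rate would be (1.72 × 10^10 / 4) / 10^9 ≈ 4.3 GFLOPS.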

(b) [ 25 Points ] Repeat part 2(a) but instead of plotting GFLOPS plot cache miss rates [6] for the L1, L2 and L3 caches. Explain your findings.

(c) [ 25 Points ] Repeat part 2(a) but instead of varying n keep n fixed at 2^s (where s is as defined in part 2(a)) and vary p from 1 to the maximum number of cores available on the machine. For each run compute its efficiency [7]. Create a single plot showing how 1 − efficiency (giving the fraction of total work spent in overheads) varies with p for all three programs. Explain your findings.
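As an illustrative (made-up) example of footnote 7: if T_1 = 120 seconds and T_16 = 10 seconds on p = 16 cores, then E_16 = 120 / (16 × 10) = 0.75, so 1 − E_16 = 0.25, i.e., a quarter of the total work went into scheduling and other overheads.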

(d) [ 25 Points ] For this part we modify DR-Steal and DR-Share to DR-Steal-Mod and DR-Share-Mod, respectively, to keep track of the total amount of work the tasks in each deque represent. We assume that the task of multiplying two m × m matrices requires m^3 units of work. So, when such a task is added to a deque it increases the amount of work in that deque by m^3 and when it is removed it decreases the work by the same amount.

[5] To obtain the rate of execution of a program in FLOPS (FLoating point Operations Per Second) divide the total number of floating point operations it performs by its running time in seconds. GFLOPS (Giga FLOPS) is obtained by dividing FLOPS by 10^9. The total number of floating point operations performed by Par-Rec-MM for multiplying two n × n matrices is 2n^3.

[6] For this task we will compute the cache miss rate of a particular run of Par-Rec-MM by dividing the total number of cache misses (across all cores) it incurs by the total number of cache accesses in the worst case, which is 3n^3 when multiplying two n × n matrices.

[7] Efficiency E_p of a program run on p processing cores is defined as T_1/(p T_p), where T_p is its running time on p cores (thus T_1 is its running time on a single core).


When a thread runs out of work under DR-Steal-Mod it chooses two deques uniformly at random and attempts to steal work from the one that has more work (breaking ties arbitrarily). On the other hand, when a thread wants to put a newly spawned task in another deque under DR-Share-Mod it chooses two deques uniformly at random and inserts the task into the one that has less work (again breaking ties arbitrarily).

Now repeat parts 2(a) and 2(c) with DR-Steal-Mod and DR-Share-Mod, and compare the results with what you got with DR-Steal and DR-Share, respectively. Explain your findings.
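A minimal sketch of the two-choice rule both modified schedulers share (the names deques, num_threads and the work field are illustrative assumptions; the work counter is maintained as described above):

    #include <stdlib.h>

    /* Per-deque bookkeeping for DR-Steal-Mod / DR-Share-Mod: 'work' grows
       by m^3 when an m x m multiplication task is enqueued and shrinks by
       the same amount when one is removed. */
    typedef struct { long long work; /* ... plus the deque itself ... */ } wdeque_t;

    extern wdeque_t deques[];   /* one per thread, p in total */
    extern int num_threads;

    /* Sample two deques uniformly at random (this sketch allows them to
       coincide) and return the index of the one holding MORE work: the
       steal target under DR-Steal-Mod. DR-Share-Mod applies the mirror
       rule, inserting into the one holding LESS work. */
    static int pick_victim(void) {
        int a = rand() % num_threads;
        int b = rand() % num_threads;
        return (deques[a].work >= deques[b].work) ? a : b; /* ties: arbitrary */
    }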

Task 3. [ 50 Points ] Some Analyses. This task asks you to analyze some aspects of the schedulers from Task 2.

(a) [ 10 Points ] Suppose that in DR-Steal a thread stops looking for work and terminates after 2p ln p consecutive failed steal attempts. Prove that during those 2p ln p attempts, the thread has checked all deques in the system w.h.p. in p (and found each of them empty).

(b) [ 5 Points ] Argue that even if a thread finds every deque in the system empty in consecutive failed steal attempts, that does not guarantee that the entire system has run out of work.

(c) [ 5 Points ] Argue that even if threads follow the strategy in part 3(a) to terminate prematurely (as suggested by part 3(b)), all work in the system will still be completed.

(d) [ 15 Points ] Suppose in DR-Share from Task 2 no two enqueues (i.e., enqueuing a new task into a deque) in the entire system happen at exactly the same time. Then prove that during any sequence of p consecutive enqueues each deque undergoes O(ln p / ln ln p) enqueue operations w.h.p. in p.

(e) [ 15 Points ] Consider the following simplified version of DR-Share-Mod from Task 2. Whenever a thread needs to put a newly spawned task in another deque it chooses two deques uniformly at random and enqueues the task into the deque with the fewer tasks (breaking ties arbitrarily). This part asks for a simple intuitive (not rigorous at all) proof that such a strategy leads to a more balanced distribution of tasks among deques compared to what happens under DR-Share (part 3(d)).

Suppose that in the version of DR-Share-Mod given above no two enqueue attempts (i.e., enqueuing a new task into a deque) in the entire system happen at exactly the same time. Consider p such consecutive enqueue attempts. Let f_i be the fraction of the p deques that have received at least i tasks during that time. We will say that a task has rank i in a deque provided it is the i-th task landing in that deque during those p enqueue operations.

i. Argue that one can expect f_{i+1} ≤ (f_i)^2.

ii. Use your result from above to show that f_i ≤ (1/2)^(2^(i−1)).

iii. Use your result from above to show that no task is likely to have a rank larger than log log p during those p consecutive enqueue attempts.


APPENDIX 1: What to Turn in

One compressed archive file (e.g., zip, tar.gz) containing the following items.

– Source code, makefiles and job scripts.

– A PDF document containing all answers and plots.

APPENDIX 2: Things to Remember

– Please never run anything that takes more than a minute or uses multiple cores on login nodes. Doing so may result in account suspension. All runs must be submitted as jobs to compute nodes (even when you use Cilkview or PAPI).

– Please store all data in your work folder ($WORK), and not in your home folder ($HOME).

– When measuring running times please exclude the time needed for reading the input and writing the output. Measure only the time needed by the algorithm.
