Homework 3: All-Pairs Shortest Path

Contents

Goal

Requirements

Command line specification

Input specification

Output specification

Grading

Submission

Resources

Blocked Floyd-Warshall algorithm

Final Notes

Goal

1. Understand the all-pairs shortest path problem.

2. Get familiar with CUDA by implementing the blocked Floyd-Warshall algorithm.

Requirements

In this assignment, you are asked to implement two versions of all-pairs shortest path.

hw3a. Thread version. You may use pthread or std::thread. The code should run on apollo. You can use any

algorithm for shortest path.

hw3b. CUDA version. The code should run on hades. You should use the blocked Floyd-Warshall algorithm.

Both versions follow the same input & output format.

Command line specification

For hw3a on apollo:

srun -N1 -n1 -c12 ./hw3a inputfile outputfile

For hw3b on hades:

srun –gres=gpu:1 -pipl ./hw3b inputfile outputfile

Where inputfile and outputfile are the paths to the corresponding files. The format of the input/output

files are specified below.

Input specification

The input is a binary file containing a directed graph with non-negative edge distances.

The first two integers mean number of vertices (V) and number of edges (E).

After that, there are the list of edges. Each edge contains three integers: source vertex id ( ) ,

destination vertex id ( ), edge weight ( ).

All values are non-negative integers. The values of vertex indexes start at 0.

The ranges for the input is:

srci

dsti wi

for correctness

for performance testing

(there will not be repeated edges)

Here’s an example:

offset type decimal value description

0000 32-bit integer 3 # vertices (V)

0004 32-bit integer 6 # edges (E)

0008 32-bit integer 0 src id for edge 0

0012 32-bit integer 1 dst id for edge 0

0016 32-bit integer 3 edge 0’s distance

0020 32-bit integer src id for edge 1

… … … …

0076 32-bit integer edge 5’s distance

Output specification

The output file is also in binary format.

For an input file with vertices, you should output an output file containing integers.

The first integers should be the shortest path distances for starting from edge 0:

; then the following integers would be the

shortest path distances starting from edge 1: ; and so

on, totaling integers.

where .

If there is no valid path between i→j, please output with: .

Example output file:

offset type decimal value description

0000 32-bit integer 0 dist(0, 0)

0004 32-bit integer ? dist(0, 1)

0008 32-bit integer ? dist(0, 2)

… … … …

4V2

-8 32-bit integer ? min dist(V-1, V-2)

4V2

-4 32-bit integer 0 min dist(V-1, V-1)

Grading

1. hw3a (30%)

An unknown number of test cases will be used to test your implementation. You get 30 points for hw3a if

you passed all the test cases, points if there are failed test cases. There will be hidden

testcases.

For each test case, you pass it if:

2 ≤ V ≤ 10000

2 ≤ V ≤ 45000

0 ≤ E ≤ V × (V − 1)

0 ≤ srci , dsti < V

srci ≠ dsti

if srci = srcj then dsti ≠ dstj

0 ≤ wi < 1000

V V

2

V

dist(0, 0), dist(0, 1), dist(0, 2),…, dist(0, V − 1) V

dist(1, 0), dist(1, 1), dist(1, 2),…, dist(1, V − 1)

V

2

dist(i, j) = 0 i = j

dist(i, j) = 2 − 1 = 1073741823

30

max(0, 30 − 2k) k

The answer is correct.

The execution time of your implementation is shorter than , where is the

execution time of TA’s sequential code.

2. hw3b correctness (30%)

An unknown number of test cases will be used to test your implementation. You get 30 points for hw3a if

you passed all the test cases, points if there are failed test cases. There will be hidden

testcases.

For each test case, you pass it if:

The answer is correct.

The execution time of your implementation is shorter than 30 seconds.

Note that for the purpose for correctness testing, .

3. hw3b performance (30%)

Points are given according to the largest test case (largest ) you can complete within 30 seconds.

3. Demo (10%)

Each student is given 7 minutes to explain the implementation followed by some questions from TA.

Points are given according to your understanding and explanation of your code, and your answers of

the TA questions.

Submission¶

Put these source code under at ~/homework/hw3 in apollo31, then run hw3a-judge:

Your Makefile for hw3a – ~/homework/hw3/Makefile

Source code for thread version – ~/homework/hw3/hw3a.cc or ~/homework/hw2/hw3a.c

Put these source code under at ~/homework/hw3 in hades01, then run hw3b-judge:

Your Makefile for hw3b – ~/homework/hw3/Makefile

Source code for the CUDA version – ~/homework/hw3/hw3b.cu

make hw3a and make hw3b will be used to compile your code.

Use hw3a-judge and hw3b-judge to check and submit your code before the deadline. You can submit as

many times as you want. You need to answer y when judging to update your submitted code.

Resources

Sequential Blocked FW source code – hades:/home/ipl19/y/hw3/seq.cc

Recommended compiler flags

hw3a: g?? -O3 -pthread -std=c++11 -Wall

hw3b: nvcc -O3 -std=c++11 -Xptxas=-v -arch=sm_61

No example Makefile for you this time!

Scoreboard location: http://ipl.cs.nthu.edu.tw/s/hw3

Correctness Testcases: /home/ipl19/y/hw3/c

Performance Testcases: /home/ipl19/y/hw3/p, only on hades

Use the hw3-cat command to view the binary test cases in text format.

Blocked Floyd-Warshall algorithm

Tseq ÷ 6 + 10 seconds Tseq

max(0, 30 − 2k) k

V ≤ 10000

V

Given an matrix where represents the distance (weight of the edge) from a

vertex to a vertex in a directed graph with vertices. We define an matrix where

denotes the shortest-path distance from a vertex to a vertex . Let be the result

which all the intermediate vertices are in the set .

We define as the following:

The matrix gives the answer to the all-pairs shortest path problem.

In the blocked all-pairs shortest path algorithm, we partition into blocks of

submatrices. The number is called the blocking factor. For instance, in figure 1, we divide a matrix

into submatrices (or blocks) by .

Figure 1: Divide a matrix by B = 2

The blocked version of the Floyd-Warshall algorithm will perform rounds, and each round is divided

into 3 phases. It performs iterations in each phase.

Assuming a block is identified by its index , where . The block with index is

denoted by .

In the following explanation, we assume and . The execution flow is described step by step as

follows:

Phase 1: self-dependent blocks.

In the -th iteration, the first phase is to compute the pivot block .

For instance, in the 1st iteration, is computed as follows:

V × V W = [w(i, j)] w(i, j) ≥ 0

i j V V × V D = [d(i, j)]

d(i, j) i j D = [ (i, j)]

(k) d

(k)

{1, 2, . . . , k}

d (i, j)

(k)

d (i, j) = {

(k) w(i, j)

min(d (i, j), (i, k − 1) + (k − 1, j))

(k−1) d

(k−1) d

(k−1)

if k = 0;

if k ≥ 1.

D = (i, j)

(V ) d

(V )

D ⌈V /B⌉ × ⌈V /B⌉ B × B

B 6 × 6

3 × 3 B = 2

⌈V /B⌉

B

(I, J) 0 ≤ I, J < ⌈V /B⌉ (I, J)

D

(k)

(I,J)

N = 6 B = 2

k B × B D

(k⋅B)

(k−1,k−1)

D

(2)

(0,0)

d (0, 0) = min( (0, 0), (0, 0) + (0, 0))

(1) d

(0) d

(0) d

(0)

d (0, 1) = min( (0, 1), (0, 0) + (0, 1))

(1) d

(0) d

(0) d

(0)

d (1, 0) = min( (1, 0), (1, 0) + (0, 0))

(1) d

(0) d

(0) d

(0)

d (1, 1) = min( (1, 1), (1, 0) + (0, 1))

(1) d

(0) d

(0) d

(0)

d (0, 0) = min( (0, 0), (0, 1) + (1, 0))

(2) d

(1) d

(1) d

(1)

d (0, 1) = min( (0, 1), (0, 1) + (1, 1))

(2) d

(1) d

(1) d

(1)

d (1, 0) = min( (1, 0), (1, 1) + (1, 0))

(2) d

(1) d

(1) d

(1)

d (1, 1) = min( (1, 1), (1, 1) + (1, 1))

(2) d

(1) d

(1) d

(1)

Note that the result of depends on the result of and therefore cannot be computed in parallel with

the computation of .

Phase 2: pivot-row and pivot-column blocks.

In the -th iteration, it computes all and where .

The result of pivot-row / pivot-column blocks depend on the result in phase 1 and itself.

For instance, in the 1st iteration, the result of depends on and :

Phase 3: other blocks.

In the -th iteration, it computes all where .

The result of these blocks depends on the result from phase 2 and itself.

For instance, in the 1st iteration, the result of depends on and :

Figure 2: The 3 phases of blocked FW algorithm in the first 1st iteration.

d

(2) d

(1)

d

(1)

k D

(k⋅B)

(h,k) D

(k⋅B)

(k,h) h ≠ k

D

(2)

(0,2) D

(2)

(0,0) D

(0)

(0,2)

d (0, 4) = min( (0, 4), (0, 0) + (0, 4))

(1) d

(0) d

(2) d

(0)

d (0, 5) = min( (0, 5), (0, 0) + (0, 5))

(1) d

(0) d

(2) d

(0)

d (1, 4) = min( (1, 4), (1, 0) + (0, 4))

(1) d

(0) d

(2) d

(0)

d (1, 5) = min( (1, 5), (1, 0) + (0, 5))

(1) d

(0) d

(2) d

(0)

d (0, 4) = min( (0, 4), (0, 1) + (1, 4))

(2) d

(1) d

(2) d

(1)

d (0, 5) = min( (0, 5), (0, 1) + (1, 5))

(2) d

(1) d

(2) d

(1)

d (1, 4) = min( (1, 4), (1, 1) + (1, 4))

(2) d

(1) d

(2) d

(1)

d (1, 5) = min( (1, 5), (1, 1) + (1, 5))

(2) d

(1) d

(2) d

(1)

k D

(k⋅B)

(h1 ,h2 ) h1

, h2 ≠ k

D

(2)

(1,2) D

(2)

(1,0) D

(2)

(0,2)

d (2, 4) = min( (2, 4), (2, 0) + (0, 4))

(1) d

(0) d

(2) d

(2)

d (2, 5) = min( (2, 5), (2, 0) + (0, 5))

(1) d

(0) d

(2) d

(2)

d (3, 4) = min( (3, 4), (3, 0) + (0, 4))

(1) d

(0) d

(2) d

(2)

d (3, 5) = min( (3, 5), (3, 0) + (0, 5))

(1) d

(0) d

(2) d

(2)

d (2, 4) = min( (2, 4), (2, 1) + (1, 4))

(2) d

(1) d

(2) d

(2)

d (2, 5) = min( (2, 5), (2, 1) + (1, 5))

(2) d

(1) d

(2) d

(2)

d (3, 4) = min( (3, 4), (3, 1) + (1, 4))

(2) d

(1) d

(2) d

(2)

d (3, 5) = min( (3, 5), (3, 1) + (1, 5))

(2) d

(1) d

(2) d

(2)

Figure 3: The computations of , and their dependeicies in the first iteration.

Figure 4: In this particular example where and , we will require rounds.

Final Notes

Contact TA via [email protected] or iLMS immediately if you find any problems with the homework

specification, judge scripts, example source code or the test cases.

You are allowed to discuss and exchange ideas with others, but you are required to write the code on your

own. You’ll got 0 points if we found you cheating.

D

(2)

(0,2) D

(2)

(1,2)

V = 6 B = 2 ⌈V /B⌉ = 3