This assignment uses matrix multiplication to introduce you to GPU programming and performance issues.
Task 1 (30 points)
Implement a CUDA kernel to compute the product of two random rectangular matrices:
� = � ∗ �
where A is n×p, B is p×m, and C is n×m, and n, m, and p are positive integers whose values are provided as
command-line arguments. For Task 1, use GPU global memory to hold the matrices and carry out the calculation.
The directory /home/fas/cpsc424/ahs3/assignment4/ contains a sample code for square matrices
(matmul.cu), along with a makefile (Makefile) and instructions on how to run on the GPUs (README). Note
that the Makefile contains the list of module files you will need. Algorithmically, the sample code is generally
similar to the codes discussed in chapter 4 of Kirk & Hwu (2nd Edition, available online through the Yale Library,
and on pages 24-26 of the CUDA C Programming Guide posted in the CUDA Documentation folder in the Files
section of Canvas. Each GPU thread will compute a single element of C, and the sample code checks to be sure it
has a sufficient number of threads to carry out the computation.
The sample code includes a CPU version of the calculation, which you should use ONLY for debugging
purposes on small problems (e.g. Run 1 below). (Please be sure you comment it out or disable it the rest of
the time!) In addition to the GPU codes for this assignment, please create a simple serial CPU code that
implements and times the “kij” variant of matrix multiplication described on slide 8 in the
matrix_multiplication.pdf attachment to this assignment. Use this code (running on the node, not the GPU) to
obtain the CPU times requested below. Except for tasks 1(b) and 2(b), all codes should be single precision.
a. Run your kernel in single precision (using float variables) and produce a timing table for the matrix
dimensions shown below.
Run 1 Run 2 Run 3 Run 4 Run 5
n 1,024 8,192 1,024 8,192 8,192
m 1,024 8,192 1,024 8,192 1,024
p 1,024 8,192 8,192 1,024 8,192
The timings may depend on the dimensions of the thread blocks and grid. Experiment with a few different
block and grid dimensions and report only the best result for each case. (Make sure your table shows what
block and grid dimensions you used to obtain each result.) Use your “kij” code to obtain CPU times for
b. Run the case n = m = p = 8,192 using double precision on the GPU. (Note that the sample code includes
a #define statement to make it easy to switch between single precision and double precision.) In your
report, compare the performance for this run with the corresponding single precision run. (You may wish
to use block and grid dimensions for this run that differ from those in Run 2 of part (a).)
c. For square matrices (n = m = p), determine the largest matrix size for which you can run your Task 1
kernels on the GPU in single precision, and print GPU timing results for that size matrix. [Note: Do not
even try to run your CPU code for this part.]
Task 2 (40 points)
Modify your CUDA kernel to exploit shared memory on the GPU using a tiled matrix-multiplication code similar
to those discussed in Chapters 5 & 6 of Kirk & Hwu and on pages 27-29 (illustrated in Figure 10 of the CUDA C
Programming Guide). (Please do not revise your code to use the data structures used in the text or
Programming Guide. Instead, understand those codes and translate the ideas to your code from Task 1.) In
your new kernel, each thread block will compute a single tile of C, so each thread will compute one entry of C. You
may assume that the tiles and blocks are square and of the same size, but do not assume that any of the matrix
dimensions is necessarily a multiple of the tile width. Try to exploit memory coalescing in your new kernel, where
possible. Repeat parts (a) – (c) from Task 1 using your new kernel. With the timing results, you should report the
best tile size, and best block and grid dimensions, that you found in each case. Except for debugging, please do
not run serial calculations for Task 2.
Task 3 (30 points)
An important algorithmic decision in performance tuning is the granularity of thread computations. It is often
advantageous to put more work into each thread and use fewer threads—for example, when some redundant work
exists between threads. The figure below illustrates such a situation in the tiled matrix multiplication kernel (cf.,
chapter 6 of Kirk & Hwu).
In your Task 2 kernel, each thread block computes one tile by forming the dot products related to one tile-row of A
and one tile-column of B. Each tile of C requires a separate set of tile loads (into shared memory) of the required
tiles from global memory. As shown above, the calculations of two adjacent tiles requires the same set of tiles from
A, so it seems inefficient/redundant to load those tiles separately for each tile of C. The redundant tile loads could
be reduced if each block computed multiple adjacent tiles. For example, if each block computed two adjacent tiles
(instead of just one), then each thread could compute corresponding entries in each tile, and the number of tiles
loaded from global memory into shared memory would be reduced by ~25%. Unfortunately, this approach does
increase the usage of registers and shared memory, so it may not always be feasible or pay off.
Modify your single-precision kernel from Task 2 to implement the strategy described here. (The number of
adjacent tiles should be a parameter.) Then run the new kernel for the case n = m = p = 8,192, varying the number
of adjacent tiles handled at one time to try to find the optimum number. (Use the same tile size as you used for the
corresponding test case in Task 2.) Report the timing results obtained for successful trials, and indicate the optimum
one. In Task 3 only, you may assume that the matrix dimensions are multiples of the tile width.
Task 4: Extra Credit (10 points)
Modify your Task 3 code so that it works when the matrix dimensions are not multiples of the tile width.
Demonstrate that the new code works using n = m = p = 8,000 with a 16×16 tile size.
Notes and Special Procedures
1. As described in the README file on the cluster, two nodes (c22n05 and c22n06) have been set aside for
class use for the balance of the semester. Each node contains 2 K80 GPUs that we have configured as 4
assignable GPUs. You may find it best to use interactive sessions for this assignment. To create an
interactive session on a suitable node, use the following srun command:
srun –pty -c 5 –mem-per-cpu=6100 -p cpsc424_gpu -t 2:00:00
–gres-flags=enforce-binding –gres=gpu:1 bash
You may add the –x11 option to this command if you wish to make use of X Windows. The nodes are
set up for shared access by up to 4 students at once, with each session restricted to 5 cores, ~30GB of
CPU memory, and 1 GPU. The partition is also restricted to permit only 1 job per student at a time, but
since 8 students may run at a time, we have allowed a session to run for 2 hours.
If your laptop or desktop machine has a CUDA-capable GPU, you are free to install the CUDA Toolkit
on that machine, so that you can do much of your code development without using the cluster at all.
(You’ll still need to make all the final timing runs on the cluster, however.) If you wish to do that, you
can download the Toolkit for your machine from nvidia.com. (You may need to sign up as a developer
and then download from the CudaZone. For the class, we will use version 8.0 of the Toolkit.)
2. The README file in /home/fas/cpsc424/ahs3/assignment4/ contains information on how to
build GPU codes for this assignment. Additional generic information is available in the Cuda C
Programming Guide and other documents that will be attached to the assignment in the class Canvas site.
Procedures for Programming Assignments
For this class, we will use the Canvas website to submit solutions to programming assignments.
Remember: While you may discuss the assignment with me, a ULA, or your classmates, the source code you turn
in must be yours alone and should not represent collaborations or the ideas of others!
What should you include in your solution package?
1. All source code files, Makefiles, and scripts that you developed, used, or modified. All source code
files should contain proper attributions and suitable comments to explain your code.
2. A report in PDF format containing:
a. Your name, the assignment number, and course name/number.
b. Information on building and running the code:
i. A brief description of the software/development environment used. For example, you should list
the module files you’ve loaded.
ii. Steps/commands used to compile, link, and run the submitted code. Best is to use a Makefile for
building the code and to submit an sbatch script for executing it. (If you ran your code
interactively, then you’ll need to list the commands required to run it.)
iii. Outputs from executing your program.
c. Any other information required for the assignment, including any questions you were asked to answer.
How should you submit your solution?
1. On the cluster, create a directory named “NetID_ps3_cpsc424”. (For me, that would be “ahs3_ps3_cpsc424”.
Put into it all the files you need to submit.
2. Create a compressed tar file of your directory by running the following in its parent directory:
tar -cvzf NetID_ps3_cpsc424.tar.gz NetID_ps3_cpsc424
3. To submit your solution, click on the “Assignments” button on the Canvas website and select this assignment
from the list. Then click on the “Submit Assignment” button and upload your solution file
NetID_ps3_cpsc424.tar.gz. (Canvas will only accept files with a “gz” or “tgz” extension.)You may add
additional comments to your submission, but your report should be included in the attachment. You can use
scp or rsync or various GUI tools (e.g., CyberDuck) to move files back and forth to Omega.
Due Date and Late Policy (Note that this assignment is due in the morning.)
Due Date: Friday, December 7, 2018 by 5:30 p.m.
Late Policy: On time submission: Full credit
Up to 24 hours late: 90% credit
Up to 72 hours late: 75% credit
Up to 1 week late: 50% credit
More than 1 week late: 35% credit