Workshop 9

Memory Coalescence


In this workshop, you code a kernel in which each thread calculates a single coefficient of a matrix product and accesses global memory in a coalesced manner.


Learning Outcomes

Upon successful completion of this workshop, you will have demonstrated the abilities to

  1. write a kernel that uses shared memory to store data for its computations
  2. write a kernel that accesses global memory in a coalesced manner
  3. allocate, copy to and from, and deallocate device memory
  4. profile an application using a parallel profiler
  5. summarize what you have learned in completing this workshop

Specifications

This workshop consists of three parts:

  1. write a kernel without coalesced access to global memory
  2. write a kernel with coalesced access to global memory
  3. profile your kernels for 3 matrix sizes

Kernel Without Coalesced Access

The following partially complete application takes a matrix of user-specified size, initializes its components, multiplies the matrix by itself and copies the result to host memory.  The user-specified size is the command-line argument multiplied by the prescribed tile width (16). 

 // Workshop 9 - Memory Coalescence
 // w9.1.cu

 #include <iostream>
 #include <cstdlib>
 #include <cuda_runtime.h>
 // to remove intellisense highlighting
 #include <device_launch_parameters.h>
 #ifndef __CUDACC__
 #define __CUDACC__
 #endif
 #include <device_functions.h>

 const int TILE_WIDTH = 16;  // tile width in each direction

 __global__ void matMul(const float* a, const float* b, float* c, int width) { 
     // add kernel code without coalesced access here
 }

 int main(int argc, char* argv[]) {
     // interpret command-line arguments
     if (argc != 2) {
         std::cerr << argv[0] << ": invalid number of arguments\n"; 
         std::cerr << "Usage: " << argv[0] << "  no_of_rows|columns\n"; 
         return 1;
     }
     int n = atoi(argv[1]) * TILE_WIDTH; // number of rows/columns in A, B, C 

     float* d_A;
     float* d_B;
     float* d_C;
     float* h_A = new float[n * n];
     float* h_B = new float[n * n];
     float* h_C = new float[n * n];

     // populate host matrices h_A and h_B
     int kk = 0;
     for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++) {
             h_A[kk] = (float)kk;
             h_B[kk] = (float)kk;
             kk++;  // advance to the next element
         }

     // calculate the number of blocks
     int nblocks = n / TILE_WIDTH;
     dim3 grid(nblocks, nblocks);
     dim3 threads(TILE_WIDTH, TILE_WIDTH);

     // BLAS Level 3 calculation: C = A * B
     // add code - allocate memory for matrices d_A, d_B, d_C on the device

     // add code - copy h_A and h_B to d_A and d_B (host to device)

     // launch grid of threads
     matMul<<<grid, threads>>>(d_A, d_B, d_C, n);

     // add code - copy d_C to h_C (device to host)

     // add code - deallocate d_A, d_B, d_C, h_A, h_B, h_C

     // reset the device
     cudaDeviceReset();
 }

Complete the coding, compile the program and test it for the sizes listed in the table below. 
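The sketch below shows one possible reading of "without coalesced access", together with the allocate, copy and deallocate steps that the skeleton leaves open. It is only an illustration under stated assumptions, not the expected solution: the names matMulStrided and multiplyOnDevice are hypothetical, and the matrices are assumed to be stored in row-major order with n a multiple of TILE_WIDTH. Because threadIdx.x selects the row, neighbouring threads of a warp read and write addresses that lie width floats apart.

```cuda
#include <cassert>
#include <cuda_runtime.h>

const int TILE_WIDTH = 16;  // tile width in each direction, as in w9.1.cu

// One thread per coefficient of c. threadIdx.x selects the ROW, so
// neighbouring threads in a warp access a and c with a stride of width floats.
__global__ void matMulStrided(const float* a, const float* b, float* c, int width) {
    int row = blockIdx.x * TILE_WIDTH + threadIdx.x;
    int col = blockIdx.y * TILE_WIDTH + threadIdx.y;
    if (row < width && col < width) {
        float sum = 0.0f;
        for (int k = 0; k < width; k++)
            sum += a[row * width + k] * b[k * width + col];
        c[row * width + col] = sum;
    }
}

// Hypothetical helper: computes C = A * B on the device for n x n matrices.
// Returns true only if every CUDA runtime call succeeded.
bool multiplyOnDevice(const float* h_A, const float* h_B, float* h_C, int n) {
    assert(n > 0 && n % TILE_WIDTH == 0);  // assumption carried over from main()
    size_t bytes = (size_t)n * n * sizeof(float);
    float* d_A = nullptr;
    float* d_B = nullptr;
    float* d_C = nullptr;

    // allocate memory for the three matrices on the device
    bool ok = cudaMalloc((void**)&d_A, bytes) == cudaSuccess &&
              cudaMalloc((void**)&d_B, bytes) == cudaSuccess &&
              cudaMalloc((void**)&d_C, bytes) == cudaSuccess;

    // copy the operands from host to device
    ok = ok && cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
               cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        dim3 grid(n / TILE_WIDTH, n / TILE_WIDTH);
        dim3 threads(TILE_WIDTH, TILE_WIDTH);
        matMulStrided<<<grid, threads>>>(d_A, d_B, d_C, n);
        ok = cudaGetLastError() == cudaSuccess;  // catches launch errors
    }

    // copy the result from device to host; this blocking copy also waits
    // for the kernel to finish
    ok = ok && cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    // deallocate device memory (cudaFree accepts a null pointer)
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return ok;
}
```

Checking the return value of each runtime call, as this helper does, makes a failed allocation or copy visible before you start reading profiler timings.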

Kernel With Coalesced Access

Copy your completed code for the first kernel (w9.1.cu) to a file named w9.2.cu.  Upgrade the kernel in this copy to access global memory in a coalesced manner.  Compile your upgraded program and test it for the sizes listed in the table below.
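As a hedged illustration of the upgrade, the sketch below uses shared-memory tiles, which is one common way to obtain coalesced access and is not necessarily the form your instructor expects. The names matMulTiled and launchTiled are hypothetical, and the code assumes row-major storage with width a multiple of TILE_WIDTH. Here threadIdx.x selects the column, so neighbouring threads of a warp load and store neighbouring addresses in global memory.

```cuda
#include <cassert>
#include <cuda_runtime.h>

const int TILE_WIDTH = 16;  // tile width in each direction, as in w9.1.cu

// Each block computes one TILE_WIDTH x TILE_WIDTH tile of c.
// threadIdx.x indexes the column, so consecutive threads touch
// consecutive global addresses when loading a and b and when storing c.
__global__ void matMulTiled(const float* a, const float* b, float* c, int width) {
    __shared__ float s_a[TILE_WIDTH][TILE_WIDTH];
    __shared__ float s_b[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + tx;
    int row = blockIdx.y * TILE_WIDTH + ty;

    float sum = 0.0f;
    // walk across the tiles of a (left to right) and b (top to bottom)
    for (int t = 0; t < width / TILE_WIDTH; t++) {
        // each thread loads one element of each tile into shared memory
        s_a[ty][tx] = a[row * width + t * TILE_WIDTH + tx];
        s_b[ty][tx] = b[(t * TILE_WIDTH + ty) * width + col];
        __syncthreads();  // both tiles are complete before anyone reads them

        for (int k = 0; k < TILE_WIDTH; k++)
            sum += s_a[ty][k] * s_b[k][tx];
        __syncthreads();  // everyone is done before the tiles are overwritten
    }
    c[row * width + col] = sum;
}

// Hypothetical launch wrapper: d_A, d_B and d_C are device pointers to
// n x n matrices that the caller has already allocated and populated.
void launchTiled(const float* d_A, const float* d_B, float* d_C, int n) {
    assert(n > 0 && n % TILE_WIDTH == 0);  // the kernel has no bounds checks
    dim3 grid(n / TILE_WIDTH, n / TILE_WIDTH);
    dim3 threads(TILE_WIDTH, TILE_WIDTH);
    matMulTiled<<<grid, threads>>>(d_A, d_B, d_C, n);
}
```

The two __syncthreads() calls are what make the shared tiles safe to reuse from one iteration to the next; removing either one introduces a race between threads of the same block.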

Profile

Start the Visual Profiler by entering the following at the command line:

 nvvp

In the profiler, rerun each test case for both your reference version (w9.1.cu) and your upgraded version (w9.2.cu).
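If you would like to cross-check the kernel time reported by the profiler from inside your program, CUDA events are one option. The sketch below is an optional aid and not part of the workshop requirements: the kernel scale is a hypothetical placeholder standing in for matMul, and timeScaleMs is a hypothetical helper name.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical): stands in for matMul in this sketch.
__global__ void scale(float* x, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

// Times one launch of the placeholder kernel with CUDA events.
// Returns the elapsed time in milliseconds, or -1.0f if something failed.
float timeScaleMs(float* d_x, int n) {
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start);                    // marker before the launch
    scale<<<blocks, threads>>>(d_x, 2.0f, n);
    cudaEventRecord(stop);                     // marker after the launch
    cudaEventSynchronize(stop);                // wait until stop has occurred

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Event timings and profiler timings will not agree exactly, so treat the event value as a sanity check and record the profiler's figures in your table.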

Results

Complete the table below from the results reported by the profiler for the kernel and the session.

 n      Without Coalesced Access    With Coalesced Access
 64
 128
 192

Prepare a 3D Look Realistic column chart that plots the memcpy, kernel and session times against n, with n along the horizontal axis, as shown below.

Chart

You can create the chart in Open Office using the following steps:

  • Highlight data and labels
  • Select Chart in the Toolbar
  • Chart Type - check 3D Look Realistic Column
  • Data Range - 1st row as label, 1st column as label
  • Chart Elements - add title, subtitle, axes labels

You can create the chart in Excel using the following steps:

  • Select Insert Tab -> Column -> 3D Clustered Column
  • Select Data -> remove n -> select edit on horizontal axis labels -> add n column
  • Select Chart tools -> Layout -> Chart Title - enter title and subtitle
  • Select Chart tools -> Layout -> Axis Titles -> Select axis - enter axis label

Save your chart as part of your spreadsheet file.


SUBMISSION

Copy your source code for both versions to a file named w9.txt.  This file should include

  • your userid
  • your source code
  • the output from compiling your code

Upload your typescript to Blackboard: 

  • Login to
  • Select your course code
  • Select Workshop 9 under Workshops
  • Upload w9.txt and w9.ods or w9.xls
  • Under "Add Comments" write a short note to your instructor:  Add a sentence or two describing what you think you have learned in this workshop.
  • When ready to submit, press "Submit"






  Designed by Chris Szalwinski
Creative Commons License