Workshop 6

A Simple Kernel


In this workshop, you code a kernel that multiplies two square matrices and profile your application for a range of matrix sizes.


Learning Outcomes

Upon successful completion of this workshop, you will have demonstrated the abilities to

  1. code a kernel that executes on the device
  2. calculate the number of grid blocks required for a CUDA solution
  3. launch an execution configuration
  4. profile an application using a parallel profiler
  5. summarize what you think you have learned in completing this workshop

Specifications

This workshop consists of two parts:

  • coding a kernel that calculates one coefficient (element) of the product of two square matrices
  • profiling your solution for a range of square-matrix sizes
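
As a point of reference for the first part, the coefficient at a given row and column of the product is the dot product of that row of the first matrix with that column of the second. The following is a minimal host-side C++ sketch of that calculation, assuming row-major storage. The name productElement is hypothetical and is not part of w6.cu; it illustrates the arithmetic only, not the device code that you are asked to write.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of w6.cu): returns the single element
// c[row][col] of the product c = a * b, where a and b are n x n matrices
// stored in row-major order.
float productElement(const std::vector<float>& a, const std::vector<float>& b,
                     int n, int row, int col) {
    assert(row >= 0 && row < n && col >= 0 && col < n);
    float sum = 0.0f;
    for (int k = 0; k < n; k++)
        sum += a[row * n + k] * b[k * n + col];  // row of a times column of b
    return sum;
}
```

For the 2 by 2 matrices a = {1, 2, 3, 4} and b = {5, 6, 7, 8}, productElement(a, b, 2, 0, 0) evaluates 1 * 5 + 2 * 7 = 19.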

Kernel

The following incomplete application allocates two square matrices of user-specified size, initializes them with identical values, multiplies them on the device and copies the result to host memory.  The number of rows (and columns) in each matrix is a command-line argument.

 // Simple Matrix Multiply - Workshop 6
 // w6.cu

 #include <iostream>
 #include <iomanip>
 #include <cstdlib>
 #include <cmath>
 #include <chrono>
 // add CUDA runtime header file
 using namespace std::chrono;

 const int ntpb = 32; // number of threads per block along each dimension

 // - add your kernel here









 // check reports error if any
 //
 void check(const char* msg, const cudaError_t err) {
     if (err != cudaSuccess)
         std::cerr << "*** " << msg << ":" << cudaGetErrorString(err) << " ***\n";
 }

 // display matrix M, which is stored in row-major order
 //
 void display(const char* str, const float* M, int nr, int nc)
 {
     std::cout << str << std::endl;
     std::cout << std::fixed << std::setprecision(4);
     for (int i = 0; i < nr; i++) {
         for (int j = 0; j < nc; j++)
             std::cout << std::setw(10)
              << M[i * nc + j];
         std::cout << std::endl;
     }
     std::cout << std::endl;
 }

 // report system time
 //
 void reportTime(const char* msg, steady_clock::duration span) {
     auto ms = duration_cast<milliseconds>(span);
     std::cout << msg << " - took - " <<
      ms.count() << " millisecs" << std::endl;
 }

 // matrix multiply
 //
 void sgemm(const float* h_a, const float* h_b, float* h_c, int n) {

     // - calculate number of blocks for n rows

     // allocate memory for matrices d_a, d_b, d_c on the device

     // - add your allocation code here

     // copy h_a and h_b to d_a and d_b (host to device)
     // - add your copy code here

     // launch execution configuration
     // - define your 2D grid of blocks
     // - define your 2D block of threads
     // - launch your execution configuration
     // - check for launch termination

     // copy d_c to h_c (device to host)
     // - add your copy code here

     // deallocate device memory
     // - add your deallocation code here

     // reset the device
     cudaDeviceReset();
 }

 int main(int argc, char* argv[]) {
     if (argc != 2) {
         std::cerr << argv[0] << ": invalid number of arguments\n"; 
         std::cerr << "Usage: " << argv[0] << "  size_of_matrix\n";
         return 1;
     }
     int n = std::atoi(argv[1]); // number of rows/columns in h_a, h_b, h_c 
     steady_clock::time_point ts, te;

     // allocate host memory
     ts = steady_clock::now();
     float* h_a = new float[n * n];
     float* h_b = new float[n * n];
     float* h_c = new float[n * n];

     // populate host matrices a and b
     for (int i = 0, kk = 0; i < n; i++)
         for (int j = 0; j < n; j++, kk++)
             h_a[kk] = h_b[kk] = (float)kk / (n * n);
     te = steady_clock::now();
     reportTime("allocation and initialization", te - ts);

     // h_c = h_a * h_b
     ts = steady_clock::now();
     sgemm(h_a, h_b, h_c, n);
     te = steady_clock::now();
     reportTime("matrix-matrix multiplication", te - ts);

     // display results
     if (n <= 5) {
         display("h_a :", h_a, n, n);
         display("h_b :", h_b, n, n);
         display("h_c = h_a h_b :", h_c, n, n);
     }

     // check correctness
     std::cout << "correctness test ..." << std::endl;
     for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++) {
             float sum = 0.0f;
             for (int k = 0; k < n; k++)
                 sum += h_a[i * n + k] * h_b[k * n + j];
             if (std::abs(h_c[i * n + j] - sum) > 1.0e-3f)
              std::cout << "[" << i << "," << j << "]" << h_c[i * n + j]
              << " != " << sum << std::endl;
         }
     std::cout << "done" << std::endl;

     // deallocate host memory
     delete [] h_a;
     delete [] h_b;
     delete [] h_c;
 }

Complete the coding of sgemm, compile your solution and record the timing for the sizes listed below. 

n       Allocation and Initialization (ms)   Matrix Multiplication (ms)
250
500
750
1000
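
For the sizes in the table above, the number of blocks needed along each grid dimension is the integer ceiling of n divided by the number of threads per block along that dimension. The following is a minimal host-side C++ sketch of that calculation. The names blocksFor and idleThreads are hypothetical and are not part of w6.cu.

```cpp
#include <cassert>

// Hypothetical helper (not part of w6.cu): number of blocks needed along one
// grid dimension so that nblocks * ntpb threads cover all n rows (or columns).
int blocksFor(int n, int ntpb) {
    assert(n > 0 && ntpb > 0);
    return (n + ntpb - 1) / ntpb;  // integer ceiling of n / ntpb
}

// Hypothetical helper: number of threads along one dimension that fall
// beyond the edge of the matrix because n is not a multiple of ntpb.
int idleThreads(int n, int ntpb) {
    return blocksFor(n, ntpb) * ntpb - n;
}
```

With ntpb = 32, the sizes 250, 500, 750 and 1000 need 8, 16, 24 and 32 blocks along each dimension. None of these sizes is a multiple of 32 (for n = 250, 6 threads along each dimension lie beyond the matrix edge), so a kernel launched on such a grid needs a bounds check on its row and column indices before it reads or writes memory.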

You may find that for larger matrix sizes the kernel takes too long to execute and is terminated.  Add error handling around the launch to confirm the cause.
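
One likely reason that execution time climbs so quickly with matrix size is that the work grows with the cube of n. The following is a minimal host-side C++ sketch of the operation count, assuming a straightforward algorithm that performs n multiply-adds for each of the n * n coefficients. The name multiplyAdds is hypothetical and is not part of w6.cu.

```cpp
#include <cassert>

// Hypothetical helper (not part of w6.cu): number of multiply-add operations
// in a straightforward n x n matrix product (n per element, n * n elements).
long long multiplyAdds(int n) {
    assert(n >= 0);
    return static_cast<long long>(n) * n * n;  // widen first to avoid int overflow
}
```

Under that assumption, n = 250 requires 15,625,000 multiply-adds and n = 1000 requires 1,000,000,000, which is 64 times as many for a matrix with only 4 times as many rows.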

Profile

Start the Nsight Profiler in Visual Studio:

  • Nsight -> Start Performance Analysis
  • Select Trace Application under Activity Type
  • Select CUDA under Trace Settings
  • Click the Launch Button under Application Control

Results

Complete the table below from the results reported by the profiler.

n       Memcpy   Kernel   Session
250
500
750
1000

Prepare a 3D Look Realistic column chart that plots the memcpy, kernel and session times against n along the horizontal axis, as in the sample below.

[Chart: sample 3D column chart of memcpy, kernel and session times against n]

You can create the chart in OpenOffice using the following steps:

  • Highlight data and labels
  • Select Chart in the Toolbar
  • Chart Type - check 3D Look Realistic Column
  • Data Range - 1st row as label, 1st column as label
  • Chart Elements - add title, subtitle, axes labels

You can create the chart in Excel using the following steps:

  • Select Insert Tab -> Column -> 3D Clustered Column
  • Select Data -> remove n -> select edit on horizontal axis labels -> add n column
  • Select Chart tools -> Layout -> Chart Title - enter title and subtitle
  • Select Chart tools -> Layout -> Axis Titles -> Select axis - enter axis label

Save your chart as part of your spreadsheet file.


Submission

Copy the results of your tests into a file named w6.txt.  This file should include

  • your userid
  • the source code for your solution
  • output from running your test cases

Upload your typescript to Blackboard: 

  • Log in to Blackboard
  • Select your course code
  • Select Workshop 6 under Workshops
  • Upload w6.txt and your spreadsheet file (w6.ods or w6.xls)
  • Under "Add Comments", write a short note to your instructor: a sentence or two describing what you think you have learned in this workshop.
  • When ready to submit, press "Submit"






  Designed by Chris Szalwinski
Creative Commons License