Workshop 4

cuBLAS


In this workshop, you compare the cuBLAS matrix multiplier with the gcc cblas matrix multiplier. 


Learning Outcomes

Upon successful completion of this workshop, you will have demonstrated the abilities to

  1. code matrix multiplication using the cblas library functions
  2. code matrix multiplication using the cuBLAS library functions
  3. summarize what you have learned in completing this workshop

Specifications

This workshop repeats the task started in Workshop 2.  There are two versions of the same matrix-multiplication program to complete: the cblas version and the cuBLAS version.  Each executable takes one command-line argument, which is the number of rows/columns in the square matrices being multiplied.

cblas Version

Complete the following program using the cblas implementation of the BLAS standard.  For more information, refer to the L.A. in Science chapter and Workshop 2.

 // Level 3 cblas - Workshop 4
 // w4_cblas.cpp

 #include <iostream>
 #include <iomanip>
 #include <cstdlib>
 #include <chrono>
 // add cblas header file
 using namespace std::chrono;

 // indexing function (column major order)
 //
 inline int idx(int r, int c, int n)
 {
     // ... add indexing formula
 }

 // display matrix M, which is stored in column-major order
 //
 void display(const char* str, const float* M, int nr, int nc)
 {
     std::cout << str << std::endl;
     std::cout << std::fixed << std::setprecision(4);
     for (int i = 0; i < nr; i++) {
         for (int j = 0; j < nc; j++)
             std::cout << std::setw(10)
              << // ... access in column-major order;
         std::cout << std::endl;
     }
     std::cout << std::endl;
 }

 // report system time
 //
 void reportTime(const char* msg, steady_clock::duration span) {
     auto ms = duration_cast<milliseconds>(span);
     std::cout << msg << " - took - " <<
      ms.count() << " millisecs" << std::endl;
 }


 // matrix multiply
 //
 void sgemm(const float* A, const float* B, float* C, int n) {
     steady_clock::time_point ts, te;

     // level 3 calculation: C = alpha * A * B + beta * C

     // add any preliminaries

     ts = steady_clock::now();
     // ... add call to cblas sgemm
     te = steady_clock::now();
     reportTime("matrix-matrix multiplication", te - ts);
     
 }

 int main(int argc, char* argv[]) {
     if (argc != 2) {
         std::cerr << argv[0] << ": invalid number of arguments\n"; 
         std::cerr << "Usage: " << argv[0] << "  size_of_matrices\n"; 
         return 1;
     }
     int n = std::atoi(argv[1]); // no of rows/columns in A, B, C 

     // allocate host memory
     float* h_A = new float[n * n];
     float* h_B = new float[n * n];
     float* h_C = new float[n * n];

     // populate host matrices A and B
     for (int i = 0, kk = 0; i < n; i++)
         for (int j = 0; j < n; j++, kk++)
             h_A[kk] = h_B[kk] = (float)kk;

     // C = A * B
     sgemm(h_A, h_B, h_C, n);

     // display results
     if (n <= 5) {
         display("A :", h_A, n, n);
         display("B :", h_B, n, n);
         display("C = A B :", h_C, n, n);
     }

     // deallocate host memory
     delete [] h_A;
     delete [] h_B;
     delete [] h_C;
 }
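
As a reminder of the storage convention the skeleton relies on, the following sketch shows how an element of a column-major matrix can be located.  The helper names colMajorIndex and at are hypothetical (they are not part of the workshop code or of any library), and the sketch assumes that the leading dimension equals the number of rows.

```cpp
// Hypothetical helper, for illustration only: offset of element (r, c)
// in a matrix stored column by column, where ld is the leading dimension
// (the number of rows stored for each column).
inline int colMajorIndex(int r, int c, int ld)
{
    return c * ld + r;
}

// Read element (r, c) of a column-major matrix M that has nr rows.
inline float at(const float* M, int r, int c, int nr)
{
    return M[colMajorIndex(r, c, nr)];
}
```

With n = 4, element (1, 2) sits at offset 2 * 4 + 1 = 9, which agrees with the value 9.0000 in row 1, column 2 of the sample matrix A in the test results.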

Compile and Link

Compile and link your completed version of this program using version 4.8 of GCC or higher along with the -O2 optimization switch.  Version 7.2.0 is available in matrix's local system directory and accessible using the following Makefile:

# Makefile for w4
#
GCC_VERSION = 7.2.0
PREFIX = /usr/local/gcc/${GCC_VERSION}/bin/
CC = ${PREFIX}gcc
CPP = ${PREFIX}g++

w4_cblas: w4_cblas.o
        $(CPP) -ow4_cblas w4_cblas.o -lgslcblas 

w4_cblas.o: w4_cblas.cpp
        $(CPP) -c -O2 -std=c++17 w4_cblas.cpp

clean:
        rm *.o
 

To execute this Makefile, enter the command

 > make

To run the executable, enter the command

 > w4_cblas 4

where the argument is the number of rows/columns in the square matrices.
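
The skeleton reads this argument with std::atoi(), which returns 0 for text that is not a number.  If you want to reject such input explicitly, one possible approach is sketched below.  The function name parseSize and its convention of returning 0 for invalid input are assumptions made for this illustration only.

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>

// Hypothetical helper: returns the matrix size, or 0 if the text is not
// a positive integer that fits in an int.
int parseSize(const char* text)
{
    if (text == nullptr || *text == '\0') return 0;
    errno = 0;
    char* end = nullptr;
    long value = std::strtol(text, &end, 10);
    if (errno == ERANGE) return 0;               // outside the range of long
    if (*end != '\0') return 0;                  // characters left unparsed
    if (value <= 0 || value > INT_MAX) return 0; // not a usable size
    return static_cast<int>(value);
}
```

A caller could treat a return value of 0 the same way main() treats a wrong argument count: print the usage message and return 1.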

Test Results

Test the executable for a command line argument of 4.  The results should look something like:

 matrix-matrix multiplication - took - 0 millisecs
 A :
     0.0000    4.0000    8.0000   12.0000
     1.0000    5.0000    9.0000   13.0000
     2.0000    6.0000   10.0000   14.0000
     3.0000    7.0000   11.0000   15.0000

 B :
     0.0000    4.0000    8.0000   12.0000
     1.0000    5.0000    9.0000   13.0000
     2.0000    6.0000   10.0000   14.0000
     3.0000    7.0000   11.0000   15.0000

 C = A B :
    56.0000  152.0000  248.0000  344.0000
    62.0000  174.0000  286.0000  398.0000
    68.0000  196.0000  324.0000  452.0000
    74.0000  218.0000  362.0000  506.0000
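
To check a library result independently, the product can also be computed with a plain triple loop.  The sketch below is a reference calculation only, not a BLAS call.  The names referenceMultiply and sequentialMatrix are hypothetical, and the matrices are assumed to be square and stored in column-major order.

```cpp
#include <vector>

// Hypothetical reference routine (not a BLAS call): C = A * B for
// n x n matrices stored in column-major order.
std::vector<float> referenceMultiply(const std::vector<float>& A,
                                     const std::vector<float>& B, int n)
{
    std::vector<float> C(n * n, 0.0f);
    for (int c = 0; c < n; c++)          // column of C
        for (int k = 0; k < n; k++)      // inner dimension
            for (int r = 0; r < n; r++)  // row of C
                C[c * n + r] += A[k * n + r] * B[c * n + k];
    return C;
}

// Fill a matrix the way the workshop's main() does: element kk holds kk.
std::vector<float> sequentialMatrix(int n)
{
    std::vector<float> M(n * n);
    for (int kk = 0; kk < n * n; kk++)
        M[kk] = static_cast<float>(kk);
    return M;
}
```

For n = 4, multiplying sequentialMatrix(4) by itself gives 56 for element (0, 0) and 506 for element (3, 3), matching the first and last entries of C above.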

cuBLAS Version

Modify your cblas version by replacing its definition of the sgemm() function with a definition that uses the cuBLAS calls needed to invoke the equivalent cublasSgemm() library function.

 // Level 3 cuBLAS - Workshop 4
 // w4_cublas.cu

 // ...

 // matrix multiply
 //
 void sgemm(const float* h_A, const float* h_B, float* h_C, int n) {
     steady_clock::time_point ts, te;

     // level 3 calculation: C = alpha * A * B + beta * C
     
     ts = steady_clock::now();
     // ... allocate memory on the device
     te = steady_clock::now();
     reportTime("allocation of device memory for matrices d_A, d_B and d_C",
      te - ts);

     // ... create cuBLAS context

     ts = steady_clock::now();
     // ... copy host matrices to the device
     te = steady_clock::now();
     reportTime("copying of matrices h_A and h_B to device memory", te - ts);

     ts = steady_clock::now();
     // ... calculate matrix-matrix product
     te = steady_clock::now();
     reportTime("matrix-matrix multiplication", te - ts);

     ts = steady_clock::now();
     // ... copy result matrix from the device to the host
     te = steady_clock::now();
     reportTime("copying of matrix d_C from device", te - ts);

     // ... destroy cuBLAS context

     ts = steady_clock::now();
     // ... deallocate device memory
     te = steady_clock::now();
     reportTime("deallocation of device memory for matrices d_A, d_B and d_C",
      te - ts);
     
 } 

The instructions to build this version of your program can be found in the chapter entitled CUDA Libraries.
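
The skeleton times each phase with its own pair of steady_clock readings.  The sketch below wraps that pattern in a small helper so that every phase starts from a fresh time stamp.  The names elapsedMs and timed are hypothetical and are not part of the workshop code.  Some GPU library calls may return control to the host before the device has finished, so confirm in the CUDA documentation whether the work being timed needs an explicit synchronization step inside the callable.

```cpp
#include <chrono>
#include <iostream>

// Hypothetical helper: runs work() once and returns the elapsed
// wall-clock time in milliseconds, measured with steady_clock.
template <typename Work>
long long elapsedMs(Work work)
{
    using namespace std::chrono;
    auto ts = steady_clock::now();
    work();
    auto te = steady_clock::now();
    return duration_cast<milliseconds>(te - ts).count();
}

// Prints the elapsed time in the same message format as reportTime().
template <typename Work>
void timed(const char* msg, Work work)
{
    std::cout << msg << " - took - " << elapsedMs(work)
              << " millisecs" << std::endl;
}
```

A phase would then be passed as a lambda whose body contains the calls for that phase, with the phase description as the first argument to timed().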

Test Results

Test the executable for a command-line argument of 4 and compare the results to those shown above.

Comparison

Run each version for the matrix sizes listed below.  Record the reported matrix-multiplication elapsed time for each size and each version. 

n       cblas (ms)    cuBLAS (ms)
500
1000
1500
2000
2500
3000
3500
4000
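
The programs report times in milliseconds, while the chart calls for seconds.  If you prefer to prepare the rows for import as comma-separated values rather than typing them in, a conversion along the following lines may help.  The function name csvRow and the three-decimal format are assumptions made for this sketch.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: builds one CSV row "n,cblas_seconds,cublas_seconds"
// from two times recorded in milliseconds.
std::string csvRow(int n, long long cblasMs, long long cublasMs)
{
    std::ostringstream out;
    out << std::fixed << std::setprecision(3)
        << n << ',' << cblasMs / 1000.0 << ',' << cublasMs / 1000.0;
    return out.str();
}
```

Call csvRow() once per matrix size with your own measurements.  A recorded time of 1500 milliseconds, for example, is written as 1.500.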

Save this table in a spreadsheet file named w4.ods or w4.xls.  Prepare a 3D Look Realistic column chart showing the elapsed times in seconds (the programs report milliseconds) along the vertical axis and the number of rows/columns (n) along the horizontal axis as shown below.

Chart

You can create the chart in Open Office using the following steps:

  • Highlight data and labels
  • Select Chart in the Toolbar
  • Chart Type - check 3D Look Realistic Column
  • Data Range - 1st row as label, 1st column as label
  • Chart Elements - add your title, your subtitle, your axes labels

You can create the chart in Excel using the following steps:

  • Select Insert Tab -> Column -> 3D Clustered Column
  • Select Data -> remove n -> select edit on horizontal axis labels -> add n column (500-4000)
  • Select Chart tools -> Layout -> Chart Title - enter title and subtitle
  • Select Chart tools -> Layout -> Axis Titles -> Select axis - enter axis label

Save your chart as part of your spreadsheet file.


Submission

Copy the results of your initial tests for both versions into a file named w4.txt.  This file should include

  • a listing of your cblas version
  • output from running your cblas version
  • a listing of your cuBLAS version
  • output from running your cuBLAS version

Upload your typescript to Blackboard: 

  • Login to
  • Select your course code
  • Select Workshop 4 under Workshops
  • Upload w4.txt and w4.ods or w4.xls
  • Under "Add Comments" describe to your instructor in detail what you have learned in completing this workshop. 
  • When ready to submit, press "Submit"






  Designed by Chris Szalwinski
Creative Commons License