Workshop 7

Reduction


In this workshop, you code kernels that calculate the dot product of two vectors. 


Learning Outcomes

Upon successful completion of this workshop, you will have demonstrated the abilities

  • to scale a problem by reducing it to a set of smaller problems
  • to store intermediate results for blocks of cooperating threads
  • to use shared memory to improve the CGMA ratio
  • to summarize what you think you have learned in completing this workshop

Specifications

This workshop consists of three parts:

  • a small vector solution that uses a single block of cooperating threads
  • a scaled solution that uses multiple blocks of cooperating threads
  • an efficient scaled solution that uses shared memory

Small Vector Reduction

The incomplete program listed below populates each of two host vectors (stored as arrays) with the same number of random values.  The user specifies the size of each vector on the command line.  The program copies the host data to the device, calculates the dot product of the two vectors, retrieves the result from the device, and compares it to the dot product calculated on the host. 

The grid configuration is a single block with the number of threads equal to the number of elements in each vector.

Kernel

The kernel calculates the dot product on a set of cooperating threads in two steps:

  1. each thread calculates the product of the identically indexed elements of the two vectors
  2. the threads cooperatively sum the products

The kernel stores the product in one of the vectors and performs the reduction on the elements of that vector, leaving the result in its first element. 
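
As a rough guide only, a kernel along the following lines would satisfy this description.  The name dotProduct and its parameter list are illustrative assumptions rather than requirements, and the grid is assumed to be the single block of n threads described above.

 // sketch only - a single block of n cooperating threads
 __global__ void dotProduct(float* a, const float* b, int n) {
     int i = threadIdx.x;
     if (i < n)
         a[i] *= b[i];                        // step 1: store the product in a
     __syncthreads();
     // step 2: interleaved pairwise reduction; the sum accumulates in a[0]
     for (int stride = 1; stride < n; stride <<= 1) {
         if (i % (2 * stride) == 0 && i + stride < n)
             a[i] += a[i + stride];
         __syncthreads();
     }
 }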

Your Task

Complete the following source code, assuming that all CUDA calls succeed.  This solution works only for vectors of up to 512 or 1024 elements, depending on your device's compute capability.  Name your source file w7_1.cu.

 // Reduction - Workshop 7
 // w7_1.cu

 #include <iostream>
 #include <cstdlib>
 #include <ctime>
 // CUDA header file



 void init(float* a, int n) {
     float f = 1.0f / RAND_MAX;
     for (int i = 0; i < n; i++)
         a[i] = std::rand() * f; // [0.0f 1.0f]
 }

 // kernel code










 int main(int argc, char** argv) {
     // interpret command-line arguments
     if (argc != 2) {
         std::cerr << argv[0] << ": invalid number of arguments\n"; 
         std::cerr << "Usage: " << argv[0] << "  size_of_vectors\n"; 
         return 1;
     }
     int n = std::atoi(argv[1]);
     std::srand((unsigned)time(nullptr));

     // host vectors
     float* h_a = new float[n];
     float* h_b = new float[n];
     init(h_a, n);
     init(h_b, n);
     // dot product on the host
     float h_h = 0.f;
     for (int i = 0; i < n; i++)
         h_h += h_a[i] * h_b[i];

     // allocate memory for device vectors (d_a[n], d_b[n])



     // copy host vectors to device vectors h_a -> d_a, h_b -> d_b


     // launch the grid of threads


     // copy the result from the device to the host d_a -> h_c
     float h_c;


     // compare the results
     std::cout << "Device = " << h_c << "\nHost   = " << h_h << std::endl; 

     // free device memory



     // free host memory
     delete [] h_a;
     delete [] h_b;
 }
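
The commented sections of main() call for the usual CUDA runtime sequence: allocate device memory, copy the host data across, launch the grid, copy the result back, and free the device memory.  A sketch of those calls, assuming the kernel is named dotProduct as in the sketch above and the CUDA header placeholder becomes #include <cuda_runtime.h>, might read:

 // sketch of the host-side calls (error checking omitted, as permitted above)
 float* d_a = nullptr;
 float* d_b = nullptr;
 cudaMalloc((void**)&d_a, n * sizeof(float));
 cudaMalloc((void**)&d_b, n * sizeof(float));

 cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
 cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);

 dotProduct<<<1, n>>>(d_a, d_b, n);           // one block of n threads
 cudaDeviceSynchronize();

 cudaMemcpy(&h_c, d_a, sizeof(float), cudaMemcpyDeviceToHost);

 cudaFree(d_a);
 cudaFree(d_b);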

Compilation

Compile your solution and test the executable with 200 elements.  The command-line output should look something like this:

 >nvcc w7_1.cu
 w7_1.cu
    Creating library a.lib and object a.exp 

 >a 200
 Device = 46.87
 Host   = 46.87

Scalable Solution

Upgrade your small-vector solution to accommodate vectors larger than the maximum number of threads in a block, starting with the incomplete code listed below.  Name your scalable solution w7_2.cu.

Two Kernels

Pre-set the number of threads per block to ntpb (a pre-defined constant, say 512 or 1024). 

Code two separate kernels.  The first kernel, named product, multiplies corresponding elements of the two vectors and stores each product in the first vector (d_a).  The second kernel, named reduce, sums the elements of d_a produced by the first kernel.  This second kernel reduces all of the elements accessible to a block of threads to a single scalar sum and stores that sum in the element of d_c that corresponds to the block. 
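
One possible shape for the two kernels is sketched below, assuming ntpb threads per block; the parameter lists and the in-place reduction on d_a are illustrative assumptions, not requirements.

 // sketch only - parameter lists are illustrative
 __global__ void product(float* a, const float* b, int n) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n)
         a[i] *= b[i];                        // store the product in a
 }

 __global__ void reduce(float* a, float* c, int n) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     int t = threadIdx.x;
     // interleaved pairwise reduction within the block
     for (int stride = 1; stride < blockDim.x; stride <<= 1) {
         if (t % (2 * stride) == 0 && i + stride < n)
             a[i] += a[i + stride];
         __syncthreads();
     }
     if (t == 0)
         c[blockIdx.x] = a[i];                // one partial sum per block
 }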

Your Task

Complete the following source code for your scalable solution: 

 // Reduction - Workshop 7
 // w7_2.cu

 #include <iostream>
 #include <cstdlib>
 #include <ctime>
 #include <iomanip>
 // CUDA header file

 const int ntpb = 1024; // number of threads per block

 void init(float* a, int n, bool debug) {
     float f = 1.0f / RAND_MAX;
     for (int i = 0; i < n; i++)
         if (debug)
             a[i] = 1.0f;
         else
             a[i] = std::rand() * f; // [0.0f 1.0f]
 }

 // kernel 1 - product










 // kernel 2 - reduction on a single block










 int main(int argc, char** argv) {
     // interpret command-line arguments
     if (argc != 2 && argc != 3) {
         std::cerr << argv[0] << ": invalid number of arguments\n";
         std::cerr << "Usage: " << argv[0] << "  size_of_vectors\n";
         return 1;
     }
     int n = atoi(argv[1]);
     bool debug = argc == 3;
     std::srand((unsigned)time(nullptr));

     int nblks =                     ; // calculate required number of blocks
     // host vectors
     float* h_a = new float[ntpb * nblks];
     float* h_b = new float[ntpb * nblks];
     init(h_a, n, debug);
     init(h_b, n, debug);
     for (int i = n; i < nblks * ntpb; i++) {
         h_a[i] = 0.0f;
         h_b[i] = 0.0f;
     }
     // dot product on the host
     float h_h = 0.f;
     for (int i = 0; i < n; i++)
         h_h += h_a[i] * h_b[i];

     // allocate device vectors (d_a[nblks * ntpb], d_b[n], d_c[nblks])







     // copy from the host to the device h_a -> d_a, h_b -> d_b



     // calculate product (launch kernel 1)


     // reduce products to one value per block (launch kernel 2)


     // intermediate debugging output
     if (debug) {
         float* h_c = new float[nblks];
         cudaMemcpy(h_c, d_c, nblks * sizeof(float), cudaMemcpyDeviceToHost);
         for (int i = 0; i < nblks; i++)
             std::cout << h_c[i] << ' ';
         std::cout << std::endl;
         delete[] h_c;
     }

     // reduce block values to a single value (launch kernel 2 again)


     // copy final result from device to host - from d_c to h_c
     float h_c;


     // report your results
     std::cout << std::fixed << std::setprecision(3);
     std::cout << "Device = " << h_c << "\nHost   = " << h_h << std::endl;

     // free device memory




     // free host memory
     delete[] h_a;
     delete[] h_b;
 }
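
The kernel launches called for in the commented sections of main() form a simple sequence: one product launch across all the blocks, one reduce launch that leaves a partial sum per block in d_c, and a second reduce launch over a single block that collapses d_c to one value.  A sketch, assuming the kernel signatures from the sketch above and a previously calculated nblks:

 // sketch of the launch sequence (error checking omitted)
 product<<<nblks, ntpb>>>(d_a, d_b, n);       // d_a[i] = h_a[i] * h_b[i]
 reduce<<<nblks, ntpb>>>(d_a, d_c, n);        // one partial sum per block -> d_c
 // ... intermediate debugging output goes here ...
 reduce<<<1, ntpb>>>(d_c, d_c, nblks);        // collapse d_c to d_c[0]
 cudaDeviceSynchronize();
 cudaMemcpy(&h_c, d_c, sizeof(float), cudaMemcpyDeviceToHost);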

The first command line argument specifies the number of elements in each vector.  The second argument, if present, requests debug-friendly initialization and output.

Compilation

Compile your completed solution and test the executable with vectors of different sizes in both debug and non-debug modes.  The command-line output should look something like this:

 >nvcc w7_2.cu
    Creating library a.lib and object a.exp 

 >a 2048 x
 1024 1024
 Device = 2048
 Host   = 2048

 >a 2049 x
 1024 1024 1
 Device = 2049
 Host   = 2049

 >a 2048
 Device = 514.223
 Host   = 514.223 

 >a 2049
 Device = 506.374
 Host   = 506.374

Shared Memory Solution

Upgrade your scalable solution to improve the second kernel's CGMA ratio by using shared memory for the cooperating threads.  Name your upgraded source file w7_3.cu.
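
In outline, the upgraded reduce kernel copies its block's elements from global memory into a shared array, reduces the shared copy, and writes only the block's partial sum back to global memory, so the repeated additions no longer touch global memory at all.  A sketch, assuming ntpb is a power of two (512 or 1024) and the same parameter list as before:

 // sketch of a shared-memory reduce - each global element is read only once
 __global__ void reduce(const float* a, float* c, int n) {
     __shared__ float s[ntpb];                // the block's working copy
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     int t = threadIdx.x;
     s[t] = (i < n) ? a[i] : 0.0f;            // load from global memory once
     __syncthreads();
     // contiguous pairwise reduction on the shared copy
     for (int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
         if (t < stride)
             s[t] += s[t + stride];
         __syncthreads();
     }
     if (t == 0)
         c[blockIdx.x] = s[0];                // the block's partial sum
 }

Because each global element is now read exactly once and each block writes back only a single value, the ratio of compute operations to global memory accesses rises compared with the version that reduces d_a in place.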


SUBMISSION

Your solution files for this workshop include:

  1. w7_1.cu - the small-vector solution
  2. w7_2.cu - the scalable solution
  3. w7_3.cu - the scalable solution with shared memory

Create a typescript (see below for details on how to do this) using your w7_3.cu solution file.  Your typescript should include:

  1. a compilation of your source code
  2. a set of test runs for different test cases

Upload your typescript to Blackboard: 

  • Log in to Blackboard
  • Select your course code
  • Select Workshop 7 under Assignments
  • Upload w7.txt
  • Under "Add Comments" describe to your instructor in detail what you have learned in completing this workshop. 
  • When ready to submit, press "Submit"

Windows Typescript

You can create a typescript on Windows by enlarging the cmd.exe screen buffer: click the icon in the upper-left corner of the cmd.exe window, choose Properties, and on the Layout tab set the screen buffer height to 9999 (or another suitably large number).  After completing your session, select all of the window's contents, copy them, and paste them into a text file named w7.txt.






