Workshop 8

Thread Divergence


In this workshop, you evaluate the effects of thread divergence on a reduction algorithm. 


Learning Outcomes

Upon successful completion of this workshop, you will have demonstrated the abilities to

  1. describe the partitioning of threads within a thread block
  2. reduce accesses to global memory through shared memory programming
  3. write kernel code that minimizes thread divergence
  4. summarize what you think that you have learned in completing this workshop

Specifications

This workshop consists of two parts:

  1. write a pair of kernels to calculate the dot product of two vectors using shared memory
  2. upgrade the kernels to minimize thread divergence during reduction operations

Shared Memory

Complete the following program by adding two kernels:

  • the first kernel calculates the product of corresponding elements in the first two arrays, stores each product in shared memory, accumulates the products in shared memory, and stores the block's partial sum in the third array
  • the second kernel accumulates the data stored in each element of the array received and stores the result in the first element of that array
 // Thread Divergence - Workshop 8
 // w8.1.cu

 #include <iostream>
 #include <cstdlib>
 #include <cuda_runtime.h>
 // to remove intellisense highlighting
 #include <device_launch_parameters.h>
 #ifndef __CUDACC__
 #define __CUDACC__
 #endif
 #include <device_functions.h>

 const int ntpb = 1024; // number of threads per block

 void init(float* a, int n) {
     float f = 1.0f / RAND_MAX;
     for (int i = 0; i < n; i++)
         a[i] = std::rand() * f; // [0.0f, 1.0f]
 }

 // calculate the dot product block by block
 __global__ void dotProduct(const float* a, const float* b, float* c, int n) {
     // store the product of a[i] and b[i] in shared memory
     // sum the data in shared memory
     // store the sum in c[blockIdx.x]
 }

 // accumulate the block sums
 __global__ void accumulate(float* c, int n) {
     // store the elements of c[] in shared memory
     // sum the data in shared memory
     // store the sum in c[0]
 }

 int main(int argc, char** argv) {
     // interpret command-line arguments
     if (argc != 2) {
         std::cerr << argv[0] << ": invalid number of arguments\n"; 
         std::cerr << "Usage: " << argv[0] << "  size_of_vectors\n"; 
         return 1;
     }
     int n = std::atoi(argv[1]);
     int nblocks = (n + ntpb - 1) / ntpb;
     if (nblocks > ntpb) {
         nblocks = ntpb;
         n = nblocks * ntpb;
     }

     // host vectors
     float* h_a = new float[n];
     float* h_b = new float[n];
     init(h_a, n);
     init(h_b, n);
     // device vectors (d_a, d_b, d_c)
     float* d_a;
     float* d_b;
     float* d_c;
     cudaMalloc((void**)&d_a, n * sizeof(float));
     cudaMalloc((void**)&d_b, n * sizeof(float));
     cudaMalloc((void**)&d_c, nblocks * sizeof(float));

     // copy from host to device h_a -> d_a, h_b -> d_b
     cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
     cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);

     // dot product on the device
     dotProduct<<<nblocks, ntpb>>>(d_a, d_b, d_c, n);

     // synchronize
     cudaDeviceSynchronize();

     // accumulate the block sums on the device
     accumulate<<<1, nblocks>>>(d_c, nblocks);

     // copy from device to host d_c[0] -> h_c
     float h_c;
     cudaMemcpy(&h_c, d_c, sizeof(float), cudaMemcpyDeviceToHost); 

     float hx = 0.f;
     for (int i = 0; i < n; i++)
         hx += h_a[i] * h_b[i];
     // compare results
     std::cout << "Device = " << h_c << " Host = " << hx << std::endl; 

     // free device memory
     cudaFree(d_a);
     cudaFree(d_b);
     cudaFree(d_c);

     // free host memory
     delete [] h_a;
     delete [] h_b;

     // reset the device
     cudaDeviceReset();
 }

Compile and test your code, checking that it calculates the same result on the device as it does on the host. 
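If you are unsure where to start, one possible shape for the first kernel is sketched below. This is an assumption about one workable approach, not the required solution: it uses the classic interleaved-addressing reduction, which deliberately exhibits the thread divergence that the second part of this workshop asks you to remove. The `accumulate` kernel follows the same pattern with a single block.

```cuda
 // Sketch only - one way to write the first kernel (interleaved addressing).
 // The modulo test diverges: at each step, only some threads of a warp work.
 __global__ void dotProduct(const float* a, const float* b, float* c, int n) {
     __shared__ float s[ntpb];                 // one slot per thread in the block
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     int t = threadIdx.x;
     s[t] = (i < n) ? a[i] * b[i] : 0.0f;      // pad out-of-range threads with 0
     __syncthreads();                          // all products stored before summing
     for (int stride = 1; stride < blockDim.x; stride *= 2) {
         if (t % (2 * stride) == 0)            // divergent branch within each warp
             s[t] += s[t + stride];
         __syncthreads();
     }
     if (t == 0)
         c[blockIdx.x] = s[0];                 // one partial sum per block
 }
```

Note the `__syncthreads()` inside the loop: every thread in the block must reach it, so it sits outside the `if`, not inside.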

Minimize Thread Divergence

Copy your first solution to a file named w8.2.cu.  Upgrade the kernels in this new file to minimize thread divergence.  For this upgrade, you will need to account for warp partitioning of the threads within a block.
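Recall that the hardware partitions each block's threads into warps of 32 consecutive threads (threads 0-31 form warp 0, threads 32-63 form warp 1, and so on), and that a branch costs extra passes only when threads within the same warp take different paths. For example, any kernel can identify a thread's warp and its position within that warp as follows:

```cuda
 // Sketch: how a block's threads map onto warps (warpSize is 32 on current GPUs)
 int warp = threadIdx.x / warpSize; // index of this thread's warp within the block
 int lane = threadIdx.x % warpSize; // this thread's position within its warp
```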

 // Thread Divergence - Minimized - Workshop 8
 // w8.2.cu

 // ...

 // calculate the dot product block by block
 __global__ void dotProduct(const float* a, const float* b, float* c, int n)  { 

     // upgrade your original kernel here

 }

 // accumulate the block sums
 __global__ void accumulate(float* c, int n) {

     // upgrade your original kernel here

 }

 int main(int argc, char** argv) {
         // same as above ...
 }

Compile and test your code to ensure that it calculates the same result on the device as on the host. 
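As a hint (a sketch assuming the block size is a power of two, not the required solution), replacing the interleaved loop with sequential addressing keeps the active threads contiguous, so entire warps retire together and divergence is confined to the final few steps:

```cuda
 // Sketch only - reduction loop with sequential addressing.
 // Active threads are always 0..stride-1, so whole warps stay converged
 // until stride drops below warpSize.
 for (int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
     if (t < stride)                // contiguous threads remain active
         s[t] += s[t + stride];
     __syncthreads();               // wait before halving the stride
 }
 if (t == 0)
     c[blockIdx.x] = s[0];
```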

Profiles

Start the Visual Profiler by entering the following at the command line:

 nvvp

Complete the table below from the dotProduct kernel times reported by the profiler.

 n       With TD    Minimal TD
 2500
 5000
 7500
 10000

Prepare a 3D look realistic column chart plotting the memcpy, kernel and session times against the number of elements in the two vectors (n) along the horizontal axis as shown below. 

[Sample chart: Thread Divergence]

You can create the chart in Open Office using the following steps:

  • Highlight data and labels
  • Select Chart in the Toolbar
  • Chart Type - check 3D Look Realistic Column
  • Data Range - 1st row as label, 1st column as label
  • Chart Elements - add title, subtitle, axes labels

You can create the chart in Excel using the following steps:

  • Select Insert Tab -> Column -> 3D Clustered Column
  • Select Data -> remove n -> select edit on horizontal axis labels -> add n column
  • Select Chart tools -> Layout -> Chart Title - enter title and subtitle
  • Select Chart tools -> Layout -> Axis Titles -> Select axis - enter axis label

Save your chart as part of your spreadsheet file.


SUBMISSION

Copy the results of your tests for both versions into a file named w8.txt.  This file should include

  • your userid
  • console output from running your test cases

Upload your files to Blackboard: 

  • Login to
  • Select your course code
  • Select Workshop 8 under Workshops
  • Upload w8.txt and w8.ods or w8.xls
  • Under "Add Comments" write a short note to your instructor:  Add a sentence or two describing what you think you have learned in this workshop.
  • When ready to submit, press "Submit"






  Designed by Chris Szalwinski
  Creative Commons License