Workshop 10

CUDA to OpenCL

In this workshop, you port the matrix product solution developed for Workshop 9 to OpenCL.

Learning Outcomes

Upon successful completion of this workshop, you will have demonstrated the abilities to

  1. write platform-independent code that executes on the device
  2. use OpenCL API functions to access the device hardware
  3. include OpenCL error handling facilities
  4. summarize what you think you have learned in completing this workshop


For this workshop, store your host code and your device code kernels in separate files.  Your host code accepts two command-line arguments: a size multiplier (the host sets n to this value times ntpb) and the name of the device code file containing the test kernel.

Complete the following host code.  Include profiling of the kernel launch and error handling.  The checkError() function reports the error and exits immediately if the status is not CL_SUCCESS.

 // Workshop 10 - Matrix Multiply using OpenCL
 // w10.cpp

 #include <iostream>
 #include <iomanip>
 #include <fstream>
 #include <cstdlib>
 // add OpenCL header file

 using namespace std;

 const int ntpb = 16;  // number of work-items per work-group

 inline void checkError(cl_int status, const char* name) {
    if (status != CL_SUCCESS) {
        std::cout << "Error: " << name << " (" << status << ") " << std::endl; 
        switch (status) {
            case CL_SUCCESS:
                std::cout << "Success!"; break;
            case CL_DEVICE_NOT_FOUND:
                std::cout << "Device not found."; break;
            case CL_DEVICE_NOT_AVAILABLE:
                std::cout << "Device not available"; break;
            case CL_COMPILER_NOT_AVAILABLE:
                std::cout << "Compiler not available"; break;
            case CL_MEM_OBJECT_ALLOCATION_FAILURE:
                std::cout << "Memory object allocation failure"; break;
            case CL_OUT_OF_RESOURCES:
                std::cout << "Out of resources"; break;
            case CL_OUT_OF_HOST_MEMORY:
                std::cout << "Out of host memory"; break;
            case CL_PROFILING_INFO_NOT_AVAILABLE:
                std::cout << "Profiling information not available"; break;
            case CL_MEM_COPY_OVERLAP:
                std::cout << "Memory copy overlap"; break;
            case CL_IMAGE_FORMAT_MISMATCH:
                std::cout << "Image format mismatch"; break;
            case CL_IMAGE_FORMAT_NOT_SUPPORTED:
                std::cout << "Image format not supported"; break;
            case CL_BUILD_PROGRAM_FAILURE:
                std::cout << "Program build failure"; break;
            case CL_MAP_FAILURE:
                std::cout << "Map failure"; break;
            case CL_INVALID_VALUE:
                std::cout << "Invalid value"; break;
            case CL_INVALID_DEVICE_TYPE:
                std::cout << "Invalid device type"; break;
            case CL_INVALID_PLATFORM:
                std::cout << "Invalid platform"; break;
            case CL_INVALID_DEVICE:
                std::cout << "Invalid device"; break;
            case CL_INVALID_CONTEXT:
                std::cout << "Invalid context"; break;
            case CL_INVALID_QUEUE_PROPERTIES:
                std::cout << "Invalid queue properties"; break;
            case CL_INVALID_COMMAND_QUEUE:
                std::cout << "Invalid command queue"; break;
            case CL_INVALID_HOST_PTR:
                std::cout << "Invalid host pointer"; break;
            case CL_INVALID_MEM_OBJECT:
                std::cout << "Invalid memory object"; break;
            case CL_INVALID_IMAGE_FORMAT_DESCRIPTOR:
                std::cout << "Invalid image format descriptor"; break;
            case CL_INVALID_IMAGE_SIZE:
                std::cout << "Invalid image size"; break;
            case CL_INVALID_SAMPLER:
                std::cout << "Invalid sampler"; break;
            case CL_INVALID_BINARY:
                std::cout << "Invalid binary"; break;
            case CL_INVALID_BUILD_OPTIONS:
                std::cout << "Invalid build options"; break;
            case CL_INVALID_PROGRAM:
                std::cout << "Invalid program"; break;
            case CL_INVALID_PROGRAM_EXECUTABLE:
                std::cout << "Invalid program executable"; break;
            case CL_INVALID_KERNEL_NAME:
                std::cout << "Invalid kernel name"; break;
            case CL_INVALID_KERNEL_DEFINITION:
                std::cout << "Invalid kernel definition"; break;
            case CL_INVALID_KERNEL:
                std::cout << "Invalid kernel"; break;
            case CL_INVALID_ARG_INDEX:
                std::cout << "Invalid argument index"; break;
            case CL_INVALID_ARG_VALUE:
                std::cout << "Invalid argument value"; break;
            case CL_INVALID_ARG_SIZE:
                std::cout << "Invalid argument size"; break;
            case CL_INVALID_KERNEL_ARGS:
                std::cout << "Invalid kernel arguments"; break;
            case CL_INVALID_WORK_DIMENSION:
                std::cout << "Invalid work dimension"; break;
            case CL_INVALID_WORK_GROUP_SIZE:
                std::cout << "Invalid work group size"; break;
            case CL_INVALID_WORK_ITEM_SIZE:
                std::cout << "Invalid work item size"; break;
            case CL_INVALID_GLOBAL_OFFSET:
                std::cout << "Invalid global offset"; break;
            case CL_INVALID_EVENT_WAIT_LIST:
                std::cout << "Invalid event wait list"; break;
            case CL_INVALID_EVENT:
                std::cout << "Invalid event"; break;
            case CL_INVALID_OPERATION:
                std::cout << "Invalid operation"; break;
            case CL_INVALID_GL_OBJECT:
                std::cout << "Invalid OpenGL object"; break;
            case CL_INVALID_BUFFER_SIZE:
                std::cout << "Invalid buffer size"; break;
            case CL_INVALID_MIP_LEVEL:
                std::cout << "Invalid mip-map level"; break;
            default: std::cout << "Unknown";
        }
        std::cout << std::endl;
        exit(EXIT_FAILURE);
    }
 }

 int main(int argc, char* argv[]) {
     if (argc != 3) {
         std::cerr << "***Incorrect number of arguments***\n";
         return 1;
     }
     int   n = atoi(argv[1]) * ntpb;
     int  nb = n * n * sizeof(float);
     float run_time_gpu;
     // allocate host memory
     float* a = new float[n * n];
     float* b = new float[n * n];
     float* c = new float[n * n];
     // initialize host memory
     for (int i = 0; i < n * n; i++)
         a[i] = b[i] = 0;
     for (int i = 0; i < n * n; i += n + 1)
         a[i] = b[i] = 1.0f;

     // Load Device Program from argv[2]
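     ifstream f(argv[2]);
     char cc;
     size_t size = 0;
     // first pass: count the characters (the final failed read is counted too)
     while (f) {
         f.get(cc);
         size++;
     }
     f.clear();   // reset the eof and fail flags
     f.seekg(0);  // rewind to the start of the file
     char* src = new char[size+1];
     size = 0;
     // second pass: read the source text into src[]
     while (f)
         f.get(src[size++]);
     if (size) src[--size] = '\0'; // overwrite eof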

     // Platform Model

     // get platform info

     // get device info

     // Execution Model

     // create context

     // create command queue for the device

     // create memory buffers on the device

     // Program Model

     // create program from src[]

     // build program

     // if errors encountered build log and send to output

     // create kernel

     // set kernel arguments

     // Execute

     // copy to buffers on the device

     // define execution configuration

     // launch kernel

     // extract profiling information (run_time_gpu)

     // copy to host memory (c) from the device buffer

     // Release OpenCL Resources

     // add code here

     // output errors only
     int ne = 0;
     std::cout << fixed << setprecision(6);
     for (int i = 0; i < n * n; i += n + 1)
         if (c[i] != 1.0f)
             std::cout << setw(3) << ++ne << ' ' <<
                  c[i] << endl;
     if (ne)
         std::cout << ne << " Errors encountered" << endl;
     else {
         std::cout << "No Errors encountered" << endl;
         std::cout << argv[2] << " kernel took " <<
          run_time_gpu << " microsecs" << endl;
     }

     // deallocate host memory
     delete [] a;
     delete [] b;
     delete [] c;
     delete [] src;
 }
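The execution configuration step in the host code pairs a global size with a work-group size. As a minimal sketch, assuming ntpb is used as the work-group size along each dimension and the global offset is zero, the helpers below show the arithmetic that relates the two. The names globalSize, globalId and rowMajorOffset are hypothetical and are not OpenCL functions; globalId mirrors the value that get_global_id() reports inside a kernel. Because the host sets n to a multiple of ntpb, the global size divides evenly by the work-group size, which OpenCL 1.x requires.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: global size along one dimension for a given
// number of work-groups and work-group size (n = groups * ntpb in the host code).
std::size_t globalSize(std::size_t groups, std::size_t localSize) {
    return groups * localSize;
}

// Hypothetical helper: global index of a work-item along one dimension.
// Mirrors get_global_id(d) == get_group_id(d) * get_local_size(d) + get_local_id(d)
// when the global offset is zero (assumption).
std::size_t globalId(std::size_t groupId, std::size_t localSize, std::size_t localId) {
    assert(localId < localSize);  // a local index never reaches the work-group size
    return groupId * localSize + localId;
}

// Hypothetical helper: offset of element (row, col) in an n x n matrix
// stored in row-major order, as the host arrays a, b and c are.
std::size_t rowMajorOffset(std::size_t row, std::size_t col, std::size_t n) {
    assert(row < n && col < n);
    return row * n + col;
}
```

Which dimension a kernel treats as the row and which as the column is a design choice; it determines whether neighbouring work-items touch neighbouring addresses.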

Compile your application at the command line using the following command:

 > nvcc w10.cpp -lOpenCL


Write three OpenCL kernels to calculate each coefficient of a square matrix that is the result of a product of two square matrices:

  1. naive multiplication accessing global memory directly
  2. multiplication using local memory (the OpenCL counterpart of CUDA shared memory) without coalesced access to global memory
  3. multiplication using local memory with coalesced access to global memory
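The two strategies in the list above can be sketched on the host before you write any device code. This is a sequential C++ emulation, not an OpenCL kernel. The function names matMulNaive and matMulTiled are hypothetical, and the sketch assumes row-major storage and that n is a multiple of the tile size, as the host code arranges. In matMulTiled, each tile-sized block stands in for one work-group, and the ta and tb arrays stand in for that work-group's local memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive product: each (row, col) pair plays the role of one work-item
// that reads a and b directly, as the first kernel reads global memory.
std::vector<float> matMulNaive(const std::vector<float>& a,
                               const std::vector<float>& b, int n) {
    std::vector<float> c(static_cast<std::size_t>(n) * n, 0.0f);
    for (int row = 0; row < n; row++)
        for (int col = 0; col < n; col++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += a[row * n + k] * b[k * n + col];
            c[row * n + col] = sum;
        }
    return c;
}

// Tiled product: each tile x tile block plays the role of one work-group.
std::vector<float> matMulTiled(const std::vector<float>& a,
                               const std::vector<float>& b, int n, int tile) {
    assert(tile > 0 && n % tile == 0);  // assumption: n is a multiple of tile
    std::vector<float> c(static_cast<std::size_t>(n) * n, 0.0f);
    std::vector<float> ta(static_cast<std::size_t>(tile) * tile);  // stand-in for local memory
    std::vector<float> tb(static_cast<std::size_t>(tile) * tile);  // stand-in for local memory
    for (int gr = 0; gr < n; gr += tile)            // first row of the work-group
        for (int gc = 0; gc < n; gc += tile)        // first column of the work-group
            for (int t = 0; t < n; t += tile) {     // step along the shared dimension
                // load phase: each work-item copies one element of a and one of b
                for (int lr = 0; lr < tile; lr++)
                    for (int lc = 0; lc < tile; lc++) {
                        ta[lr * tile + lc] = a[(gr + lr) * n + (t + lc)];
                        tb[lr * tile + lc] = b[(t + lr) * n + (gc + lc)];
                    }
                // a device kernel needs a work-group barrier here, before the tiles are read
                // compute phase: accumulate the partial dot product held in the tiles
                for (int lr = 0; lr < tile; lr++)
                    for (int lc = 0; lc < tile; lc++) {
                        float sum = 0.0f;
                        for (int k = 0; k < tile; k++)
                            sum += ta[lr * tile + k] * tb[k * tile + lc];
                        c[(gr + lr) * n + (gc + lc)] += sum;
                    }
                // and a second barrier here, before the next load overwrites the tiles
            }
    return c;
}
```

Coalescing depends on which work-item loads which element, so a sequential emulation cannot show its effect; only the kernel timings on the device will.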


Run the executable with the different kernels for the sizes listed in the table below and report the kernel time in microseconds. 
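OpenCL event profiling reports its timestamps in nanoseconds, so a conversion is needed before you report microseconds. The helper below is a minimal sketch of that conversion. The name elapsedMicroseconds is hypothetical, and std::uint64_t stands in for cl_ulong so that the sketch compiles without the OpenCL header. In your host code the two values would come from clGetEventProfilingInfo() with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END, on a queue created with CL_QUEUE_PROFILING_ENABLE.

```cpp
#include <cassert>
#include <cstdint>

// Converts a pair of profiling timestamps (nanoseconds) to elapsed microseconds.
// std::uint64_t is an assumed stand-in for cl_ulong.
float elapsedMicroseconds(std::uint64_t startNs, std::uint64_t endNs) {
    assert(endNs >= startNs);  // the end timestamp cannot precede the start
    return static_cast<float>(static_cast<double>(endNs - startNs) / 1000.0);
}
```

For example, timestamps of 1000 ns and 3500 ns give 2.5 microseconds, a value that could be stored in run_time_gpu.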

n | Naive | Without Coalesced Access | With Coalesced Access

Prepare a 3D Look Realistic column chart that plots the kernel times against n along the horizontal axis.


You can create the chart in Open Office using the following steps:

  • Highlight data and labels
  • Select Chart in the Toolbar
  • Chart Type - check 3D Look Realistic Column
  • Data Range - 1st row as label, 1st column as label
  • Chart Elements - add title, subtitle, axes labels

Save your chart as part of your spreadsheet file.


Copy your source code for each version into a file named w10.txt.  This file should include

  • your userid
  • your OpenCL source code for the application
  • your OpenCL source code for each kernel
  • the output from compiling your code

Upload your typescript to Blackboard: 

  • Log in to Blackboard
  • Select your course code
  • Select Workshop 10 under Workshops
  • Upload w10.txt and w10.ods or w10.xls
  • Under "Add Comments" write a short note to your instructor:  Add a sentence or two describing what you think you have learned in this workshop.
  • When ready to submit, press "Submit"

  Designed by Chris Szalwinski
Creative Commons License