How to measure time in NVIDIA CUDA?

This post covers the main points of measuring execution time in CUDA.

Reading the CUDA documentation, you will find two ways:

  • cutStartTimer(myTimer)
  • Events

Events are a bit more sophisticated and, if your code uses asynchronous kernels, you must use them. But how can you tell whether your code has an asynchronous kernel or not?

To make code asynchronous, the programmer must create streams and transfer the input data to the device using the instruction:

cudaMemcpyAsync

In conclusion, if the code contains no calls such as ‘cudaStreamCreate’ or ‘cudaMemcpyAsync’, you can assume that it is synchronous (which simplifies the measurements).
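To make the pattern easier to recognize, here is a minimal sketch of what stream-based asynchronous code tends to look like. The kernel body, the helper name run_async, the array size, and the block size are assumptions made for illustration; only cudaStreamCreate and cudaMemcpyAsync come from the text above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: squares each element in place.
__global__ void square_array(float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) a[idx] = a[idx] * a[idx];
}

// Hypothetical helper: copies data in, runs the kernel, and copies data out,
// all queued in one non-default stream. Returns the last element of the result.
float run_async(int n)
{
    size_t size = n * sizeof(float);
    float *a_h, *a_d;
    cudaMallocHost((void **) &a_h, size);  // pinned host memory, so the copies can overlap with host work
    cudaMalloc((void **) &a_d, size);
    for (int i = 0; i < n; i++) a_h[i] = (float) i;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int block_size = 256;                              // assumed launch configuration
    int n_blocks = (n + block_size - 1) / block_size;

    // All three calls return to the host before the work is done.
    cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream);
    square_array <<< n_blocks, block_size, 0, stream >>> (a_d, n);
    cudaMemcpyAsync(a_h, a_d, size, cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);  // wait until everything queued in the stream has finished
    float last = a_h[n - 1];

    cudaStreamDestroy(stream);
    cudaFree(a_d);
    cudaFreeHost(a_h);
    return last;
}

int main()
{
    printf("last element: %f\n", run_async(1024));
    return 0;
}
```

If your code looks like this (a cudaStream_t, the fourth launch parameter between <<< >>>, and cudaMemcpyAsync), treat it as asynchronous and time it with events.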

Measuring with the cut{Start|Stop}Timer

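It is very important to call cudaThreadSynchronize() before stopping the timer to avoid erroneous measurements: the kernel launch returns control to the host immediately, so without the synchronization the timer would stop before the kernel has finished. (In current CUDA releases cudaThreadSynchronize() is deprecated in favor of cudaDeviceSynchronize().)

The code is below: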

  uint kernelTime;
  cutCreateTimer(&kernelTime);
  cutResetTimer(kernelTime);

  cudaMalloc((void **) &a_d, size);   // Allocate array on device
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
  cutStartTimer(kernelTime);
  // Do calculation on device:
  square_array <<< n_blocks, block_size >>> (a_d, N);
  cudaThreadSynchronize();
  cutStopTimer(kernelTime);

  // Retrieve result from device and store it in host array
  cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
  cudaFree(a_d);
  printf ("Time for the kernel: %f ms\n", cutGetTimerValue(kernelTime));

The output:

[ivan@machine]$ ./timer
Device name : Tesla C2050
Time for selecting the device: 3423.731934 ms
Time for the kernel: 0.068000 ms
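The cut* timer functions belong to the CUTIL helper library from the old SDK samples, which may not be available in your installation. As a possible substitute, the same host-side measurement can be sketched with std::chrono from the C++ standard library. This is not the CUTIL implementation, only an equivalent idea; the function name time_kernel_host_ms, the kernel body, and the block size are assumptions.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: squares each element in place.
__global__ void square_array(float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) a[idx] = a[idx] * a[idx];
}

// Hypothetical helper: times one kernel launch with a host clock, in milliseconds.
float time_kernel_host_ms(int n)
{
    size_t size = n * sizeof(float);
    float *a_d;
    cudaMalloc((void **) &a_d, size);
    cudaMemset(a_d, 0, size);

    int block_size = 256;                              // assumed launch configuration
    int n_blocks = (n + block_size - 1) / block_size;

    cudaDeviceSynchronize();  // make sure earlier GPU work is finished before starting the clock
    auto t0 = std::chrono::steady_clock::now();
    square_array <<< n_blocks, block_size >>> (a_d, n);
    cudaDeviceSynchronize();  // the launch returns immediately, so wait for the kernel to finish
    auto t1 = std::chrono::steady_clock::now();

    cudaFree(a_d);
    return std::chrono::duration<float, std::milli>(t1 - t0).count();
}

int main()
{
    printf("Time for the kernel: %f ms\n", time_kernel_host_ms(1 << 20));
    return 0;
}
```

As with the CUTIL timer, the measured value includes the launch overhead seen by the host, so expect it to be somewhat larger than the time the GPU itself reports.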

Measuring with events

Events are more precise because their timestamps are recorded on the GPU itself. An example of their use is below:

  cudaEvent_t start, stop;
  float time;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaMalloc((void **) &a_d, size);   // Allocate array on device
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
  cudaEventRecord(start, 0);
  // Do calculation on device:
  square_array <<< n_blocks, block_size >>> (a_d, N);
  cudaEventRecord(stop, 0);
  cudaEventSynchronize(stop);

  // Retrieve result from device and store it in host array
  cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
  cudaFree(a_d);
  cudaEventElapsedTime(&time, start, stop);
  printf ("Time for the kernel: %f ms\n", time);

The output:

[ivan@machine]$ ./event
Device name : Tesla C2050
Time for selecting the device: 3819.466064 ms
Time for the kernel: 0.041632 ms
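For code that does use streams, the same event calls apply, with the stream passed instead of 0. The following is a minimal sketch under that assumption; the CUDA_CHECK macro, the function name time_kernel_event_ms, the kernel body, and the block size are hypothetical and only there to make the example self-contained.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Hypothetical kernel: squares each element in place.
__global__ void square_array(float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) a[idx] = a[idx] * a[idx];
}

// Hypothetical helper: times one kernel launch in a stream with events, in milliseconds.
float time_kernel_event_ms(int n)
{
    size_t size = n * sizeof(float);
    float *a_d;
    float ms = 0.0f;
    cudaStream_t stream;
    cudaEvent_t start, stop;

    CUDA_CHECK(cudaMalloc((void **) &a_d, size));
    CUDA_CHECK(cudaMemset(a_d, 0, size));
    CUDA_CHECK(cudaStreamCreate(&stream));
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    int block_size = 256;                              // assumed launch configuration
    int n_blocks = (n + block_size - 1) / block_size;

    CUDA_CHECK(cudaEventRecord(start, stream));        // timestamp taken when the stream reaches this point
    square_array <<< n_blocks, block_size, 0, stream >>> (a_d, n);
    CUDA_CHECK(cudaGetLastError());                    // catch launch configuration errors
    CUDA_CHECK(cudaEventRecord(stop, stream));
    CUDA_CHECK(cudaEventSynchronize(stop));            // block the host until the stop event has been recorded
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaStreamDestroy(stream));
    CUDA_CHECK(cudaFree(a_d));
    return ms;
}

int main()
{
    printf("Time for the kernel: %f ms\n", time_kernel_event_ms(1 << 20));
    return 0;
}
```

Because both events are queued in the same stream as the kernel, the elapsed time covers only the work between them, regardless of what the host does in the meantime.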

In short

If the code is synchronous, you can use either of the two methods introduced above. If the kernel is a bit more complex, the difference between them becomes smaller:

[ivan@garoe tiempo]$ ./timer
Device name : Tesla C2050
Time for selecting the device: 3824.726074 ms
Time for the kernel: 1.619000 ms

[ivan@garoe tiempo]$ ./event
Device name : Tesla C2050
Time for selecting the device: 3859.903076 ms
Time for the kernel: 1.600992 ms