This post covers a few points about how to measure time in CUDA.
Reading the CUDA documentation, you will find two ways:
- cutStartTimer(myTimer)
- Events
Events are a bit more sophisticated and, if your code uses asynchronous kernels, you must use them. But how can you know whether a code has asynchronous kernels or not?
For a code to be asynchronous, the programmer must create streams and transfer the input data to the device using the instruction:
cudaMemcpyAsync
In conclusion, if the code contains no instructions like ‘cudaStreamCreate’ and ‘cudaMemcpyAsync’, you can assume that your code is synchronous (everything runs in order on the default stream), which simplifies the measurements.
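To make the distinction concrete, here is a minimal sketch of what asynchronous code tends to look like: a stream, pinned host memory, cudaMemcpyAsync, and events recorded in that same stream. The body of square_array is not shown in this post, so the one below is an assumption, and the helper name time_async_pipeline, the CHECK macro and the block size of 256 are hypothetical choices made for illustration.

```cuda
#include <cuda_runtime.h>

// Sketch only: on failure we return early and do not release resources.
#define CHECK(call) do { if ((call) != cudaSuccess) return -1.0f; } while (0)

// Assumed kernel body: the post only shows how square_array is launched.
__global__ void square_array(float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) a[idx] = a[idx] * a[idx];
}

// Hypothetical helper: times copy-in, kernel and copy-out issued on one stream.
// Returns the elapsed time in milliseconds, or -1.0f if a CUDA call fails.
float time_async_pipeline(int n)
{
    size_t size = (size_t) n * sizeof(float);
    float *a_h = nullptr, *a_d = nullptr;
    float ms = 0.0f;
    cudaStream_t stream;
    cudaEvent_t start, stop;
    int block_size = 256;
    int n_blocks = (n + block_size - 1) / block_size;

    // Pinned (page-locked) host memory lets the copies run asynchronously.
    CHECK(cudaMallocHost((void **) &a_h, size));
    CHECK(cudaMalloc((void **) &a_d, size));
    for (int i = 0; i < n; i++) a_h[i] = (float) i;

    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Everything between the two records is queued on the stream;
    // none of these calls waits for the GPU.
    CHECK(cudaEventRecord(start, stream));
    CHECK(cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream));
    square_array<<<n_blocks, block_size, 0, stream>>>(a_d, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(a_h, a_d, size, cudaMemcpyDeviceToHost, stream));
    CHECK(cudaEventRecord(stop, stream));

    // Block the host until the stop event has actually happened.
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaEventElapsedTime(&ms, start, stop));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFree(a_d);
    cudaFreeHost(a_h);
    return ms;
}
```

Calling time_async_pipeline(1 << 16) from main and printing the result gives the time for the whole queued sequence. A host timer wrapped around the same calls, without a synchronization, would mostly measure how long it takes to queue them.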
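Measuring with cut{Start|Stop}Timer
It is very important to call cudaThreadSynchronize() before stopping the timer. A kernel launch returns control to the host immediately, so without the synchronization the timer would stop while the kernel is still running and the measurement would be wrong.
The code is below: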
uint kernelTime;
cutCreateTimer(&kernelTime);
cutResetTimer(kernelTime);
cudaMalloc((void **) &a_d, size); // Allocate array on device
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
cutStartTimer(kernelTime);
// Do calculation on device:
square_array <<< n_blocks, block_size >>> (a_d, N);
cudaThreadSynchronize(); // Wait for the kernel to finish (cudaDeviceSynchronize() in newer toolkits)
cutStopTimer(kernelTime);
// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
cudaFree(a_d);
printf ("Time for the kernel: %f ms\n", cutGetTimerValue(kernelTime));
The output:
[ivan@machine]$ ./timer
Device name : Tesla C2050
Time for selecting the device: 3423.731934 ms
Time for the kernel: 0.068000 ms
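The cut* timer functions come from the cutil helper library that shipped with older SDK samples and, as far as I know, it is not part of recent toolkits. The same kind of host-side measurement can be sketched with std::chrono from the C++ standard library. The example below also shows why the synchronization matters: with wait set to false, only the launch is timed. The helper name host_time_kernel, the assumed body of square_array and the block size are illustrative, not taken from the original code.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Assumed kernel body: the post only shows how square_array is launched.
__global__ void square_array(float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) a[idx] = a[idx] * a[idx];
}

// Hypothetical helper: host-side time in milliseconds around one kernel launch.
// a_d must point to n floats already allocated on the device.
// wait == true  : the timer stops after the kernel has finished.
// wait == false : the timer stops right after the launch call returns.
double host_time_kernel(float *a_d, int n, bool wait)
{
    int block_size = 256;
    int n_blocks = (n + block_size - 1) / block_size;

    // Make sure no earlier GPU work leaks into the measurement.
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    square_array<<<n_blocks, block_size>>>(a_d, n);
    if (wait) cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    // In the no-wait case the kernel may still be running; let it finish.
    cudaDeviceSynchronize();

    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Calling the helper twice on the same buffer, once with wait set to false and once with true, gives a feel for how much of the measured time is launch overhead and how much is the kernel itself.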
Measuring with events
Events are more precise because they are timestamped on the GPU itself rather than on the host. An example of their use is below:
cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaMalloc((void **) &a_d, size); // Allocate array on device
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
cudaEventRecord(start, 0);
// Do calculation on device:
square_array <<< n_blocks, block_size >>> (a_d, N);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
cudaFree(a_d);
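cudaEventElapsedTime(&time, start, stop); // Elapsed time in milliseconds
printf ("Time for the kernel: %f ms\n", time);
cudaEventDestroy(start); // Release the events once they are no longer needed
cudaEventDestroy(stop);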
The output:
[ivan@machine]$ ./event
Device name : Tesla C2050
Time for selecting the device: 3819.466064 ms
Time for the kernel: 0.041632 ms
In short
If the code is synchronous you can use either of the two methods introduced above. When the kernel is a bit more complex, the relative difference between them is smaller:
[ivan@garoe tiempo]$ ./timer
Device name : Tesla C2050
Time for selecting the device: 3824.726074 ms
Time for the kernel: 1.619000 ms
[ivan@garoe tiempo]$ ./event
Device name : Tesla C2050
Time for selecting the device: 3859.903076 ms
Time for the kernel: 1.600992 ms
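To repeat this comparison on your own machine, a rough sketch that measures the same launch with both methods in a single function could look like the following. The helper name compare_timers, the assumed body of square_array, the block size and the warm-up launch are illustrative additions and not part of the original programs. A std::chrono host timer stands in for the cut* timer.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Assumed kernel body: the post only shows how square_array is launched.
__global__ void square_array(float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) a[idx] = a[idx] * a[idx];
}

// Hypothetical helper: times one launch with a host timer and one with events.
// Both results are in milliseconds. Returns false if a CUDA error was detected.
bool compare_timers(int n, double *host_ms, float *event_ms)
{
    size_t size = (size_t) n * sizeof(float);
    float *a_d = nullptr;
    cudaEvent_t start, stop;
    int block_size = 256;
    int n_blocks = (n + block_size - 1) / block_size;

    if (cudaMalloc((void **) &a_d, size) != cudaSuccess) return false;
    cudaMemset(a_d, 0, size);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time startup cost is not charged to either method.
    square_array<<<n_blocks, block_size>>>(a_d, n);
    cudaDeviceSynchronize();

    // Method 1: host timer, synchronizing before the timer is stopped.
    auto t0 = std::chrono::steady_clock::now();
    square_array<<<n_blocks, block_size>>>(a_d, n);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    *host_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();

    // Method 2: events recorded on the default stream.
    cudaEventRecord(start, 0);
    square_array<<<n_blocks, block_size>>>(a_d, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(event_ms, start, stop);

    bool ok = (cudaGetLastError() == cudaSuccess);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a_d);
    return ok;
}
```

The host timer includes the launch and synchronization overhead on the CPU side, so it should usually report a slightly larger value than the events. As the kernel gets longer, that fixed overhead becomes a smaller fraction of the total.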


