It would be good to read the previous post about measuring time on NVIDIA CUDA platforms.
In that post you can find some information about warming up the device and the delays of some instructions.
Some people think that the first kernel call should not be measured because it takes longer than the following ones.
But, how much longer?
Simple kernels
With simple kernels like the following one:
    // Kernel that executes on the CUDA device
    __global__ void simple_kernel(float *a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N) {
            a[idx] += a[idx] * a[idx];
        }
    }
The output:
    [ivan@machine]$ ./warm
    Time for call of the kernel #0: 0.089000 ms
    Time for call of the kernel #1: 0.053000 ms
    Time for call of the kernel #2: 0.051000 ms
    Time for call of the kernel #3: 0.051000 ms
The difference between the first call and the following ones seems important: the first call takes about 68% more time than the rest, which in this case is about 0.04 ms more.
Complex kernels
But, what will happen with a more complex kernel like this?
    // Kernel that executes on the CUDA device
    __global__ void complex_kernel(float *a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N) {
            for (int i = 0; i < 1000; i++) {
                a[idx] += a[idx] * a[idx];
            }
        }
    }
The output:
    [ivan@machine]$ ./warm
    Time for call of the kernel #0: 1.624000 ms
    Time for call of the kernel #1: 1.595000 ms
    Time for call of the kernel #2: 1.589000 ms
    Time for call of the kernel #3: 1.595000 ms
As you can see, the first call again takes about 0.03 ms more, but now that represents only 1.8% more than the following executions!
The warm-up is not here
Is it necessary to avoid measuring the first kernel call? In my opinion, a delay of about 0.03 ms is too small to be considered.
However, you will probably find that some instructions in your CUDA code take much longer than expected. Why? Let's see some examples:
    cutStartTimer(select);
    cudaSetDevice(0);
    cudaThreadSynchronize();
    cutStopTimer(select);

    float *a_h, *a_d;              // Pointers to host & device arrays
    const int N = 10000;           // Number of elements in arrays
    size_t size = N * sizeof(float);
    a_h = (float *)malloc(size);   // Allocate array on host

    // Initialize host array and copy it to CUDA device
    for (int i = 0; i < N; i++)
        a_h[i] = (float)i;

    int block_size = 4;
    int n_blocks = N / block_size + (N % block_size == 0 ? 0 : 1);

    cutStartTimer(Malloc);
    cudaMalloc((void **)&a_d, size);   // Allocate array on device
    cutStopTimer(Malloc);

    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

    // Do calculation on device:
    int i;
    for (i = 0; i < 4; i++) {
        cutStartTimer(kernel);
        complex_kernel <<< n_blocks, block_size >>> (a_d, N);
        cudaThreadSynchronize();
        cutStopTimer(kernel);
        times[i] = cutGetTimerValue(kernel);
        cutResetTimer(kernel);
    }

    // Retrieve result from device and store it in host array
    cudaMemcpy(a_h, a_d, sizeof(float) * N, cudaMemcpyDeviceToHost);
    cudaFree(a_d);

    printf("Time for selecting the device: %f ms\n", cutGetTimerValue(select));
    printf("Time for the malloc: %f ms\n", cutGetTimerValue(Malloc));
    for (i = 0; i < 4; i++)
        printf("Time for call of the kernel #%d: %f ms\n", i, times[i]);
The code above times three operations:

- The cudaSetDevice call
- The cudaMalloc call
- The kernel calls
The output:
    [ivan@machine]$ ./warm
    Time for selecting the device: 3816.443115 ms
    Time for the malloc: 0.116000 ms
    Time for call of the kernel #0: 1.620000 ms
    Time for call of the kernel #1: 1.596000 ms
    Time for call of the kernel #2: 1.585000 ms
    Time for call of the kernel #3: 1.588000 ms
The cudaSetDevice instruction takes about 3.8 s! That is a long time: more than 2000 times longer than a kernel execution! So let's comment out the cudaSetDevice:
    cutStartTimer(select);
    //cudaSetDevice(0);
    //cudaThreadSynchronize();
    cutStopTimer(select);
And its corresponding output:
    [ivan@machine]$ ./warm
    Time for selecting the device: 0.000000 ms
    Time for the malloc: 3824.311035 ms
    Time for call of the kernel #0: 1.618000 ms
    Time for call of the kernel #1: 1.595000 ms
    Time for call of the kernel #2: 1.593000 ms
    Time for call of the kernel #3: 1.588000 ms
What happened? Now the cudaMalloc instruction takes almost 4 s! The overhead has simply moved to what is now the first CUDA call in the program.
First conclusion – The first CUDA instruction takes a long time
The first CUDA instruction of a program (whether it is cudaSetDevice, a kernel launch, or cudaMalloc) takes about 3.8 s more than usual. This overhead occurs only once, on the first CUDA instruction, because the runtime initializes the CUDA context lazily at that point.
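A common way to keep this start-up cost out of your measurements is to force the context initialization with a harmless CUDA call before starting any timer. A minimal sketch, assuming the same a_d, size, and cutil timer handles used in the code above (the cudaFree(0) idiom is a well-known trick for triggering initialization):

```cuda
// Force CUDA context creation before any measurement.
// cudaFree(0) is a harmless runtime call that triggers the
// lazy initialization of the context on the current device.
cudaSetDevice(0);
cudaFree(0);
cudaThreadSynchronize();   // make sure initialization has finished

// From here on, the timers only see "steady state" costs.
cutStartTimer(Malloc);
cudaMalloc((void **)&a_d, size);
cutStopTimer(Malloc);
```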
But is this the only overhead we can find in a CUDA program? No. Let's see what happens if we measure two cudaMalloc calls:
    cutStartTimer(Malloc);
    cudaMalloc((void **)&a_d, size);    // Allocate first array on device
    cudaThreadSynchronize();
    cutStopTimer(Malloc);

    cutStartTimer(secondMalloc);
    cudaMalloc((void **)&a_d2, size);   // Allocate second array on device
    cudaThreadSynchronize();
    cutStopTimer(secondMalloc);

    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(a_d2, a_h2, size, cudaMemcpyHostToDevice);

    // Do calculation on device:
    int i;
    for (i = 0; i < 4; i++) {
        cutStartTimer(kernel);
        complex_kernel <<< n_blocks, block_size >>> (a_d, N);
        complex_kernel <<< n_blocks, block_size >>> (a_d2, N);
        cudaThreadSynchronize();
        cutStopTimer(kernel);
        times[i] = cutGetTimerValue(kernel);
        cutResetTimer(kernel);
    }

    // Retrieve results from device and store them in host arrays
    cudaMemcpy(a_h, a_d, sizeof(float) * N, cudaMemcpyDeviceToHost);
    cudaMemcpy(a_h2, a_d2, sizeof(float) * N, cudaMemcpyDeviceToHost);
    cudaFree(a_d);
    cudaFree(a_d2);
In the preceding code there are two identical cudaMalloc calls, but the measured times are very different:
[ivan@machine]$ ./warm
Time for selecting the device: 3818.165039 ms
Time for the malloc: 0.163000 ms
Time for the 2malloc: 0.026000 ms
Time for call of the kernel #0: 3.207000 ms
Time for call of the kernel #1: 3.166000 ms
Time for call of the kernel #2: 3.165000 ms
Time for call of the kernel #3: 3.165000 ms
The first cudaMalloc takes about six times as long as the second one (0.163 ms vs. 0.026 ms).
Second conclusion – The first cudaMalloc takes longer than the following ones
Maybe this overhead is used to perform some bus initialization.
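Whatever the cause, the cost can be absorbed with a hypothetical warm-up allocation before any timed one. A minimal sketch under the same assumptions as the code above (warmup_d is an illustrative name, not part of the original program):

```cuda
// Hypothetical warm-up allocation: pay the first-allocation
// overhead before any cudaMalloc that we actually want to time.
float *warmup_d;
cudaMalloc((void **)&warmup_d, sizeof(float));
cudaFree(warmup_d);
cudaThreadSynchronize();

// The timed allocation should now behave like the second
// cudaMalloc measured above rather than the first one.
cutStartTimer(Malloc);
cudaMalloc((void **)&a_d, size);
cutStopTimer(Malloc);
```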
Third conclusion – The cudaThreadSynchronize calls are very important
The measured times can differ a lot if cudaThreadSynchronize is not used properly. An example:
    for (i = 0; i < 4; i++) {
        cutStartTimer(kernel);
        complex_kernel <<< n_blocks, block_size >>> (a_d, N);
        cutStopTimer(kernel);
        times[i] = cutGetTimerValue(kernel);
        cutResetTimer(kernel);
    }
The output:
[ivan@machine]$ ./warm
Time for selecting the device: 3829.884033 ms
Time for the malloc: 0.113000 ms
Time for the 2malloc: 0.018000 ms
Time for call of the kernel #0: 0.021000 ms
Time for call of the kernel #1: 0.004000 ms
Time for call of the kernel #2: 0.002000 ms
Time for call of the kernel #3: 0.003000 ms
The corrected code, with a cudaThreadSynchronize after the launch so the timer covers the whole kernel execution:
    for (i = 0; i < 4; i++) {
        cutStartTimer(kernel);
        complex_kernel <<< n_blocks, block_size >>> (a_d, N);
        cudaThreadSynchronize();
        cutStopTimer(kernel);
        times[i] = cutGetTimerValue(kernel);
        cutResetTimer(kernel);
    }
The new output:
[ivan@machine]$ ./warm
Time for selecting the device: 3837.173096 ms
Time for the malloc: 0.159000 ms
Time for the 2malloc: 0.026000 ms
Time for call of the kernel #0: 0.272000 ms
Time for call of the kernel #1: 0.243000 ms
Time for call of the kernel #2: 0.237000 ms
Time for call of the kernel #3: 0.235000 ms
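An alternative to the cutil timers used throughout this post is the CUDA event API, which records timestamps on the GPU itself and therefore handles the asynchronous kernel launch naturally. A sketch, assuming the same complex_kernel, a_d, N, n_blocks, and block_size as in the code above:

```cuda
cudaEvent_t start, stop;
float elapsed_ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                 // enqueue start timestamp
complex_kernel <<< n_blocks, block_size >>> (a_d, N);
cudaEventRecord(stop, 0);                  // enqueue stop timestamp
cudaEventSynchronize(stop);                // wait until stop is reached

cudaEventElapsedTime(&elapsed_ms, start, stop);
printf("Kernel time measured with events: %f ms\n", elapsed_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Because the stop event is recorded in the same stream as the kernel, cudaEventSynchronize plays the role of cudaThreadSynchronize in the corrected code above.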
I hope this post helps you with NVIDIA CUDA delays and overheads, and with understanding time measurements on this platform.
Regarding the third conclusion, I want to add that this occurs because kernel calls are asynchronous: the call returns immediately and execution continues on the CPU side. You have to use cudaThreadSynchronize to make the CPU wait until the kernel execution finishes.
I found this post interesting, especially the first conclusion. Did you do the same test using events?
Best regards,
Pablo.