How to make GOOD measurements in NVIDIA CUDA?

It would be good to read the previous post about measuring time on the NVIDIA CUDA platform.

In this post you will find some information about warming up the device and the delays of certain instructions.

Some people think that the first kernel call should not be measured because it takes longer than the following ones.

But, how much longer?

Simple kernels

With simple kernels like the following one:

// Kernel that executes on the CUDA device
__global__ void simple_kernel(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        a[idx] += a[idx] * a[idx];
    }
}

The output:

[ivan@machine]$ ./warm
Time for call of the kernel #0: 0.089000 ms
Time for call of the kernel #1: 0.053000 ms
Time for call of the kernel #2: 0.051000 ms
Time for call of the kernel #3: 0.051000 ms

The difference between the first call and the following ones seems significant: the first call takes about 68% longer than the rest, which in this case amounts to roughly 0.04 ms more.
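As a side note, the timings in this post use the CUTIL timers from the SDK, which measure on the CPU side. The same measurement can also be done with CUDA events, which are recorded on the GPU itself. A minimal sketch, assuming the simple_kernel above and an already-allocated device array a_d:

// Sketch: timing a kernel with CUDA events instead of CUTIL timers.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);              // Mark the start on the GPU
simple_kernel <<< n_blocks, block_size >>> (a_d, N);
cudaEventRecord(stop, 0);               // Mark the end on the GPU
cudaEventSynchronize(stop);             // Wait until the kernel has finished

float elapsed_ms;
cudaEventElapsedTime(&elapsed_ms, start, stop);
printf("Kernel time: %f ms\n", elapsed_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);

Because the events are recorded in the GPU's own stream, no explicit cudaThreadSynchronize is needed around the kernel.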

Complex kernels

But what happens with a more complex kernel like this one?

// Kernel that executes on the CUDA device
__global__ void complex_kernel(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        for (int i = 0; i < 1000; i++) {
            a[idx] += a[idx] * a[idx];
        }
    }
}

The output:

[ivan@machine]$ ./warm
Time for call of the kernel #0: 1.624000 ms
Time for call of the kernel #1: 1.595000 ms
Time for call of the kernel #2: 1.589000 ms
Time for call of the kernel #3: 1.595000 ms

As you can see, the first call again takes about 0.03 ms longer. But this time it represents only about 1.8% more than the following executions!

The warm-up is not here

Is it necessary to avoid measuring the first kernel call? In my opinion, a delay of about 0.03 ms is too small to be worth worrying about.

However, you will probably find that some instructions in your CUDA code take longer than expected. Why? Let's look at some examples:

// CUTIL timers, created beforehand with cutCreateTimer
unsigned int select, Malloc, kernel;
cutCreateTimer(&select);
cutCreateTimer(&Malloc);
cutCreateTimer(&kernel);
float times[4];

cutStartTimer(select);
cudaSetDevice(0);
cudaThreadSynchronize();
cutStopTimer(select);

float *a_h, *a_d;                   // Pointers to host & device arrays
const int N = 10000;                // Number of elements in the arrays
size_t size = N * sizeof(float);
a_h = (float *)malloc(size);        // Allocate array on host
// Initialize host array and copy it to the CUDA device
for (int i = 0; i < N; i++) a_h[i] = (float)i;
int block_size = 4;
int n_blocks = N / block_size + (N % block_size == 0 ? 0 : 1);

cutStartTimer(Malloc);
cudaMalloc((void **) &a_d, size);   // Allocate array on device
cutStopTimer(Malloc);

cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
// Do the calculation on the device:
int i;
for (i = 0; i < 4; i++) {
    cutStartTimer(kernel);
    complex_kernel <<< n_blocks, block_size >>> (a_d, N);
    cudaThreadSynchronize();
    cutStopTimer(kernel);
    times[i] = cutGetTimerValue(kernel);
    cutResetTimer(kernel);
}

// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float) * N, cudaMemcpyDeviceToHost);
cudaFree(a_d);

printf(" Time for selecting the device: %f ms\n", cutGetTimerValue(select));
printf("           Time for the malloc: %f ms\n", cutGetTimerValue(Malloc));
for (i = 0; i < 4; i++)
    printf("Time for call of the kernel #%d: %f ms\n", i, times[i]);

As you can see, several instructions are being measured:
  • The cudaSetDevice call
  • The cudaMalloc call
  • The kernel calls

The output:

[ivan@machine]$ ./warm
 Time for selecting the device: 3816.443115 ms
           Time for the malloc: 0.116000 ms
Time for call of the kernel #0: 1.620000 ms
Time for call of the kernel #1: 1.596000 ms
Time for call of the kernel #2: 1.585000 ms
Time for call of the kernel #3: 1.588000 ms

The cudaSetDevice call takes about 3.8 s! That is an enormous amount of time, more than 2000 times longer than a kernel execution! So let's comment out the cudaSetDevice call:

cutStartTimer(select);
//cudaSetDevice(0);
//cudaThreadSynchronize();
cutStopTimer(select);

And its corresponding output:

[ivan@machine]$ ./warm
 Time for selecting the device: 0.000000 ms
           Time for the malloc: 3824.311035 ms
Time for call of the kernel #0: 1.618000 ms
Time for call of the kernel #1: 1.595000 ms
Time for call of the kernel #2: 1.593000 ms
Time for call of the kernel #3: 1.588000 ms

What happened? In this case, the cudaMalloc call takes almost 4 s!

First conclusion – The first CUDA instruction takes a long time

The first CUDA instruction of a program (cudaSetDevice, a kernel launch, cudaMalloc, ...) will take about 3.8 s more than usual, because it triggers the one-time initialization of the CUDA runtime. This overhead only occurs for the first CUDA instruction of a program.
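If you want to keep this one-time initialization out of your measurements, a common trick is to issue a cheap, otherwise useless CUDA call before starting any timer, so the context is created outside the timed region. A sketch under that assumption:

// Force CUDA runtime initialization before any timing starts.
// cudaFree(0) is a cheap call that touches the runtime without
// allocating or freeing anything; any other CUDA call would also work.
// If you need a specific device, call cudaSetDevice first.
cudaFree(0);
cudaThreadSynchronize();

// From here on, cudaMalloc and kernel launches no longer
// pay the one-time initialization cost.
cutStartTimer(Malloc);
cudaMalloc((void **) &a_d, size);
cutStopTimer(Malloc);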

But is this the only overhead we can find in a CUDA program? No. Let's see what happens if we measure two cudaMalloc calls:

cutStartTimer(Malloc);
cudaMalloc((void **) &a_d, size);   // Allocate array on device
cudaThreadSynchronize();
cutStopTimer(Malloc);
cutStartTimer(secondMalloc);
cudaMalloc((void **) &a_d2, size);   // Allocate array on device
cudaThreadSynchronize();
cutStopTimer(secondMalloc);

cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
cudaMemcpy(a_d2, a_h2, size, cudaMemcpyHostToDevice);
// Do the calculation on the device:
int i;
for (i = 0; i < 4; i++) {
    cutStartTimer(kernel);
    complex_kernel <<< n_blocks, block_size >>> (a_d, N);
    complex_kernel <<< n_blocks, block_size >>> (a_d2, N);
    cudaThreadSynchronize();
    cutStopTimer(kernel);
    times[i] = cutGetTimerValue(kernel);
    cutResetTimer(kernel);
}

// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
cudaMemcpy(a_h2, a_d2, sizeof(float)*N, cudaMemcpyDeviceToHost);
cudaFree(a_d);
cudaFree(a_d2);

In the preceding code there are two identical cudaMalloc calls, but the measured times are very different:

[ivan@machine]$ ./warm
Time for selecting the device: 3818.165039 ms
Time for the malloc: 0.163000 ms
Time for the 2malloc: 0.026000 ms
Time for call of the kernel #0: 3.207000 ms
Time for call of the kernel #1: 3.166000 ms
Time for call of the kernel #2: 3.165000 ms
Time for call of the kernel #3: 3.165000 ms

The first cudaMalloc takes about six times longer than the second one (0.163 ms vs. 0.026 ms).

Second conclusion – The first cudaMalloc takes longer than the following ones

Maybe this overhead corresponds to some initialization of the device's memory subsystem.
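If the first allocation must not pollute your measurements either, the same warm-up idea applies: perform a throwaway cudaMalloc/cudaFree before timing the allocations you care about. A sketch (warmup_d is a hypothetical scratch pointer introduced here for illustration):

// Throwaway allocation to absorb the first-cudaMalloc overhead.
float *warmup_d;
cudaMalloc((void **) &warmup_d, sizeof(float));
cudaFree(warmup_d);

// The allocations measured below should now show steady-state times.
cutStartTimer(Malloc);
cudaMalloc((void **) &a_d, size);
cutStopTimer(Malloc);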

Third conclusion – The cudaThreadSynchronize calls are very important

The measured times can differ a lot if cudaThreadSynchronize is not used properly. An example:

for (i = 0; i < 4; i++) {
    cutStartTimer(kernel);
    complex_kernel <<< n_blocks, block_size >>> (a_d, N);
    cutStopTimer(kernel);
    times[i] = cutGetTimerValue(kernel);
    cutResetTimer(kernel);
}

The output:

[ivan@machine]$ ./warm
Time for selecting the device: 3829.884033 ms
Time for the malloc: 0.113000 ms
Time for the 2malloc: 0.018000 ms
Time for call of the kernel #0: 0.021000 ms
Time for call of the kernel #1: 0.004000 ms
Time for call of the kernel #2: 0.002000 ms
Time for call of the kernel #3: 0.003000 ms

These tiny times only measure the asynchronous kernel launch, not the execution itself. The corrected code:

for (i = 0; i < 4; i++) {
    cutStartTimer(kernel);
    complex_kernel <<< n_blocks, block_size >>> (a_d, N);
    cudaThreadSynchronize();
    cutStopTimer(kernel);
    times[i] = cutGetTimerValue(kernel);
    cutResetTimer(kernel);
}
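Note that in later CUDA releases cudaThreadSynchronize was deprecated in favor of cudaDeviceSynchronize; the corrected loop then looks like this (same behavior, newer API name):

for (i = 0; i < 4; i++) {
    cutStartTimer(kernel);
    complex_kernel <<< n_blocks, block_size >>> (a_d, N);
    cudaDeviceSynchronize();   // Newer name for cudaThreadSynchronize
    cutStopTimer(kernel);
    times[i] = cutGetTimerValue(kernel);
    cutResetTimer(kernel);
}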

The new output:

[ivan@machine]$ ./warm
Time for selecting the device: 3837.173096 ms
Time for the malloc: 0.159000 ms
Time for the 2malloc: 0.026000 ms
Time for call of the kernel #0: 0.272000 ms
Time for call of the kernel #1: 0.243000 ms
Time for call of the kernel #2: 0.237000 ms
Time for call of the kernel #3: 0.235000 ms

I hope this post helps you with the NVIDIA CUDA delays and overheads, and with understanding time measurements on this platform.

2 comments on “How to make GOOD measurements in NVIDIA CUDA?”

  1. Regarding the third conclusion, I want to add that this happens because kernel calls are asynchronous: execution on the CPU side continues immediately. You have to use cudaThreadSynchronize to make the CPU wait until the kernel execution finishes.
