How to make GOOD measurements in NVIDIA CUDA?

Before reading this, it would be good to read the previous post about how to measure time on NVIDIA CUDA platforms.

In this post you will find some information about warming up the device and the delays of some instructions.

Some people think that the first kernel call should not be measured because it takes longer than the following ones.

But, how much longer?

Simple kernels

With a simple kernel like the following one:

// Kernel that executes on the CUDA device
__global__ void simple_kernel(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        a[idx] += a[idx] * a[idx];
    }
}

The output:

[ivan@machine]$ ./warm
Time for call of the kernel #0: 0.089000 ms
Time for call of the kernel #1: 0.053000 ms
Time for call of the kernel #2: 0.051000 ms
Time for call of the kernel #3: 0.051000 ms

The difference between the first call and the following ones seems to be important: the first call spends about 68% more time than the later ones. In this case that is about 0.04 ms more.

Complex kernels

But what happens with a more complex kernel like this one?

// Kernel that executes on the CUDA device
__global__ void complex_kernel(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        for (int i = 0; i < 1000; i++) {
            a[idx] += a[idx] * a[idx];
        }
    }
}

The output:

[ivan@machine]$ ./warm
Time for call of the kernel #0: 1.624000 ms
Time for call of the kernel #1: 1.595000 ms
Time for call of the kernel #2: 1.589000 ms
Time for call of the kernel #3: 1.595000 ms

As you can see, the first call again takes about 0.03 ms more, but now that represents only 1.8% more than the following executions!

The warm-up is not here

Is it necessary to avoid measuring the first kernel call? In my opinion, a delay of about 0.03 ms is too small to be worth worrying about.

However, you will probably find that some instructions in your CUDA code take much longer than expected. Why? Let's see some examples:

unsigned int select, Malloc, kernel;   // CUTIL timer handles
cutCreateTimer(&select);
cutCreateTimer(&Malloc);
cutCreateTimer(&kernel);
float times[4];                        // One measurement per kernel call

cutStartTimer(select);
cudaSetDevice(0);
cudaThreadSynchronize();
cutStopTimer(select);

float *a_h, *a_d;                   // Pointers to host & device arrays
const int N = 10000;                // Number of elements in the arrays
size_t size = N * sizeof(float);
a_h = (float *)malloc(size);        // Allocate array on host
// Initialize host array
for (int i = 0; i < N; i++) a_h[i] = (float)i;
int block_size = 4;
int n_blocks = N/block_size + (N % block_size == 0 ? 0 : 1);

cutStartTimer(Malloc);
cudaMalloc((void **) &a_d, size);   // Allocate array on device
cutStopTimer(Malloc);

cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
// Do calculation on device:
int i;
for (i = 0; i < 4; i++) {
    cutStartTimer(kernel);
    complex_kernel <<< n_blocks, block_size >>> (a_d, N);
    cudaThreadSynchronize();        // Wait until the kernel has finished
    cutStopTimer(kernel);
    times[i] = cutGetTimerValue(kernel);
    cutResetTimer(kernel);
}

// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
cudaFree(a_d);

printf (" Time for selecting the device: %f ms\n", cutGetTimerValue(select));
printf ("           Time for the malloc: %f ms\n", cutGetTimerValue(Malloc));
for (i = 0; i < 4; i++)
    printf ("Time for call of the kernel #%d: %f ms\n", i, times[i]);

As the code shows, three things are being measured:
  • The cudaSetDevice call
  • The cudaMalloc call
  • The kernel calls

The output:

[ivan@machine]$ ./warm
 Time for selecting the device: 3816.443115 ms
           Time for the malloc: 0.116000 ms
Time for call of the kernel #0: 1.620000 ms
Time for call of the kernel #1: 1.596000 ms
Time for call of the kernel #2: 1.585000 ms
Time for call of the kernel #3: 1.588000 ms

The cudaSetDevice instruction takes about 3.8 s! That is huge: more than 2000 times longer than a kernel execution! So let's comment out the cudaSetDevice:

cutStartTimer(select);
//cudaSetDevice(0);
//cudaThreadSynchronize();
cutStopTimer(select);

And its corresponding output:

[ivan@machine]$ ./warm
 Time for selecting the device: 0.000000 ms
           Time for the malloc: 3824.311035 ms
Time for call of the kernel #0: 1.618000 ms
Time for call of the kernel #1: 1.595000 ms
Time for call of the kernel #2: 1.593000 ms
Time for call of the kernel #3: 1.588000 ms

What happened? Now the cudaMalloc takes almost 4 s!

First conclusion – The first CUDA call takes a long time

The first CUDA call of a program (be it cudaSetDevice, a kernel launch or cudaMalloc) takes about 3.8 s more than usual, because it triggers the creation of the CUDA context. This overhead occurs only once, on the first CUDA instruction of the program.

But is this the only overhead we can find in a CUDA program? No. Let's see what happens if we measure two cudaMalloc calls:

cutStartTimer(Malloc);
cudaMalloc((void **) &a_d, size);    // Allocate first array on device
cudaThreadSynchronize();
cutStopTimer(Malloc);
cutStartTimer(secondMalloc);
cudaMalloc((void **) &a_d2, size);   // Allocate second array on device
cudaThreadSynchronize();
cutStopTimer(secondMalloc);

cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
cudaMemcpy(a_d2, a_h2, size, cudaMemcpyHostToDevice);
// Do calculation on device:
int i;
for (i = 0; i < 4; i++) {
    cutStartTimer(kernel);
    complex_kernel <<< n_blocks, block_size >>> (a_d, N);
    complex_kernel <<< n_blocks, block_size >>> (a_d2, N);
    cudaThreadSynchronize();
    cutStopTimer(kernel);
    times[i] = cutGetTimerValue(kernel);
    cutResetTimer(kernel);
}

// Retrieve results from device and store them in host arrays
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
cudaMemcpy(a_h2, a_d2, sizeof(float)*N, cudaMemcpyDeviceToHost);
cudaFree(a_d);
cudaFree(a_d2);

In the preceding code there are two identical cudaMalloc calls, but the measured times are very different:

[ivan@machine]$ ./warm
Time for selecting the device: 3818.165039 ms
Time for the malloc: 0.163000 ms
Time for the 2malloc: 0.026000 ms
Time for call of the kernel #0: 3.207000 ms
Time for call of the kernel #1: 3.166000 ms
Time for call of the kernel #2: 3.165000 ms
Time for call of the kernel #3: 3.165000 ms

The first cudaMalloc takes about six times longer than the second one.

Second conclusion – The first cudaMalloc takes longer than the following ones

Maybe this overhead corresponds to some one-time initialization of the memory subsystem.

Third conclusion – The cudaThreadSynchronize calls are very important

The measured times can differ a lot if cudaThreadSynchronize is not used properly. Remember that kernel launches are asynchronous: the launch returns control to the host immediately, so without a synchronization the timer stops long before the kernel has finished. An example:

for (i = 0; i < 4; i++) {
    cutStartTimer(kernel);
    complex_kernel <<< n_blocks, block_size >>> (a_d, N);
    // Missing cudaThreadSynchronize(): the timer stops while
    // the kernel may still be running on the device
    cutStopTimer(kernel);
    times[i] = cutGetTimerValue(kernel);
    cutResetTimer(kernel);
}

The output:

[ivan@machine]$ ./warm
Time for selecting the device: 3829.884033 ms
Time for the malloc: 0.113000 ms
Time for the 2malloc: 0.018000 ms
Time for call of the kernel #0: 0.021000 ms
Time for call of the kernel #1: 0.004000 ms
Time for call of the kernel #2: 0.002000 ms
Time for call of the kernel #3: 0.003000 ms

The corrected code:

for (i = 0; i < 4; i++) {
    cutStartTimer(kernel);
    complex_kernel <<< n_blocks, block_size >>> (a_d, N);
    cudaThreadSynchronize();
    cutStopTimer(kernel);
    times[i] = cutGetTimerValue(kernel);
    cutResetTimer(kernel);
}

The new output:

[ivan@machine]$ ./warm
Time for selecting the device: 3837.173096 ms
Time for the malloc: 0.159000 ms
Time for the 2malloc: 0.026000 ms
Time for call of the kernel #0: 0.272000 ms
Time for call of the kernel #1: 0.243000 ms
Time for call of the kernel #2: 0.237000 ms
Time for call of the kernel #3: 0.235000 ms

I hope this post helps you with the NVIDIA CUDA delays and overheads, and with understanding time measurements on this platform.


How to measure time in NVIDIA CUDA?

This post will show you some points about how to measure time in CUDA.

Reading the CUDA documentation you can find two ways:

  • cutStartTimer(myTimer)
  • Events

Events are a bit more sophisticated and, if your code uses asynchronous kernels, you must use them. But how can you know whether a code has asynchronous kernels or not?

For a code to be asynchronous, the programmer must create streams and transfer the input data to the device using the instruction:

cudaMemcpyAsync

In conclusion, if the code does not contain any instruction like cudaStreamCreate or cudaMemcpyAsync, you can assume that it is synchronous (which simplifies the measurements).
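That check is easy to automate; a quick sketch with grep (demo.cu is a stand-in here for your real source files):

```shell
# Sample source to scan (stand-in for your real .cu files)
cat > demo.cu <<'EOF'
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream);
EOF

# Any hit from this grep means the code uses the asynchronous API,
# so events (not the cut*Timer functions) should be used for timing.
grep -nE 'cudaStreamCreate|cudaMemcpyAsync' demo.cu
```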

Measuring with the cut{Start|Stop}Timer

It is very important to use cudaThreadSynchronize() to avoid erroneous measurements.

The code is below:

  unsigned int kernelTime;   // CUTIL timer handle
  cutCreateTimer(&kernelTime);
  cutResetTimer(kernelTime);

  cudaMalloc((void **) &a_d, size);   // Allocate array on device
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
  cutStartTimer(kernelTime);
  // Do calculation on device:
  square_array <<< n_blocks, block_size >>> (a_d, N);
  cudaThreadSynchronize();
  cutStopTimer(kernelTime);

  // Retrieve result from device and store it in host array
  cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
  cudaFree(a_d);
  printf ("Time for the kernel: %f ms\n", cutGetTimerValue(kernelTime));

The output:

[ivan@machine]$ ./timer
Device name : Tesla C2050
Time for selecting the device: 3423.731934 ms
Time for the kernel: 0.068000 ms

Measuring with events

Events are more precise; an example of use is below:

  cudaEvent_t start, stop;
  float time;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaMalloc((void **) &a_d, size);   // Allocate array on device
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
  cudaEventRecord(start, 0);
  // Do calculation on device:
  square_array <<< n_blocks, block_size >>> (a_d, N);
  cudaEventRecord(stop, 0);
  cudaEventSynchronize(stop);

  // Retrieve result from device and store it in host array
  cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
  cudaFree(a_d);
  cudaEventElapsedTime(&time, start, stop);
  printf ("Time for the kernel: %f ms\n", time);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);

The output:

[ivan@machine]$ ./event
Device name : Tesla C2050
Time for selecting the device: 3819.466064 ms
Time for the kernel: 0.041632 ms

In short

If the code is synchronous you can use either of the two methods introduced before. If the kernel is a bit more complex, the differences between them become smaller:

[ivan@garoe tiempo]$ ./timer
Device name : Tesla C2050
Time for selecting the device: 3824.726074 ms
Time for the kernel: 1.619000 ms

[ivan@garoe tiempo]$ ./event
Device name : Tesla C2050
Time for selecting the device: 3859.903076 ms
Time for the kernel: 1.600992 ms