Does friendship exist after a relationship?

Muerte a contraluz, originally uploaded by Ivan_Lopez.

The title, although it may sound like a joke, is a topic my friends rarely agree on. Many think it is better to make a clean break: put away the gifts, take down their photos, delete them from Tuenti/Facebook and never meet up again.

I am not like that. I have been lucky that my relationships have ended more or less well, without cheating or anything really serious. So it would hurt me to know that the other person had erased me completely from her life. It is like denying what happened. It is like saying "I was wrong". If I was with you, it is because you matter to me. Even though we can no longer be together, I still care about you. I would like everything to go well for you, and I want you to know that you can still count on me. Often, when we are with someone, we do not realize it, but we keep stealing time from our friends. That is why, after a break-up, we are more defenseless than usual. If someone has hurt you and you have nobody to talk to about it… you can call me!

Even shortly after breaking up, I think meeting the other person is beneficial, although it is a double-edged sword. The first few times will be VERY HARD. You will turn over every word that comes out of her mouth. You will be devastated, seeing that things are no longer what they were. But it is the easiest way to turn the page. Each following time will get easier, and you will have a friend again. A very valuable friend, because she knows you very well after everything you have been through together.

As for those who choose never to see their exes again: whenever they run into them for whatever reason, at a party or wherever, it hits them like an axe blow. And it stays that way for a long time. In the end, time heals everything, but more slowly. And along the way you will have lost a person who is probably worth keeping; after all, you once decided to be with her, didn't you?

We also have to be realistic. I like being optimistic, but that does not excuse me from being objective. First of all, you can never go back to how things were before. Something is always lost along the way: some of that complicity, some of that magic, chemistry or whatever you want to call it. It is also true that each person may need a different amount of time before resuming the friendship. Some need to take a break, or even be with other people, before they can talk naturally with the other person again.

As I told someone not long ago: the world of advice is like a supermarket. You browse around and only pick up what you want to take with you.

That is the philosophy behind this blog. It helps me get out everything I carry inside: all the ideas buzzing around in my head, half-finished conversations with my friends, comments, private messages… in the end I like putting them down here. It is like throwing bottles into the sea with a message inside.

You do not know who might receive them, but you keep hoping that someone will. That it might give someone a new point of view, help them through a bad moment or bring a smile to their face 😉

How to make GOOD measurements in NVIDIA CUDA?

It would be good to read the previous post about measuring time on NVIDIA CUDA platforms.

In this post you will find some information about warming up the device and the delays of some instructions.

Some people think that the first kernel call should not be measured because it takes longer than the following ones.

But, how much longer?

Simple kernels

With simple kernels like the following one:

// Kernel that executes on the CUDA device
__global__ void simple_kernel(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        a[idx] += a[idx] * a[idx];
    }
}

The output:

[ivan@machine]$ ./warm
Time for call of the kernel #0: 0.089000 ms
Time for call of the kernel #1: 0.053000 ms
Time for call of the kernel #2: 0.051000 ms
Time for call of the kernel #3: 0.051000 ms

The difference between the first call and the following ones seems significant: the first call takes about 68% longer, roughly 0.04 ms more in this case.

Complex kernels

But what will happen with a more complex kernel like this one?

// Kernel that executes on the CUDA device
__global__ void complex_kernel(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        for (int i = 0; i < 1000; i++) {
            a[idx] += a[idx] * a[idx];
        }
    }
}

The output:

[ivan@machine]$ ./warm
Time for call of the kernel #0: 1.624000 ms
Time for call of the kernel #1: 1.595000 ms
Time for call of the kernel #2: 1.589000 ms
Time for call of the kernel #3: 1.595000 ms

As you can see, the first call again takes about 0.03 ms more. But this time it represents only 1.8% more than the following executions!

The warm-up is not here

Is it necessary to avoid measuring the first kernel call? In my opinion, a delay of about 0.03 ms is too small to be worth worrying about.

However, you will probably find that some instructions in your CUDA code take much longer than expected. Why? Let's look at some examples:

cutStartTimer(select);
cudaSetDevice(0);
cudaThreadSynchronize();
cutStopTimer(select);

float *a_h, *a_d;                   // Pointers to host & device arrays
const int N = 10000;                // Number of elements in arrays
size_t size = N * sizeof(float);
a_h = (float *)malloc(size);        // Allocate array on host
// Initialize host array and copy it to CUDA device
for (int i = 0; i < N; i++) a_h[i] = (float)i;
int block_size = 4;
int n_blocks = N/block_size + (N%block_size == 0 ? 0 : 1);

cutStartTimer(Malloc);
cudaMalloc((void **) &a_d, size);   // Allocate array on device
cutStopTimer(Malloc);

cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
// Do calculation on device:
float times[4];                     // Measured time of each kernel call
int i;
for (i = 0; i < 4; i++) {
    cutStartTimer(kernel);
    complex_kernel <<< n_blocks, block_size >>> (a_d, N);
    cudaThreadSynchronize();
    cutStopTimer(kernel);
    times[i] = cutGetTimerValue(kernel);
    cutResetTimer(kernel);
}

// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
cudaFree(a_d);

printf (" Time for selecting the device: %f ms\n", cutGetTimerValue(select));
printf ("           Time for the malloc: %f ms\n", cutGetTimerValue(Malloc));
for (i = 0; i < 4; i++)
    printf ("Time for call of the kernel #%d: %f ms\n", i, times[i]);

As you can see, several instructions are being measured:
  • The cudaSetDevice
  • The cudaMalloc
  • The kernel calls

The output:

[ivan@machine]$ ./warm
 Time for selecting the device: 3816.443115 ms
           Time for the malloc: 0.116000 ms
Time for call of the kernel #0: 1.620000 ms
Time for call of the kernel #1: 1.596000 ms
Time for call of the kernel #2: 1.585000 ms
Time for call of the kernel #3: 1.588000 ms

The cudaSetDevice instruction takes about 3.5 s! That is more than 2000 times longer than a kernel execution! So let's comment out the cudaSetDevice:

cutStartTimer(select);
//cudaSetDevice(0);
//cudaThreadSynchronize();
cutStopTimer(select);

And its corresponding output:

[ivan@machine]$ ./warm
 Time for selecting the device: 0.000000 ms
           Time for the malloc: 3824.311035 ms
Time for call of the kernel #0: 1.618000 ms
Time for call of the kernel #1: 1.595000 ms
Time for call of the kernel #2: 1.593000 ms
Time for call of the kernel #3: 1.588000 ms

What happened? Now that the cudaSetDevice is gone, the cudaMalloc instruction takes almost 4 s!

First conclusion – The first CUDA instruction takes a long time

The first CUDA instruction of a program (cudaSetDevice, a kernel launch, cudaMalloc) will take about 3.5 s more than usual. This overhead occurs only once, on the first CUDA instruction.
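If you want your real measurements to be clean, a simple workaround (my own sketch, not something measured above) is to pay this one-time cost with a harmless call before anything you time; `cudaFree(0)` is commonly used for this:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    // Harmless call whose only effect here is to force CUDA
    // initialization, absorbing the one-time ~3.5 s overhead up front.
    cudaFree(0);

    // ... everything timed after this point starts from a warm context ...
    printf("CUDA initialized\n");
    return 0;
}
```

With this in place, your timers for cudaMalloc, transfers and kernels should no longer absorb the start-up cost.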

But is that the only overhead we can find in a CUDA program? No. Let's see what happens if we measure some cudaMalloc calls:

cutStartTimer(Malloc);
cudaMalloc((void **) &a_d, size);    // Allocate first array on device
cudaThreadSynchronize();
cutStopTimer(Malloc);
cutStartTimer(secondMalloc);
cudaMalloc((void **) &a_d2, size);   // Allocate second array on device
cudaThreadSynchronize();
cutStopTimer(secondMalloc);

cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
cudaMemcpy(a_d2, a_h2, size, cudaMemcpyHostToDevice);
// Do calculation on device:
int i;
for (i = 0; i < 4; i++) {
    cutStartTimer(kernel);
    complex_kernel <<< n_blocks, block_size >>> (a_d, N);
    complex_kernel <<< n_blocks, block_size >>> (a_d2, N);
    cudaThreadSynchronize();
    cutStopTimer(kernel);
    times[i] = cutGetTimerValue(kernel);
    cutResetTimer(kernel);
}

// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
cudaMemcpy(a_h2, a_d2, sizeof(float)*N, cudaMemcpyDeviceToHost);
cudaFree(a_d);
cudaFree(a_d2);

In the preceding code there are two identical cudaMalloc calls, but the measured times are very different:

[ivan@machine]$ ./warm
Time for selecting the device: 3818.165039 ms
Time for the malloc: 0.163000 ms
Time for the 2malloc: 0.026000 ms
Time for call of the kernel #0: 3.207000 ms
Time for call of the kernel #1: 3.166000 ms
Time for call of the kernel #2: 3.165000 ms
Time for call of the kernel #3: 3.165000 ms

The first cudaMalloc takes about six times longer than the second one.

Second conclusion – The first cudaMalloc takes longer than the following ones

Maybe this overhead is spent performing some bus initialization.
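Whatever the cause, the same warm-up trick applies one level down: do a throwaway allocation before the ones you want to time. A minimal sketch (dummy_d is a hypothetical name I introduce here, not from the code above):

```cuda
#include <cuda_runtime.h>

int main(void)
{
    // Throwaway allocation to absorb the first-cudaMalloc overhead.
    // dummy_d is just a scratch pointer for this sketch.
    float *dummy_d;
    cudaMalloc((void **) &dummy_d, sizeof(float));
    cudaFree(dummy_d);

    // ... cudaMalloc calls measured from here on should show only
    // the ordinary, much smaller allocation time ...
    return 0;
}
```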

Third conclusion – The cudaThreadSynchronize calls are very important

The measured times can differ a lot if cudaThreadSynchronize is not used properly. An example:

for (i = 0; i < 4; i++) {
    cutStartTimer(kernel);
    complex_kernel <<< n_blocks, block_size >>> (a_d, N);
    // missing cudaThreadSynchronize()!
    cutStopTimer(kernel);
    times[i] = cutGetTimerValue(kernel);
    cutResetTimer(kernel);
}

The output:

[ivan@machine]$ ./warm
Time for selecting the device: 3829.884033 ms
Time for the malloc: 0.113000 ms
Time for the 2malloc: 0.018000 ms
Time for call of the kernel #0: 0.021000 ms
Time for call of the kernel #1: 0.004000 ms
Time for call of the kernel #2: 0.002000 ms
Time for call of the kernel #3: 0.003000 ms

The corrected code:

for (i = 0; i < 4; i++) {
    cutStartTimer(kernel);
    complex_kernel <<< n_blocks, block_size >>> (a_d, N);
    cudaThreadSynchronize();
    cutStopTimer(kernel);
    times[i] = cutGetTimerValue(kernel);
    cutResetTimer(kernel);
}

The new output:

[ivan@machine]$ ./warm
Time for selecting the device: 3837.173096 ms
Time for the malloc: 0.159000 ms
Time for the 2malloc: 0.026000 ms
Time for call of the kernel #0: 0.272000 ms
Time for call of the kernel #1: 0.243000 ms
Time for call of the kernel #2: 0.237000 ms
Time for call of the kernel #3: 0.235000 ms

I hope this post helps you understand the delays, overheads and time measurements on the NVIDIA CUDA platform.

How to measure time in NVIDIA CUDA?

This post will show you some points about how to measure time in CUDA.

Reading the CUDA documentation you will find two ways:

  • cutStartTimer(myTimer)
  • Events

Events are a bit more sophisticated and, if your code uses streams and asynchronous transfers, you must use them. But how can you know whether a code is asynchronous or not?

For a code to be asynchronous, the programmer must create streams and transfer the input data to the device using the instruction:

cudaMemcpyAsync

In conclusion, if the code contains no calls like ‘cudaStreamCreate’ or ‘cudaMemcpyAsync’, you can assume it is synchronous (which simplifies the measurements).
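For reference, this is roughly what an asynchronous code looks like. This is my own minimal sketch (the kernel and buffer names are made up for the example); note that for cudaMemcpyAsync to really overlap, the host buffer must be page-locked, allocated with cudaMallocHost:

```cuda
#include <cuda_runtime.h>

__global__ void dummy_kernel(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] += 1.0f;
}

int main(void)
{
    const int N = 10000;
    size_t size = N * sizeof(float);
    float *a_h, *a_d;
    cudaMallocHost((void **) &a_h, size);   // page-locked host memory
    cudaMalloc((void **) &a_d, size);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // The copy and the launch are queued in the stream and return
    // immediately; the CPU is free until the synchronize below.
    cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream);
    dummy_kernel <<< (N + 255) / 256, 256, 0, stream >>> (a_d, N);
    cudaStreamSynchronize(stream);          // wait for the whole stream

    cudaStreamDestroy(stream);
    cudaFree(a_d);
    cudaFreeHost(a_h);
    return 0;
}
```

If you see this pattern in a code you need to measure, use events.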

Measuring with the cut{Start|Stop}Timer

It is very important to use cudaThreadSynchronize() to avoid erroneous measurements, because kernel launches return control to the CPU immediately.

The code is below:

uint kernelTime;
cutCreateTimer(&kernelTime);
cutResetTimer(kernelTime);

cudaMalloc((void **) &a_d, size);   // Allocate array on device
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
cutStartTimer(kernelTime);
// Do calculation on device:
square_array <<< n_blocks, block_size >>> (a_d, N);
cudaThreadSynchronize();
cutStopTimer(kernelTime);

// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
cudaFree(a_d);
printf ("Time for the kernel: %f ms\n", cutGetTimerValue(kernelTime));

The output:

[ivan@machine]$ ./timer
Device name : Tesla C2050
Time for selecting the device: 3423.731934 ms
Time for the kernel: 0.068000 ms

Measuring with events

Events are more precise; an example of use is below:

  cudaEvent_t start, stop;
  float time;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaMalloc((void **) &a_d, size);   // Allocate array on device
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
  cudaEventRecord(start, 0);
  // Do calculation on device:
  square_array <<< n_blocks, block_size >>> (a_d, N);
  cudaEventRecord(stop, 0);
  cudaEventSynchronize(stop);

  // Retrieve result from device and store it in host array
  cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
  cudaFree(a_d);
  cudaEventElapsedTime(&time, start, stop);
  printf ("Time for the kernel: %f ms\n", time);

The output:

[ivan@machine]$ ./event
Device name : Tesla C2050
Time for selecting the device: 3819.466064 ms
Time for the kernel: 0.041632 ms

In short

If the code is synchronous you can use either of the two approaches introduced above. And if the kernel is a bit more complex, the differences between them become smaller:

[ivan@garoe tiempo]$ ./timer
Device name : Tesla C2050
Time for selecting the device: 3824.726074 ms
Time for the kernel: 1.619000 ms

[ivan@garoe tiempo]$ ./event
Device name : Tesla C2050
Time for selecting the device: 3859.903076 ms
Time for the kernel: 1.600992 ms