Different approaches for profiling your CUDA application

If GPUs are so amazing, why isn't everybody developing software for this architecture? There are several reasons. For one, not every application should be offloaded to a GPU. The first question to ask is whether the code can be parallelized. If it can, that is a good starting point, but problems already arise at this first step. Transferring data between the CPU and the GPU is quite slow, so whether you speed up your code depends on how heavy the computation is relative to the transfers.

However, the main remaining difficulty is the complexity of GPU development. In the early years, when the technology was immature, the GPU worked as a black box for the programmer. Data was transferred, the kernel was executed, and the developer had to look at the output to figure out what was happening on the GPU. When the output was wrong, a very unpleasant debugging job began.
cuda-gdb was almost the only tool that helped with such tasks, but it had a steep learning curve. Nowadays tools are more helpful and come with an intuitive GUI, like Allinea DDT’s.


Correctness is the first challenge a GPU developer faces, but even a correct application can be slower than the CPU version. At that point, efficiency becomes the focus of the work. We need to know which parts of the code are the most time consuming and how much time is spent in transfers versus computation. That is profiling.

First approach

NVIDIA provides two simple ways of measuring time in our applications: timers and events. They are quite similar and easy to understand, as the following example with events shows:

  cudaEvent_t start, stop;
  float time;

  // Events must be created before they can be recorded
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaMalloc((void **) &a_d, size);   // Allocate array on device
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
  cudaEventRecord(start, 0);
  // Do calculation on device:
  square_array <<< n_blocks, block_size >>> (a_d, N);
  cudaEventRecord(stop, 0);
  // Wait until the stop event has actually been reached
  cudaEventSynchronize(stop);

  // Retrieve result from device and store it in host array
  cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
  cudaEventElapsedTime(&time, start, stop);
  printf ("Time for the kernel: %f ms\n", time);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);

The output:

$ ./cuda_app
Time for the kernel: 0.041632 ms

The main advantage of this approach is its simplicity: you don’t need any other tool. On the other hand, you have to create the graphs yourself to present the results. This task is time consuming and, frankly, quite boring.

Second approach

The Barcelona Supercomputing Center has developed some profiling tools: Extrae and Paraver. The first one lets you measure time at different points of the application and the second one focuses on data presentation. The following lines give an example of Extrae calls:

for (i = 1; i <= MAX_ITERS; i++)
{
  Extrae_event (1000, i);   /* mark the start of iteration i */
  [original loop code]
  Extrae_event (1000, 0);   /* mark the end of the iteration */
}

There are many other calls for creating event types, starting and stopping the profiling, network and task monitoring (MPI)… You can also use an XML configuration file to select different types of measurements. Extrae’s output file can be used as Paraver’s input, as you can see in the following figure:


Using the information shown in the picture you can better understand how the application performs on a given piece of hardware.

In conclusion, this approach is better than the first one, but you still need to modify your code and learn how to use both tools, and that is not a trivial task.

Third approach

The last tool reviewed in this post is Allinea Map. This tool tries to ease profiling by avoiding the drawbacks of the first two approaches.

Using it is as simple as the following line:

$ map ./cuda_app datafile.in

Map has a very intuitive GUI that displays a lot of useful information:


Using this tool you can truly understand your application’s performance. It is important to highlight that developers are frequently not specialists but scientists from a wide range of fields: physics, chemistry, biology… For these users Allinea Map shows its main strength: it is both powerful and easy to learn. The next figure illustrates this:

In the main panel you find the source code annotated with performance percentages for both computation and communication time. Developers can focus on just the functions that are the most time consuming.
Another important Allinea Map panel shows CPU versus GPU computation. It reveals opportunities to perform some of the computation on the CPU instead of leaving it idle while the GPU is working.

The charts are not static: a developer can zoom in on any part of a chart to view a shorter period of time and verify whether the application is using the hardware properly, for example by looking for signs of a cache-related bottleneck.
Summing up, here are the key features of Allinea Map:

  • No need to instrument your code
  • Low runtime overhead (around 5%)
  • An intuitive GUI with a gentle learning curve
  • An easy way to find opportunities to speed up your application

As the previous paragraphs show, Allinea Map can become the programmer’s best friend for profiling and tuning GPU applications.