Profile Applications in CUDA


In this post, I will introduce how to use the tool nvprof to profile your CUDA applications.


It is a good practice to dive deeper to see how much time each kernel or each CUDA runtime API takes when you want to optimize your applications.


It is not good to use any external timing framework since on the host side, what you measure is the time that the host lunches device functions. Lunches are asynchronous so that to measure the device execution time you have to force to synchronize and make sure the kernels has finished at the point you do measurements. However, since there are streams. Those synchronizations may affect your actual performance.

Actually, Nvidia has provided official tools to do applications profiling. The nvprof is a light-weight profiling tool which enables you to collect and view profiling data from the command-line.

In the next part of this post, I will introduce the basic usage of nvprof.

Basic Usage

Say I want to profile my kernel networks inference application.

According to this post, you have to switch to admin.

root@zhoun14: sudo su

You may need to manually find your CUDA nvprof. Simply call nvprof ./myApp to start the profiling.

root@zhoun14:# /usr/local/cuda-10.1/bin/nvprof ./cmake-build-debug/run_inferencer -m FR.onnx -o last_layer.txt
model = FR.onnx
output = last_layer.txt
==15630== NVPROF is profiling process 15630, command: ./cmake-build-debug/run_inferencer -m FR.onnx -o last_layer.txt
==15630== Warning: Profiling results might be incorrect with current version of nvcc compiler used to compile cuda app. Compile with nvcc compiler 9.0 or later version to get correct profiling results. Ignore this warning if code is already compiled with the recommended nvcc version 
==15630== Profiling application: ./cmake-build-debug/run_inferencer -m FR.onnx -o last_layer.txt
==15630== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   93.34%  68.179ms       604  112.88us  1.1200us  16.896ms  [CUDA memcpy HtoD]
                    2.37%  1.7291ms        37  46.731us  6.4000us  170.98us  sgemm_32x32x32_NN_vec
                    1.07%  785.00us        54  14.537us  1.8560us  115.94us  im2col_gpu_kernel
                    0.77%  565.73us       260  2.1750us     768ns  12.704us  fill_kernel
                    0.62%  455.17us         8  56.896us  53.761us  66.496us  maxwell_sgemm_128x64_nn
                    0.57%  414.82us        79  5.2500us  1.6960us  34.241us  void fast_batchnorm_kernel<float>(float*, float*, float*, float*, float*, float*, float, unsigned int, unsigned int, unsigned int)
                    0.49%  354.40us         7  50.628us  5.9520us  62.977us  sgemm_32x32x32_NN
                    0.24%  173.31us         1  173.31us  173.31us  173.31us  void gemv2T_kernel<int, int, float, float, float, int=128, int=16, int=4, int=4, bool=0, cublasGemvParams<cublasGemvTensorStridedBatched<float const >, cublasGemvTensorStridedBatched<float>, float>>(float const )
                    0.18%  133.95us        24  5.5810us  1.7920us  28.576us  zeropad_kernel
                    0.16%  117.22us        24  4.8840us  1.9840us  11.232us  void fast_add_kernel<float>(float**, float*, unsigned int, unsigned int)
                    0.14%  101.60us        25  4.0640us  1.3120us  23.360us  void fast_leaky_relu_kernel<float>(float*, float*, float, int)
                    0.02%  16.096us         1  16.096us  16.096us  16.096us  maxwell_sgemm_128x128_nn
                    0.02%  13.728us         8  1.7160us  1.0560us  6.0160us  [CUDA memset]
                    0.00%  3.2640us         1  3.2640us  3.2640us  3.2640us  void splitKreduce_kernel<float, float, float>(cublasSplitKParams<float>, float const *, float const *, float*, float const *, float const *)
                    0.00%  1.9520us         1  1.9520us  1.9520us  1.9520us  add_bias_kernel
                    0.00%  1.4720us         1  1.4720us  1.4720us  1.4720us  [CUDA memcpy DtoH]
      API calls:   63.75%  349.74ms       235  1.4882ms     762ns  247.18ms  cudaFree
                   22.87%  125.45ms       606  207.01us  1.4810us  109.92ms  cudaMalloc
                   12.60%  69.095ms       605  114.21us  3.2550us  15.990ms  cudaMemcpy
                    0.39%  2.1482ms       522  4.1150us  3.1760us  33.269us  cudaLaunchKernel
                    0.25%  1.3453ms       285  4.7200us      95ns  201.08us  cuDeviceGetAttribute
                    0.05%  292.69us         3  97.563us  95.337us  101.15us  cuDeviceTotalMem
                    0.04%  214.00us      2046     104ns      76ns     623ns  cudaGetLastError
                    0.03%  156.25us         3  52.082us  50.296us  53.649us  cuDeviceGetName
                    0.01%  53.811us       467     115ns      84ns     402ns  cudaPeekAtLastError
                    0.01%  27.846us         8  3.4800us  2.8330us  6.8830us  cudaMemsetAsync
                    0.00%  20.851us        55     379ns     308ns  1.1510us  cudaGetDevice
                    0.00%  11.753us         9  1.3050us     998ns  2.3830us  cudaEventQuery
                    0.00%  11.078us        16     692ns     364ns  5.4220us  cudaEventCreateWithFlags
                    0.00%  7.5690us         9     841ns     664ns  1.6170us  cudaEventRecord
                    0.00%  3.6160us        11     328ns     211ns     936ns  cudaDeviceGetAttribute
                    0.00%  2.5050us         1  2.5050us  2.5050us  2.5050us  cudaSetDevice
                    0.00%  2.4690us         1  2.4690us  2.4690us  2.4690us  cuDeviceGetPCIBusId
                    0.00%  2.4460us         5     489ns     131ns  1.2070us  cuDeviceGetCount
                    0.00%  1.4280us         4     357ns     209ns     634ns  cuDeviceGet
                    0.00%  1.1030us         2     551ns     548ns     555ns  cuInit
                    0.00%     772ns         2     386ns     351ns     421ns  cuDriverGetVersion
                    0.00%     513ns         3     171ns     167ns     176ns  cuDeviceGetUuid

We have the following observations:

  1. The most expensive activity is memcpy from host to device. Since we have to copy all the weights of kernel networks from cpu to the global memory in GPU;
  2. cudaFree and cudaMalloc dominates the process. It takes nearly one thousand times longer time than the memcpy;
  3. The convolution kernels gemm and im2col takes longer time compared to other operations, which is a common sense.


nvprofile manual
Nvidia tips

Leave a Reply

Your email address will not be published. Required fields are marked *