This article reviews C/C++ software profiling tools for the Raspberry Pi single-board computer. The topic came up as a spin-off of another article about parallel computing in embedded mobile devices, so there is additional focus on the tools’ suitability for profiling multi-threaded parallel algorithms.
While preparing that other article about optimizing software with OpenMP parallel computing, I ran into the question of which free profiling tools are available for C++ software on the Raspberry Pi (and Pi 2) platform.
Several free profiling tools are available for Raspberry Pi / Raspbian Linux, yet once the interest extends to parallel computing algorithms, it turns out that not every tool operates correctly with multi-threaded / multi-core code. Let’s thus review which profiling tools are available and what their main characteristics are.
Introduction to software profiling
The obvious approach for measuring and comparing software performance under different algorithm variations and/or parameter settings is to run the software through a specific benchmark scenario and measure how much CPU time it required in total. The shorter the CPU time, usually the better the performance.
However, looking at overall CPU execution times doesn’t usually provide detailed enough insight into how the execution time divides within the program, or which parts of the software to improve to make it run faster.
The most useful tool for the software optimization process is thus a software profiler, which identifies how much CPU time gets spent in each specific piece of the code.
Profiling typically works by running alongside the examined software and frequently recording “runtime samples” of which part of the software is currently executing. Sampling can occur at fixed time intervals (e.g. once per millisecond) or upon certain events in the software such as subroutine calls, or the profiler can even simulate the execution of the software in a virtual machine.
Recording a large enough set of execution samples allows generating statistics about how much of the execution time was spent in each function.
Profiling tools for Raspberry Pi / GNU environment
Profiling tools are often specific to certain compiler toolchains, and are sometimes even included with the compiler toolkit itself. Several free alternatives are available for C++ software profiling in the Raspberry Pi / GNU environment:
- gprof – the GNU toolchain’s built-in profiling tool
- gperftools – Google Performance Tools
- valgrind – a profiling tool based on CPU simulation
Let’s have a closer look at these offerings!
gprof
The first obvious profiling tool to look at in a Linux environment is the classic gprof that ships with the GNU toolchain (it is part of binutils, which gets installed alongside GCC).
Availability is an obvious advantage: gprof is likely available wherever gcc is available. This is also the case with Raspbian; gprof is available as soon as you install the gcc compiler.
gprof is also relatively easy to use and adds reasonably little performance overhead during profiling.
gprof usage example
There are a couple of remarks about building a C/C++ program for gprof profiling:
- Compile and link the software with the -pg switch to enable gprof profiling mode
- Add the libtool switch -all-static to force the linker to use static library linkage. That is because the profiling analysis covers only the routines included in the main executable file, so library functions loaded dynamically during program execution will not get included in the profiling results.
Using the same software example case as presented in the parallel computing article, you can build a gprof-enabled version of the SoundTouch example software with the following command:
make -j CXXFLAGS=-pg LDFLAGS=-all-static
To perform the profiling run, execute the software as usual. In the case of this SoundTouch example we can use e.g. the following command line with the desired processing parameters:
time ./soundstretch test.wav /dev/null -pitch=-0.318
This executes the benchmarked software and writes the profiling information into the file “gmon.out”.
A note about using libraries with gprof: above we built the software with all-static library linking so that the library functions get profiled too, which is why the plain shell command is enough for running the profiled program.
If the executable used dynamically linked libraries instead, it would be necessary to invoke gprof through libtool, e.g. as follows:
libtool --mode=execute gprof ./soundstretch
Once the profiled run finishes, generate the profiling analysis by running the gprof tool with the executable as its argument; by default it reads “gmon.out” from the current directory.
This prints a listing of execution times divided by function, which looks for example as below. The function that used the most CPU time is listed at the top, followed by the second most CPU-intensive function, and so on. In this example the most intensive function, TDStretch::calcCrossCorr, used 77.17 seconds in total, which accounted for 63.15% of the overall execution time:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
63.15     77.17    77.17  10660300     0.01     0.01  TDStretch::calcCrossCorr
27.28    110.51    33.34     13192     2.53     2.53  FIRFilter::evaluateFilterStereo
 3.42    114.69     4.18     13192     0.32     0.32  InterpolateCubic::transposeStereo
 1.54    116.57     1.88     35104     0.05     0.05  WavOutFile::write
 1.40    118.28     1.71                              memcpy
 1.39    119.98     1.70     13125     0.13     0.13  WavInFile::read
 1.01    121.21     1.23     10990     0.11     7.13  TDStretch::seekBestOverlapPosition
 0.40    121.70     0.49                              read
 0.12    121.85     0.15     10990     0.01     0.01  TDStretch::overlapStereo
gprof is a non-starter for parallel algorithm profiling
Alas, gprof turns out to have a considerable disadvantage for parallel program optimization in that it does not have a thread-safe implementation. Depending on the environment and how the profiled software uses threads, gprof can thus report invalid function call counts for multi-threaded applications.
That is exactly the case when tuning algorithms with OpenMP optimizations in the Raspberry Pi environment: gprof profiles the original single-threaded SoundStretch software quite properly, but when profiling a version with OpenMP optimizations added, it fails and reports heavily biased activity in incorrect functions.
In summary, gprof can be a handy tool for detecting performance hot-spots in single-threaded applications, and it could be used for initial performance analysis before OpenMP optimizations if no other profiler were available, but it’s pretty useless for profiling already optimized parallel programs.
Let’s thus look further!
gperftools
Gperftools is a nice and fast profiling tool provided by Google. Gperftools operates by time-based sampling and can also profile multi-threaded applications correctly, so it’s a good match for this OpenMP inspection.
Gperftools can be installed on Raspbian with the following command:
apt-get install google-perftools libgoogle-perftools-dev
Notice: If your Raspbian is based on the Wheezy release, it may be necessary to switch the Raspbian apt source to the more recent Jessie release, as the older Wheezy release doesn’t contain all required libraries. Missing library information can cause the profiling tool to output hexadecimal instruction addresses instead of properly formatted function names.
If you face this issue, change the apt source release to Jessie by editing the file “/etc/apt/sources.list” and performing a full apt-get upgrade. This is at your own risk, of course!
Using Gperftools requires a special build of the profiled software:
- Insert the function calls ProfilerStart("soundstretch.prof") and ProfilerStop() around the profiled sections of the program. In the case of the parallel programming article example, these function calls were inserted into SoundStretch main.cpp, just before and after the main processing loop. This of course implies adding #include "gperftools/profiler.h" at the beginning of that source code file.
- Include the additional profiler library in the build, and force static linking of libraries with the switch -all-static. If static linking were not used, the functions loaded from dynamically linked libraries would not get listed among the profiling results. Again in the case of the SoundTouch software example, you can create such a build with the following command:
make -j LDFLAGS="-all-static -lprofiler"
After executing the profiled software, generate the profiling analysis by running the script “google-pprof”, for example:
google-pprof --text ./soundstretch soundstretch.prof
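The output looks as below in this example case. Again, the most intensive function is at the top of the list, followed by the next most intensive function, and so on. Each row lists how many samples were detected within the given function and what share of the total execution time that represents. With gperftools’ default sampling rate of 100 samples per second, each sample stands for 0.01 seconds, so the 12183 samples in total correspond to roughly 122 seconds, in line with the gprof total above: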
Total: 12183 samples
    7713  63.3%  63.3%     7713  63.3% soundtouch::TDStretch::calcCrossCorr
    3366  27.6%  90.9%     3366  27.6% soundtouch::FIRFilter::evaluateFilterStereo
     399   3.3%  94.2%      399   3.3% soundtouch::InterpolateCubic::transposeStereo
     182   1.5%  95.7%      182   1.5% WavInFile::read
     175   1.4%  97.1%      175   1.4% WavOutFile::write
     155   1.3%  98.4%      155   1.3% memcpy
     106   0.9%  99.3%      106   0.9% soundtouch::TDStretch::seekBestOverlapPositionFull
      51   0.4%  99.7%       51   0.4% read
Gperftools reports valid results also for multi-threaded parallel algorithms. Being a lightweight and relatively easy-to-use tool, it thus became the tool of choice for the referred embedded OpenMP parallel programming examination!
valgrind
Valgrind is a sophisticated performance and memory analysis tool that simulates the software execution on a virtual processor and can thus provide an in-depth analysis of what is happening within the software, even at the hardware level.
Despite being quite impressive technology, Valgrind has a couple of pitfalls in the Raspberry Pi environment:
- Due to the virtual machine simulation approach, Valgrind runs much slower than native execution. The Raspberry Pi isn’t exactly a rocket to begin with, and under Valgrind execution becomes several tens of times slower still. Your mileage may vary of course, but a slowdown of this scale may make it unfeasible for practical use.
- In the case of an OpenMP-optimized parallel algorithm, Valgrind again reports rather biased results, indicating that a very large share of time is spent in OpenMP overhead functions rather than in the actual algorithms. This is likely because the virtual machine simulates “parallel” execution by running the threads sequentially, one thread at a time, on a single virtual processor.
These characteristics made Valgrind unsuitable at least for OpenMP parallel optimization use.
If you want to give it a try nevertheless, you can install it with the following command:
sudo apt-get install valgrind kcachegrind
oprofile
Oprofile is not available as a ready-made package for Raspbian, but it can be downloaded and compiled from the source code distribution after installing a few prerequisite packages (libpopt-dev, binutils-dev, libiberty-dev). Installing these packages again required switching the APT package tool to use the Jessie repository instead of Wheezy.
However, even though Oprofile v1.0 compiles successfully on Raspbian, it immediately reports that it is not compatible with the Raspberry Pi CPU version or kernel. Getting it to work would apparently require at minimum a kernel recompilation, and even then it might not work in the Raspberry Pi environment.
Either way, it is clearly not straightforward to get it working on the Raspberry Pi, so let’s deem it not the tool of choice for the faint-hearted.
Of course, please feel free to report if you disagree with this statement and have got it working 🙂
Summary
Four profiling tools for Raspberry Pi / Pi 2 running Raspbian were reviewed, with the following conclusions:
- gperftools – Lightweight and easy-to-use tools that report correct results also for multi-threaded parallel algorithms. Clearly the tool of choice for the related OpenMP parallel optimization exercise!
- gprof – GNU classic that is available wherever the gcc compiler tools are. gprof is useful for ordinary single-threaded execution analysis, but it is not thread-safe in the Raspberry Pi / Raspbian environment, so it reports corrupted results for multi-threaded parallel algorithms. gprof might still be used for initial detection of hot-spot routines prior to parallel optimization where other tools are not available.
- valgrind – Sophisticated in-depth execution analysis tool that works by simulating a virtual processor. This approach, however, makes it very slow in the Raspberry Pi environment. The simulation also runs on a single virtual processor, so Valgrind does not produce realistic profiling figures for multi-threaded parallel algorithms.
- oprofile – We got this tool compiled for Raspberry Pi, but it does not support the Raspberry Pi CPU or Raspbian kernel, so it remained a bit of a mystery whether it is even theoretically feasible for Raspberry Pi use.
P.S. Profiler tools for other environments
Besides the most prominent open-source profiling tools for the Linux-based Raspberry Pi considered above, a plethora of other profiling tools exists for other CPU/OS environments under varying licensing models. Please see Wikipedia for a comprehensive list of profiling tools for different platforms.
Just to suggest a few good alternatives for the Windows environment:
- Microsoft Visual Studio features an integrated profiling tool that can come in handy when developing software on the Windows platform. Please notice that profiling tool support requires a Visual Studio Professional or Enterprise version; it is not included in the free Visual Studio Express edition.
- Other venerable performance analysis tools for the Windows platform are Intel VTune and AMD CodeAnalyst. Both of these are also available for x86 Linux.