Profiling Performance of hybrid applications with Score-P and Vampir

Juckeland, G.; Dietrich, R.


OpenACC aims to provide a relatively easy and straightforward way to describe parallelism for platforms with hardware accelerators. By design, it is also an approach for porting legacy HPC applications to this novel architecture. Such legacy applications in particular, but also newly developed applications that require more resources than a single node can offer, use MPI for inter-node communication and coarse-grained work distribution, and thus become so-called hybrid applications. It is also possible to combine OpenACC with OpenMP on the host side to utilize all resources of a compute node, or even to use all three levels of parallelism concurrently. Tuning application performance for one parallelization paradigm is challenging; adding a second or third level of parallelism introduces a whole new layer of potential performance problems that arise from the interaction of the paradigms. It is, however, possible to extend the previously mentioned profile-guided development to cover this usage scenario as well.
Profiling tools from compiler or accelerator vendors are usually limited to the scheme the product addresses, e.g., only OpenACC or CUDA/OpenCL activity. Almost no vendor tool can record MPI activity, which leaves the programmer in the dark about how well a hybrid application performs across all the levels of parallelism it uses. Research-based performance tools close this gap. HPCToolkit, TAU, and Score-P are the most prominent ones that also offer hardware accelerator support. Of the three, Score-P covers the most parallelization paradigms, can record the most concurrent activity and, as a result, can provide the most complete performance picture even for very complex applications. Therefore, Score-P is used as the example performance recording tool in this chapter; the other tools can provide similar results. Vampir is used to visualize the performance data since it is by far the most capable trace visualizer and profile generator.

  • Book chapter
    Farber, Rob: OpenACC - Parallel Programming with OpenACC, Amsterdam: Elsevier, 2016, 978-0-12-410397-9, 55-68