Tuning and optimization for a variety of many-core architectures without changing a single line of implementation code using the Alpaka library

Matthes, Alexander; Widera, René; Zenker, Erik; Worpitz, Benjamin; Huebl, Axel; Bussmann, Michael

We present an analysis of optimizing the performance of a single C++11 source code using the Alpaka hardware abstraction library.
Previous work showed that Alpaka has close-to-zero overhead compared to native implementations and similar relative numerical performance on a variety of many-core platforms. In this work we focus on performance optimization of the general matrix multiplication (GEMM) algorithm using a simple tiling strategy, tuning the tile size and the number of tiles computed in parallel. In addition, we analyze the optimization potential available with vendor-specific compilers when confronted with the heavily templated abstractions of Alpaka.
We specifically tested the code on bleeding-edge architectures such as Nvidia's Tesla P100, Intel's Knights Landing (KNL) and Haswell architectures, and IBM's Power8 system. On some of these we reached almost 50% of the peak floating-point performance using the aforementioned means. When adding compiler-specific #pragmas, we reached 5 TFLOP/s on a P100 and over 1 TFLOP/s on a KNL system.
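
The tiling idea mentioned in the abstract can be illustrated with a minimal sketch in plain C++11. It is not the authors' implementation and does not use the Alpaka API. The names gemmTiled, gemmNaive and TileSize are hypothetical, and square row-major matrices are assumed. The point it tries to show is that the tile size can be a compile-time parameter, so a different value can be chosen per architecture while the loop body stays untouched.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Tiled GEMM sketch: C += A * B for square, row-major n x n matrices.
// TileSize is a compile-time tuning parameter (hypothetical name); changing
// it does not require touching the loop body below.
template <std::size_t TileSize>
void gemmTiled(std::size_t n,
               const std::vector<double>& a,
               const std::vector<double>& b,
               std::vector<double>& c)
{
    static_assert(TileSize > 0, "TileSize must be positive");
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);

    // Each (i0, j0) pair addresses one output tile of C. These tiles are
    // independent of each other, so they are the natural unit to distribute
    // over parallel workers (not done in this serial sketch).
    for (std::size_t i0 = 0; i0 < n; i0 += TileSize)
    {
        for (std::size_t j0 = 0; j0 < n; j0 += TileSize)
        {
            for (std::size_t k0 = 0; k0 < n; k0 += TileSize)
            {
                // Clamp tile bounds so n need not be a multiple of TileSize.
                const std::size_t iEnd = std::min(i0 + TileSize, n);
                const std::size_t jEnd = std::min(j0 + TileSize, n);
                const std::size_t kEnd = std::min(k0 + TileSize, n);

                for (std::size_t i = i0; i < iEnd; ++i)
                {
                    for (std::size_t k = k0; k < kEnd; ++k)
                    {
                        const double aik = a[i * n + k];
                        for (std::size_t j = j0; j < jEnd; ++j)
                        {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}

// Untiled reference used only to check the tiled variant.
inline void gemmNaive(std::size_t n,
                      const std::vector<double>& a,
                      const std::vector<double>& b,
                      std::vector<double>& c)
{
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```

Instantiating gemmTiled<2>, gemmTiled<16> or gemmTiled<64> selects a tile size at compile time. In a tuning study of the kind described here, such a value would be chosen per target platform, alongside the number of tiles processed concurrently, which this serial sketch leaves out.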

Keywords: Heterogeneous computing; HPC; C++; CUDA; OpenMP; Platform portability; Performance portability; Parameter tuning

Contribution to proceedings
2nd International Workshop on Performance Portable Programming Models for Accelerators (P^3MA), 22.06.2017, Frankfurt am Main, Germany
ISC High Performance 2017: High Performance Computing, Vol 10524, 496-514
DOI: 10.1007/978-3-319-67630-2_36
Cited 11 times in Scopus
Lecture (Conference)
2nd International Workshop on Performance Portable Programming Models for Accelerators (P^3MA), 22.06.2017, Frankfurt am Main, Germany

Permalink: https://www.hzdr.de/publications/Publ-25482