Alpaka - One Programming Model for Parallel Kernel Acceleration of Heterogeneous Systems


Matthes, A.; Zenker, E.; Worpitz, B.; Widera, R.; Huebl, A.; Juckeland, G.; Knüpfer, A.; Nagel, W.; Bussmann, M.

Alpaka provides a uniform, abstract C++ interface to a range of parallel programming models. It can express multiple levels of parallelism and allows kernels to be programmed generically, either for a single accelerator device or for a single address space with multiple CPU cores. Alpaka's abstraction of parallelization is influenced by and based on the groundbreaking CUDA abstraction of a multidimensional grid of blocks of threads. The four main levels of the execution hierarchy introduced by Alpaka are called the grid, block, thread, and element level. The element level denotes the amount of work a single thread processes sequentially. Together, these levels span an index space called the work division.

Alpaka dictates neither memory containers nor memory iterators; instead, it relies on a simple pointer-based memory model that allows memory buffers to be allocated per device and copied between devices. This model gives the user full control over data structures and their access patterns and is completely data-structure agnostic.

Separating the parallelization abstraction from specific hardware capabilities allows these levels to be mapped explicitly to hardware. The current implementation includes mappings to several programming models, called back-ends, such as OpenMP, CUDA, C++ threads, and Boost.Fiber. Nevertheless, mapping implementations are not limited to these choices and can be extended or adapted for application-specific optimizations. The back-end and work division to use are parameterized per kernel within the user code.

We have demonstrated platform and performance portability with the DGEMM benchmark, which consistently reaches 20% of theoretical peak performance on AMD, Intel, IBM, and NVIDIA hardware, on par with the respective native implementations. Moreover, performance measurements of real-world applications (PIConGPU, HASEonGPU, ISAAC) ported to Alpaka consistently demonstrate that Alpaka can be used to write performance-portable code.

Keywords: Heterogeneous computing; HPC; C++; CUDA; OpenMP; platform portability; performance portability

  • Lecture (others)
    GPU Technology Conference Europe, 28.-29.09.2016, Amsterdam, Netherlands

Permalink: https://www.hzdr.de/publications/Publ-24633