Boosting scientific computing applications through leveraging data parallel architectures
More about the book
Recent years have imposed major challenges on computational scientists, as the features of computer architectures used for numerical simulations are changing quickly. A few years ago, supercomputers had a rather homogeneous design that could be programmed adequately with, e.g., MPI or even shared-memory models. This homogeneity was first broken in the mid-2000s with the introduction of multi-core chips. In addition to multi-cores, vendors started to introduce instruction set extensions for vector operations, which offer significant compute performance at lower power. In the last five years, accelerators such as co-processors or GPUs, which push data parallelism to its limits, have become a widely used option in supercomputers. However, these developments rapidly increase complexity and heterogeneity and thus reduce the ease of use of the system, which turns programming at (application-)peak level into a tough job. To tackle this challenge, this thesis investigates codes reflecting different algorithmic patterns. Starting from kernel functions and library routines from numerical linear algebra, namely cache-efficient matrix multiplication and LU decomposition (TifaMMy), high-performance Linpack (HPL), and scalable eigenvalue solvers (ELPA), it derives methods for programming simulation codes so that they use a given hardware as efficiently as possible. Afterwards, these techniques are applied to real-life numerical simulation codes: SeisSol, an earthquake simulation code requiring small and sparse matrix kernels; SG++, a framework for high-dimensional approximations supporting data mining tasks and solving PDEs; and finally ls1 mardyn, a molecular dynamics code. This analysis is performed with a focus on standard x86 CPUs, GPUs, and the Intel MIC Architecture.
By taking the latter in particular into account while comparing all these architectures, this thesis pioneers an application-driven analysis of emerging multi- and many-core architectures. In any case, the tuning task is not limited to rewriting an application to support a particular instruction set; it also includes redesigning an algorithm entirely, based on a performance engineering approach. Switching to hardware-aware implementations always yields single-digit speed-ups, and sometimes even two- or three-digit ones. Such improvements allow computational scientists either to run bigger simulations or to achieve shorter turnaround times.