PhD (Work in progress):
Over the last decade, general-purpose graphics processing units (GPGPUs) and field-programmable gate arrays (FPGAs) have emerged as a popular addition to computer systems. Commonly referred to as accelerators, they are components that serve as additional processors. The main argument in favour of such accelerators is their ability to provide higher memory throughput and a much better compute-to-energy ratio than conventional CPU-based systems alone. These properties make them particularly well suited to memory-bound applications such as image processing, and to low-power applications as found in the context of autonomous robots. The main challenge in using such accelerated systems lies in how they need to be programmed, which is dictated by the way the hardware components are integrated. Any computation on an accelerator requires the data to be transferred from the conventional CPU-based host to the accelerator board, processed there, and then transferred back again. State-of-the-art programming frameworks for these devices, such as CUDA and OpenCL, require the programmer to code these memory transfers explicitly. As a consequence, existing programs require massive code refactoring to make use of accelerators. To make matters worse, the performance of accelerated systems is highly sensitive to the way the data transfers between the different levels of memory are orchestrated. Implementing these transfers effectively is necessary to achieve the potential performance gains of accelerators. Unfortunately, any such explicit code usually needs to be tuned specifically for the accelerator hardware in use. The required changes, while often minor when moving a program from one GPU to another, can become massive when moving between GPUs and FPGAs.
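To illustrate the kind of explicit orchestration involved, the following is a minimal CUDA sketch (the kernel and buffer names are hypothetical, chosen only for illustration) of the host-to-device-to-host round trip that a programmer must currently write by hand:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: doubles each element of the buffer in place.
__global__ void scale(float *d_buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_buf[i] *= 2.0f;
}

void run_on_gpu(float *h_buf, int n) {
    float *d_buf;
    size_t bytes = n * sizeof(float);

    // 1. Allocate device memory and copy the input from host to accelerator.
    cudaMalloc(&d_buf, bytes);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);

    // 2. Launch the computation; the block and grid sizes are
    //    typically hand-tuned for the specific GPU in use.
    scale<<<(n + 255) / 256, 256>>>(d_buf, n);

    // 3. Copy the result back to the host and release device memory.
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
}
```

Even this single-buffer version is tied to one programming model: the equivalent OpenCL host code for an FPGA would use an entirely different set of calls (buffer creation, enqueued read/write commands, command queues), which is precisely the refactoring burden described above.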
The aim of this research is to investigate to what extent the orchestration of memory transfers between host and accelerator can be generated by a compiler in a target-architecture-specific way, and what efficiency can be achieved by such an approach. As a starting point, we investigate the effectiveness of various ways of streaming data from host to accelerator and their performance impact on a range of hardware accelerators, including GPUs and FPGAs. In a second step, we intend to automate the choice of code transformations in a target-specific way. We will use several real-world codes to investigate the validity of our assumptions.
My master's project, titled An Investigation into the Performance Portability between Single-Assignment C and OpenCL, builds on the SaC ecosystem (www.sac-home.org), a compiler and accompanying toolchain that compiles MATLAB-like program specifications into code that can be executed on various many-core systems, including graphics processing units (GPUs) programmed using NVIDIA's GPU programming language, CUDA.
Below is the abstract of my submitted thesis:
Parallel computing devices such as multi-core processors, graphics processing units (GPUs), and many-core architectures are becoming ever more prevalent in accelerating compute-intensive applications. As part of a heterogeneous computing system, they offer the potential to improve application performance by targeting specific computations. Despite these benefits, the programming models of these devices are often very different from those of conventional CPUs, making it challenging to develop or port an application to such a device. Programming frameworks such as OpenCL seek to solve this problem by providing a consistent programming interface across computing devices, which has substantially improved the portability of applications. Even so, consistent performance across different devices is not guaranteed; in some instances developers still need to tune their applications to achieve reasonable performance. Possible solutions to this problem exist, such as Single-Assignment C (SaC), which uses compiler technology to automatically optimise an application for a particular compute device. It is unclear, however, whether SaC can bridge this gap in performance portability. The purpose of this dissertation is to investigate the performance portability between OpenCL and SaC.
- programming languages
- compiler technologies
- heterogeneous systems
- multi/many-core processors
- high-performance computing