Seminar: Distributed Systems
24 April 2025, 12:15, room 4070
Kacper Chętkowski, Filip Głębocki

Using Outer Grid in Halo Exchange procedure of Distributed Data Structures in NUMA domain architectures.

Halo exchange is a technique commonly used in stencil computations when a data structure, typically an N-dimensional array, is split across devices. Such computations are usually divided into alternating computational and communication steps: each computational step is followed by one communication step. A computational step typically requires the values of a certain number of neighbouring cells, and during the communication step neighbouring nodes transfer the newly computed values to each other. However, the communication step is usually more expensive than the computational one. In this thesis, we investigate ways to improve the performance of halo exchange by reducing the number of communication steps while transferring more information in each one. In this way, we aim to reduce synchronization between processes at the cost of slightly higher memory usage and redundant computations on a so-called "Outer Grid". The thesis will consist of an efficient implementation of halo exchange that extends the data structures of the Distributed Ranges library. This should allow the data structure to be used with all algorithms that support the library's concepts-based architecture. Performance will be measured with strong and weak scaling tests of stencil computations on a GPU-based NUMA architecture. We plan to compare the performance obtained with different Outer Grid sizes against computations without the Outer Grid.
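
The trade-off described above can be illustrated with a small, purely sequential sketch. It assumes a 1D periodic array, a 3-point averaging stencil and a halo of width k; all function names are hypothetical and none of this is the Distributed Ranges API. Each segment receives k halo cells per side in one exchange and then performs k local steps, recomputing cells in the shrinking halo instead of communicating after every step.

```python
def stencil_reference(u, steps):
    """Apply a periodic 3-point averaging stencil to the undivided array."""
    n = len(u)
    for _ in range(steps):
        u = [(u[(i - 1) % n] + u[i] + u[(i + 1) % n]) / 3.0 for i in range(n)]
    return u


def exchange_halos(segments, k):
    """Communication step: copy k owned cells from each periodic neighbour."""
    p = len(segments)
    # Snapshot the owned cells first so every copy reads pre-exchange data.
    owned = [seg[k:len(seg) - k] for seg in segments]
    for r, seg in enumerate(segments):
        seg[:k] = owned[(r - 1) % p][-k:]            # left halo
        seg[len(seg) - k:] = owned[(r + 1) % p][:k]  # right halo


def compute_steps(seg, k):
    """Run k local steps; the valid region shrinks by one cell per side per step."""
    m = len(seg)
    for s in range(1, k + 1):
        new = seg[:]
        for i in range(s, m - s):  # redundant work happens inside the halo
            new[i] = (seg[i - 1] + seg[i] + seg[i + 1]) / 3.0
        seg[:] = new


def run_with_outer_grid(u, parts, k, steps):
    """Return the final array and the number of communication steps used."""
    n = len(u) // parts
    assert len(u) == n * parts and steps % k == 0 and 1 <= k <= n
    # Each segment stores its n owned cells plus k halo cells on each side.
    segments = [[0.0] * k + u[r * n:(r + 1) * n] + [0.0] * k for r in range(parts)]
    exchanges = 0
    for _ in range(steps // k):
        exchange_halos(segments, k)
        exchanges += 1
        for seg in segments:
            compute_steps(seg, k)
    result = [x for seg in segments for x in seg[k:len(seg) - k]]
    return result, exchanges
```

In this sketch, running 6 stencil steps with k = 3 needs 2 exchanges instead of 6, while each segment stores 2k extra cells and recomputes some of its neighbours' values.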

I invite you to attend,
Kacper Chętkowski

Optimizing the Halo Exchange procedure for the Distributed Ranges library by parallelizing computation and communication

High Performance Computing (HPC) is pivotal in advancing computational capabilities across various domains. By leveraging efficient resource sharing and parallel processing, HPC technologies attempt to solve complex problems that require significant computational power. This work provides an early evaluation of an experimental extension of Intel's Distributed Ranges library. Data structures distributed across multiple devices play a significant role in HPC. An important operation defined for some of these structures is Halo Exchange, a technique for sharing the contents of such structures between devices. The procedure is commonly split into multiple interleaved computation and communication steps. Currently, each GPU owns exactly one segment of the distributed data structure, and after the computational step the GPUs that need to communicate with each other sit idle, performing no meaningful computation, which wastes time and hurts the scalability of the whole system. This work focuses on mitigating this problem for multi-GPU NUMA architectures. First, we propose a different distribution of data between GPUs that uses more segments per processing unit, so that each device has a "left" and a "right" segment to work on. While the devices perform computations on their left segments, the communication step for the right segments happens in the background, and vice versa. This way, communication causes less or even no idle time for the GPUs, and the whole system performs better. This thesis consists of an implementation of the new Halo Exchange procedure together with the new segment distribution, which optimizes the algorithm in the aforementioned way. The implementation can be used with all algorithms supporting the library's concept-based architecture. Its performance has been measured by evaluating its strong and weak scaling properties on a GPU-based NUMA architecture.
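
The left/right idea can be sketched as a dependency schedule in a small sequential simulation. This is only an illustration under stated assumptions, not necessarily the schedule used in the thesis and not the Distributed Ranges API: the array is 1D and periodic, the stencil is a 3-point average, every device owns one left and one right segment, and a transfer between devices takes exactly one compute phase. The names `run_overlapped`, `post` and `in_flight` are hypothetical. The sides alternate in the order left, right, right, left, and so on, so a transfer posted in one phase is needed only after the next phase has finished.

```python
def periodic_reference(u, steps):
    """Periodic 3-point averaging stencil applied to the undivided array."""
    n = len(u)
    for _ in range(steps):
        u = [(u[(i - 1) % n] + u[i] + u[(i + 1) % n]) / 3.0 for i in range(n)]
    return u


def run_overlapped(u, devices, steps):
    """Simulate two segments per device with transfers hidden behind compute."""
    nseg = 2 * devices  # segment g = 2 * device + side (0 = left, 1 = right)
    n = len(u) // nseg
    assert n >= 1 and len(u) == n * nseg
    seg = [list(u[g * n:(g + 1) * n]) for g in range(nseg)]
    # halo[g][(t, "L" or "R")]: neighbour boundary value of time step t, once arrived
    halo = [dict() for _ in range(nseg)]

    def post(g, t, remote):
        """Publish the boundary cells of segment g at time step t."""
        for nb, key, value in (((g - 1) % nseg, (t, "R"), seg[g][0]),
                               ((g + 1) % nseg, (t, "L"), seg[g][-1])):
            if nb // 2 == g // 2:
                halo[nb][key] = value             # same device: direct access
            else:
                remote.append((nb, key, value))   # other device: asynchronous

    # One blocking exchange at start-up fills the halos for time step 0.
    startup = []
    for g in range(nseg):
        post(g, 0, startup)
    for nb, key, value in startup:
        halo[nb][key] = value

    in_flight = []  # transfers posted during the previous compute phase
    for t in range(1, steps + 1):
        for side in ((0, 1) if t % 2 else (1, 0)):
            # These transfers run in the background of this compute phase.
            arriving, in_flight = in_flight, []
            for g in range(side, nseg, 2):
                # A KeyError here would mean the device had to wait for data.
                ext = [halo[g].pop((t - 1, "L"))] + seg[g] + [halo[g].pop((t - 1, "R"))]
                seg[g] = [(ext[i - 1] + ext[i] + ext[i + 1]) / 3.0
                          for i in range(1, n + 1)]
                post(g, t, in_flight)
            for nb, key, value in arriving:  # delivered only when the phase ends
                halo[nb][key] = value
    return [x for s in seg for x in s]
```

Because every halo value is read only after the phase in which it was delivered, no compute phase in this model waits for communication, yet the result matches the stencil applied to the undivided array.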

I invite you to attend,
Filip Głębocki