Seminar: Distributed Systems
October 6, 2022, 12:15, room 4070 and online broadcast

Karol Waszczuk

Practical and efficient scheduling algorithm for supercomputers utilizing burst buffers



Limited I/O throughput is an ongoing problem in High Performance Computing (HPC), as it significantly restrains the ever-growing computing capabilities of supercomputers. Burst buffer technology emerged in the early 2010s to address this issue and has become more prevalent in recent years. Burst buffers extend the supercomputer's memory hierarchy with an additional intermediate tier, backed by Solid-State Drive (SSD) and Non-Volatile Memory Express (NVMe) devices, that is meant to improve I/O performance.

Despite the great potential of the burst buffer concept for improving HPC performance, little publicly available work has so far been done on enhancing existing scheduling algorithms and software to support this new technology. As part of our research, we show how an imprecise adoption of burst buffer technology in scheduling software may lead to job delays and starvation, by reviewing the burst buffer integration in the scheduling component of the Slurm workload manager.

As the main subject of our research, we propose a practical, efficient, and starvation-free version of the FIFO-backfilling algorithm for HPC systems utilizing burst buffers. Although it takes into account both processor and burst buffer resources, which can be seen as turning the original 1-dimensional scheduling problem into a 2-dimensional one, the demonstrated algorithm does not increase computational complexity compared to the original variant of backfill, which handles only a single resource. We integrated the proposed algorithm into Slurm and assessed its performance by emulation using a historic HPC workload.
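
To make the two-resource idea concrete, below is a minimal sketch of one scheduling pass of a textbook EASY-style backfill that checks both processors and burst buffer space. It is an illustration only and not the algorithm presented in the talk or the Slurm implementation: the names Job and schedule_step, the field names, and the units are all hypothetical, and burst buffer stage-in and stage-out phases are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    cpus: int      # requested processors
    bb: int        # requested burst buffer space (hypothetical unit: GB)
    walltime: int  # user-provided runtime estimate


def schedule_step(now, free_cpus, free_bb, running, queue):
    """Return the jobs to start at time `now`.

    running: list of (end_time, Job) pairs; queue: FIFO-ordered list of Jobs.
    """
    started = []
    running = list(running)
    pending = list(queue)

    # 1. Plain FIFO: start jobs from the head while both resources suffice.
    while pending and pending[0].cpus <= free_cpus and pending[0].bb <= free_bb:
        job = pending.pop(0)
        free_cpus -= job.cpus
        free_bb -= job.bb
        running.append((now + job.walltime, job))
        started.append(job)
    if not pending:
        return started

    # 2. Reserve the earliest time at which the blocked head job fits
    #    in BOTH dimensions (the "shadow time").
    head = pending[0]
    avail_cpus, avail_bb = free_cpus, free_bb
    shadow_time = None
    for end, job in sorted(running, key=lambda r: r[0]):
        avail_cpus += job.cpus
        avail_bb += job.bb
        if head.cpus <= avail_cpus and head.bb <= avail_bb:
            shadow_time = end
            break
    if shadow_time is None:
        return started  # head never fits; assumed not to happen in practice
    extra_cpus = avail_cpus - head.cpus
    extra_bb = avail_bb - head.bb

    # 3. Backfill later jobs only if they cannot delay the head's reservation.
    for job in pending[1:]:
        if job.cpus > free_cpus or job.bb > free_bb:
            continue
        ends_before = now + job.walltime <= shadow_time
        fits_extra = job.cpus <= extra_cpus and job.bb <= extra_bb
        if ends_before or fits_extra:
            free_cpus -= job.cpus
            free_bb -= job.bb
            if not ends_before:
                extra_cpus -= job.cpus
                extra_bb -= job.bb
            started.append(job)
    return started
```

In this sketch the head of the queue always holds a reservation, which is what prevents it from starving; a job that has enough free processors but not enough free burst buffer space is simply skipped in the backfill pass.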

Our results show that, besides avoiding job starvation, the proposed algorithm in numerous scenarios provides noticeably better job scheduling than the baseline algorithm originally used in Slurm. In over half of the 16 evaluated workload samples, the newly introduced FIFO-backfilling algorithm decreases the mean waiting time and slowdown of jobs by at least 20%, while performing worse than the baseline in only 3 samples.
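
For readers unfamiliar with the two metrics, the sketch below computes them using the commonly used definitions: waiting time is start time minus submit time, and slowdown is (waiting time + runtime) / runtime. The abstract does not state which exact variant was used (for example, bounded slowdown), so these definitions and the function name mean_wait_and_slowdown are assumptions.

```python
def mean_wait_and_slowdown(jobs):
    """jobs: list of (submit, start, runtime) tuples with runtime > 0."""
    waits = [start - submit for submit, start, _ in jobs]
    slowdowns = [(start - submit + run) / run for submit, start, run in jobs]
    return sum(waits) / len(waits), sum(slowdowns) / len(slowdowns)


# Two jobs submitted at t=0 with runtime 10; the second waits 10 time units.
mean_wait, mean_slowdown = mean_wait_and_slowdown([(0, 0, 10), (0, 10, 10)])
```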

You are welcome to attend,
Karol Waszczuk

Bibliography: