Seminar: Distributed Systems
December 5, 2024, 12:15, room 4070
Szymon Potrzebowski, Mikołaj Wasiak



Load is not what you should balance: Introducing Prequal



The talk presents Prequal, a load balancer for distributed multi-tenant systems that aims to minimize real-time request latency in the presence of heterogeneous server capacities and non-uniform, time-varying antagonist load. Rather than balancing CPU load, Prequal selects servers according to estimated latency and active requests-in-flight (RIF). The authors explore its major design features on a testbed system and evaluate it on YouTube, where it has been deployed for more than two years.
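As a rough illustration of that selection idea (a sketch only, not the exact rule from the paper), the fragment below picks among a few probed replicas using their reported latency and RIF. The Probe type, pick_server, and the rif_threshold value are hypothetical names and parameters introduced for this example:

from dataclasses import dataclass

@dataclass
class Probe:
    server: str
    latency_ms: float  # estimated latency reported by the probed replica
    rif: int           # active requests-in-flight on that replica

def pick_server(probes, rif_threshold=5):
    # Hypothetical rule for illustration: replicas with many requests in
    # flight are treated as overloaded and compared by RIF; the rest are
    # compared by estimated latency.
    cold = [p for p in probes if p.rif <= rif_threshold]
    if cold:
        return min(cold, key=lambda p: p.latency_ms).server
    return min(probes, key=lambda p: p.rif).server

# Example: three probed replicas with heterogeneous load
probes = [Probe("a", 12.0, 3), Probe("b", 4.5, 9), Probe("c", 7.0, 2)]
print(pick_server(probes))  # -> "c": modest latency and few requests in flight

The point of the sketch is only to show why probing latency and RIF together can steer traffic away from replicas that a pure CPU-load balancer would still consider healthy.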

You are cordially invited,
Szymon Potrzebowski



Bibliography:





StreamBox: A Lightweight GPU SandBox for Serverless Inference Workflow



The dynamic workload and latency sensitivity of DNN inference drive a trend toward exploiting serverless computing for scalable DNN inference serving. Usually, GPUs are spatially partitioned to serve multiple co-located functions. However, existing serverless inference systems isolate functions in separate monolithic GPU runtimes (e.g., CUDA contexts), which are too heavy for short-lived and fine-grained functions, leading to high startup latency, a large memory footprint, and expensive inter-function communication. In this paper, we present StreamBox, a new lightweight GPU sandbox for serverless inference workflows. StreamBox unleashes the potential of streams and efficiently realizes them for serverless inference by implementing fine-grained and auto-scaling memory management, allowing transparent and efficient intra-GPU communication across functions, and enabling PCIe bandwidth sharing among concurrent streams. Our evaluations over real-world workloads show that StreamBox reduces the GPU memory footprint by up to 82% and improves throughput by 6.7X compared to state-of-the-art serverless inference systems.
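For intuition, the fragment below sketches the stream-per-function idea using PyTorch's CUDA stream API. It is not StreamBox's interface; the two co-located "functions" sharing one process and GPU context, and the toy models, are assumptions made only for illustration:

import torch

def run_inference(model, batch, stream):
    # Each co-located "function" runs on its own CUDA stream inside one
    # shared GPU context, instead of a separate monolithic CUDA context.
    with torch.cuda.stream(stream):
        return model(batch)

if torch.cuda.is_available():
    stream_a, stream_b = torch.cuda.Stream(), torch.cuda.Stream()
    model_a = torch.nn.Linear(128, 10).cuda()   # stands in for function A's model
    model_b = torch.nn.Linear(128, 10).cuda()   # stands in for function B's model
    x = torch.randn(32, 128, device="cuda")
    out_a = run_inference(model_a, x, stream_a)
    out_b = run_inference(model_b, x, stream_b)
    torch.cuda.synchronize()                    # wait for both streams to finish

Because both streams live in the same context, the functions avoid per-context startup cost and can in principle exchange GPU-resident data without crossing context boundaries, which is the property StreamBox builds on.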

You are cordially invited,
Mikołaj Wasiak



Bibliography: