Seminar: Distributed Systems
December 5, 2024, 12:15, room 4070
Szymon Potrzebowski, Mikołaj Wasiak
Load is not what you should balance: Introducing Prequal
The talk presents Prequal, a load balancer for distributed multi-tenant systems that aims to minimize real-time request latency in the presence of heterogeneous server capacities and non-uniform, time-varying antagonist load. Rather than balancing CPU load, Prequal selects servers according to estimated latency and active requests-in-flight (RIF). The authors explore its major design features on a testbed system and evaluate it on YouTube, where it has been deployed for more than two years.
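For intuition, here is a minimal Python sketch of the kind of selection rule summarized above: probe a handful of replicas, treat a replica as "hot" when its requests-in-flight (RIF) exceed a threshold, pick the cold replica with the lowest estimated latency, and fall back to the lowest RIF if every probed replica is hot. The Probe record, field names, threshold choice, and numbers are illustrative assumptions, not Prequal's actual interface; the paper's probing and probe-pool maintenance are considerably more involved.

import random
from dataclasses import dataclass

@dataclass
class Probe:
    server_id: int
    rif: int            # requests-in-flight reported by the probed server
    est_latency: float  # server-side latency estimate at its current load

def select_server(probes, hot_rif_threshold):
    # A probed replica is "hot" if its RIF exceeds the threshold, "cold" otherwise.
    cold = [p for p in probes if p.rif <= hot_rif_threshold]
    if cold:
        # Among cold replicas, pick the one with the lowest estimated latency.
        return min(cold, key=lambda p: p.est_latency).server_id
    # All probed replicas are hot: fall back to the lowest RIF.
    return min(probes, key=lambda p: p.rif).server_id

# Toy usage: probe a random subset of replicas and route one request.
random.seed(0)
probe_pool = [Probe(s, random.randint(0, 30), random.uniform(5.0, 50.0))
              for s in random.sample(range(50), k=16)]
rifs = sorted(p.rif for p in probe_pool)
threshold = rifs[int(0.8 * (len(rifs) - 1))]  # e.g. roughly the 80th percentile
print("route request to server", select_server(probe_pool, threshold))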
You are welcome to attend,
Szymon Potrzebowski
Bibliography:
StreamBox: A Lightweight GPU SandBox for Serverless Inference Workflow
The dynamic workload and latency sensitivity of DNN inference drive a trend toward exploiting serverless computing for scalable DNN inference serving. Usually, GPUs are spatially partitioned to serve multiple co-located functions. However, existing serverless inference systems isolate functions in separate monolithic GPU runtimes (e.g., CUDA contexts), which is too heavy for short-lived, fine-grained functions, leading to high startup latency, a large memory footprint, and expensive inter-function communication. The paper presents StreamBox, a new lightweight GPU sandbox for serverless inference workflows. StreamBox unleashes the potential of GPU streams and realizes them efficiently for serverless inference by implementing fine-grained, auto-scaling memory management, allowing transparent and efficient intra-GPU communication across functions, and enabling PCIe bandwidth sharing among concurrent streams. Evaluations over real-world workloads show that StreamBox reduces the GPU memory footprint by up to 82% and improves throughput by 6.7x compared to state-of-the-art serverless inference systems.

You are welcome to attend,
Mikołaj Wasiak
Bibliography:
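To make the stream-based sandbox idea from the StreamBox abstract concrete, below is a small, hypothetical PyTorch sketch (not StreamBox's API): two toy inference "functions" share a single process and CUDA context and are separated only by CUDA streams, which is the lightweight isolation the paper builds on. The models, sizes, and batch shapes are arbitrary, a CUDA-capable GPU is assumed, and StreamBox's fine-grained memory management, intra-GPU communication, and PCIe bandwidth sharing are not shown.

import torch

# Two toy "inference functions" that would normally each live in their own
# serverless runtime (process + CUDA context). Here they share one process
# and are separated only by CUDA streams.
model_a = torch.nn.Linear(1024, 1024).cuda().eval()
model_b = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 10)
).cuda().eval()

stream_a = torch.cuda.Stream()
stream_b = torch.cuda.Stream()

def run_on_stream(model, batch, stream):
    # Kernels issued inside this context run asynchronously on `stream`,
    # so the two functions can overlap on the same device without paying
    # for a second CUDA context.
    with torch.cuda.stream(stream):
        with torch.no_grad():
            return model(batch)

x_a = torch.randn(32, 1024, device="cuda")
x_b = torch.randn(32, 1024, device="cuda")

out_a = run_on_stream(model_a, x_a, stream_a)
out_b = run_on_stream(model_b, x_b, stream_b)

# Wait for both streams before reading the results on the host.
torch.cuda.synchronize()
print(out_a.shape, out_b.shape)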