Microsoft studied LLM inference workloads at scale – The authors analyzed LLM serving for Microsoft Office 365, which handles over 10 million daily requests across multiple data‑center regions, revealing a mix of latency‑sensitive and latency‑insensitive tasks with diverse SLA requirements [0].
Current siloed GPU pools cause under‑utilization – Existing deployments separate fast and slow tasks onto distinct GPU resource pools, leaving significant accelerator capacity idle because actual request volumes rarely match the fixed allocations [0] (see the toy calculation below).
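To see why static silos waste capacity, consider a toy calculation (the numbers are invented here, not taken from the paper): two pools each sized for their own class's peak demand accumulate far more idle GPU‑hours than a single pool sized for the combined peak.

```python
# Toy model (illustrative only): hourly GPU demand for two task classes.
# Numbers are made up; the paper's workloads are far larger and more varied.
latency_sensitive = [40, 55, 70, 90, 80, 60, 45, 35]    # GPUs needed per hour
latency_insensitive = [60, 45, 30, 20, 25, 40, 55, 65]

# Siloed pools: each class gets a fixed pool sized for its own peak.
silo_capacity = max(latency_sensitive) + max(latency_insensitive)

# Shared pool: sized for the peak of the *combined* demand.
combined = [a + b for a, b in zip(latency_sensitive, latency_insensitive)]
shared_capacity = max(combined)

hours = len(combined)
silo_idle = silo_capacity * hours - sum(combined)
shared_idle = shared_capacity * hours - sum(combined)

print(f"siloed pools: {silo_capacity} GPUs, {silo_idle} idle GPU-hours")
print(f"shared pool : {shared_capacity} GPUs, {shared_idle} idle GPU-hours")
```

With these made‑up traces the silos need 155 GPUs and sit on 425 idle GPU‑hours, while a shared pool covers the same demand with 110 GPUs and 65 idle GPU‑hours, because the two classes peak at different times.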
SageServe introduces multi‑timescale control – The proposed framework routes requests across data centers on short timescales while scaling GPU VMs and placing models over longer horizons, driven by traffic forecasts and an Integer Linear Programming (ILP) optimizer [0]; a sketch of the longer‑horizon decision follows.
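The paper's exact formulation is not reproduced here, but a minimal sketch of this kind of ILP, written with the PuLP solver and entirely hypothetical demand, throughput, and quota numbers, shows the shape of the longer‑horizon decision: how many GPU VMs to run per region and model to cover forecast demand at minimum cost.

```python
# Minimal ILP sketch (assumptions, not the paper's formulation): pick integer
# VM counts per (region, model) to cover forecast demand at minimum cost.
# Requires: pip install pulp
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpStatus, value

regions = ["us-east", "eu-west", "asia-se"]          # hypothetical regions
models = ["model-a", "model-b"]                      # hypothetical models

# Hypothetical inputs: forecast demand (requests/s) per (region, model).
forecast = {("us-east", "model-a"): 120, ("us-east", "model-b"): 40,
            ("eu-west", "model-a"): 80,  ("eu-west", "model-b"): 60,
            ("asia-se", "model-a"): 50,  ("asia-se", "model-b"): 30}
throughput_per_vm = {"model-a": 25, "model-b": 15}   # requests/s per VM
vm_quota = {"us-east": 12, "eu-west": 10, "asia-se": 8}  # VMs per region

prob = LpProblem("vm_scaling", LpMinimize)
vms = {(r, m): LpVariable(f"vms_{r}_{m}", lowBound=0, cat="Integer")
       for r in regions for m in models}

# Objective: minimize total VMs, a stand-in for GPU-hour cost.
prob += lpSum(vms.values())

# Cover the forecast demand for every (region, model) pair.
for (r, m), demand in forecast.items():
    prob += vms[(r, m)] * throughput_per_vm[m] >= demand

# Respect each region's VM quota.
for r in regions:
    prob += lpSum(vms[(r, m)] for m in models) <= vm_quota[r]

prob.solve()
print(LpStatus[prob.status])
for key, var in sorted(vms.items()):
    print(key, int(value(var)))
```

Running this with CBC (PuLP's bundled solver) prints an optimal integer VM count per (region, model) pair; the production system would additionally encode tail‑latency SLAs, model‑placement costs, and migration constraints.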
Evaluation shows up to 25 % GPU‑hour savings – Simulations and real‑world runs on 10 million production requests across three regions and four open‑source models reduced GPU‑hour consumption by as much as 25 % versus the baseline deployment [0].
Auto‑scaling waste drops 80 %, saving $2.5 M monthly – By avoiding inefficient auto‑scaling decisions, SageServe cut GPU‑hour wastage by 80 %, translating into potential monthly cost savings of up to $2.5 million while still meeting tail‑latency SLAs [0]; the sketch below illustrates the underlying effect.
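As a hedged illustration of where auto‑scaling waste comes from (synthetic trace and policies, not SageServe's algorithm): a reactive scaler that chases the last observed load with fixed headroom over‑provisions on the way down and lags on the way up, while a forecast‑driven scaler tracks demand closely.

```python
# Illustrative comparison on a synthetic hourly demand trace (made-up data):
# waste = provisioned GPUs minus GPUs actually needed, summed over time.
import math

demand = [30, 35, 50, 90, 85, 60, 40, 30, 28, 45, 70, 55]  # GPUs needed

HEADROOM = 1.5  # reactive policy over-provisions by 50 %

def reactive(trace):
    # Scale to the *previous* step's demand plus headroom (one-step lag).
    return [math.ceil(trace[max(i - 1, 0)] * HEADROOM)
            for i in range(len(trace))]

def forecast_driven(trace):
    # Idealized forecaster: next-step demand predicted within 10 %.
    return [math.ceil(d * 1.1) for d in trace]

def wasted_gpu_hours(provisioned, trace):
    return sum(max(p - d, 0) for p, d in zip(provisioned, trace))

for name, policy in [("reactive", reactive), ("forecast", forecast_driven)]:
    plan = policy(demand)
    print(f"{name:9s} waste: {wasted_gpu_hours(plan, demand)} GPU-hours")
```

On this made‑up trace the reactive policy wastes roughly 4–5x the GPU‑hours of the forecast‑driven one, and it also under‑provisions during ramps, which is the tail‑latency side of the same problem.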
Study is among the first public Internet‑scale analyses – This work represents one of the earliest publicly available characterizations of large‑scale LLM serving workloads, offering insights for cloud providers worldwide [0].