LLM inference cost hampers scalability: Large language models deliver powerful capabilities, but each inference incurs high computational expense, creating sustainability and scaling challenges for services that rely on them [1].
Exact‑match caches miss semantically similar queries: Traditional caching stores responses only for prompts that are string‑identical, ignoring queries that differ in wording yet convey the same intent, leading to unnecessary recomputation [1].
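To make the limitation concrete, here is a minimal sketch of an exact‑match cache keyed on the raw prompt string; `call_llm` and `serve` are hypothetical stand‑ins, not APIs from [1]:

```python
# Minimal exact-match cache sketch; call_llm is a hypothetical stub.

def call_llm(prompt: str) -> str:
    # Stand-in for an expensive model inference.
    return f"answer to: {prompt}"

cache: dict[str, str] = {}

def serve(prompt: str) -> str:
    """Serve from cache only when the prompt string matches exactly."""
    if prompt in cache:
        return cache[prompt]        # hit: identical wording only
    response = call_llm(prompt)     # costly recomputation on any rewording
    cache[prompt] = response
    return response

serve("What is the capital of France?")  # computed and cached
serve("Capital of France?")              # same intent, but a cache miss
```

Any rephrasing, however slight, falls through to a full model call, which is the recomputation the semantic approach aims to avoid.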
Semantic caching introduces a novel eviction problem: Retrieving cached answers based on semantic similarity requires accounting for mismatch costs between new queries and stored responses, a fundamentally different cache‑replacement consideration [1].
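One common way to realize such a lookup (a sketch under assumptions, not the authors' exact mechanism) is to embed queries, serve the nearest cached entry when its similarity clears a threshold, and treat the residual dissimilarity as a mismatch cost. The bag‑of‑words embedding and the names `SemanticCache`, `lookup`, and `insert` are illustrative placeholders:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words embedding; real systems use a sentence encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.entries: list[tuple[Counter, str]] = []  # (embedding, response)
        self.threshold = threshold

    def lookup(self, query: str) -> tuple[str | None, float]:
        """Return (response, mismatch_cost) for the closest entry above
        the similarity threshold, or (None, inf) on a miss."""
        q = embed(query)
        best_sim, best_resp = 0.0, None
        for emb, resp in self.entries:
            sim = cosine(q, emb)
            if sim > best_sim:
                best_sim, best_resp = sim, resp
        if best_resp is not None and best_sim >= self.threshold:
            # Crude proxy: mismatch cost grows as similarity falls.
            return best_resp, 1.0 - best_sim
        return None, math.inf

    def insert(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))
```

Serving an above‑threshold near miss trades a small mismatch cost against a full inference, a trade‑off that exact‑match eviction policies never had to weigh.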
Query arrival probabilities and serving costs are initially unknown: Effective cache management must learn these parameters over time, since they are not available a priori in real‑world deployments [1].
The problem is modeled as a combinatorial multi‑armed bandit: The authors develop both offline optimization and online learning formulations within this framework, yielding algorithms with provable efficiency and state‑of‑the‑art performance guarantees [1].
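The following is a minimal sketch of the flavor of algorithm this setup suggests, assuming a cardinality‑constrained cache and semi‑bandit feedback: a UCB‑style index over each query's estimated saving (arrival rate times cost avoided), with the top‑k indices cached each round. The class name, confidence bonus, and feedback model are assumptions, not the algorithm from [1]:

```python
import math
import random

class CacheBandit:
    """UCB-flavored combinatorial bandit sketch: each round, cache the k
    items whose estimated saving (arrival rate x cost avoided) looks best."""

    def __init__(self, n_items: int, k: int):
        self.n, self.k = n_items, k
        self.pulls = [0] * n_items   # times item i was cached
        self.mean = [0.0] * n_items  # running mean of observed saving
        self.t = 0

    def select(self) -> list[int]:
        self.t += 1
        def ucb(i: int) -> float:
            if self.pulls[i] == 0:
                return math.inf      # force exploration of unseen items
            bonus = math.sqrt(2 * math.log(self.t) / self.pulls[i])
            return self.mean[i] + bonus
        return sorted(range(self.n), key=ucb, reverse=True)[:self.k]

    def update(self, cached: list[int], savings: dict[int, float]) -> None:
        # Semi-bandit feedback: observe the saving of each cached item.
        for i in cached:
            r = savings.get(i, 0.0)
            self.pulls[i] += 1
            self.mean[i] += (r - self.mean[i]) / self.pulls[i]

# Toy environment: unknown arrival probabilities, unit serving costs.
rng = random.Random(0)
p = [0.9, 0.5, 0.1, 0.05]
bandit = CacheBandit(n_items=4, k=2)
for _ in range(1000):
    cached = bandit.select()
    savings = {i: 1.0 if rng.random() < p[i] else 0.0 for i in cached}
    bandit.update(cached, savings)
print(bandit.select())  # converges toward the two most popular queries
```

Under a pure cardinality constraint, maximizing the sum of indices reduces to picking the top k, which is what keeps the combinatorial step tractable.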
Synthetic evaluation shows competitive or superior results: Experiments on a generated dataset demonstrate that the proposed methods match or outperform existing baseline approaches [1].