Learning‑Based Semantic Caching Cuts LLM Inference Costs

LLM inference cost hampers scalability: Large language models deliver powerful capabilities, but each inference incurs high computational expense, creating sustainability and scaling challenges for services that rely on them [1].

Exact‑match caches miss semantically similar queries: Traditional caching returns stored responses only for identical prompts, ignoring queries that differ in wording yet convey the same intent, leading to unnecessary recomputation [1].
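To make the limitation concrete, here is a minimal sketch of an exact‑match cache (illustrative only, not code from the paper): any paraphrase of a cached prompt falls through to a full LLM call.

```python
# Sketch, not from the paper: an exact-match cache keyed on the raw prompt
# string. Paraphrased queries with the same intent miss the cache and
# trigger a fresh, costly LLM call.
cache: dict[str, str] = {}

def answer(prompt: str, llm_call) -> str:
    if prompt in cache:              # hit only on byte-identical prompts
        return cache[prompt]
    response = llm_call(prompt)      # expensive inference
    cache[prompt] = response
    return response

# "What is the capital of France?" and "France's capital city?" share intent,
# but under exact matching the second still pays for a full LLM call.
```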

Semantic caching introduces a novel eviction problem: Retrieving cached answers based on semantic similarity requires accounting for mismatch costs between new queries and stored responses, a fundamentally different cache‑replacement consideration [1].
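A rough sketch of similarity‑based lookup is shown below; the cosine metric, fixed threshold, and linear scan are illustrative assumptions, and the paper's actual eviction and mismatch‑cost handling is not reproduced here.

```python
# Illustrative sketch (not the paper's implementation): serve a stored answer
# when the query embedding is close enough to a cached entry. The similarity
# threshold trades recomputation cost against the mismatch cost of returning
# a slightly-off answer. Query embeddings are assumed to be precomputed.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.entries: list[tuple[np.ndarray, str]] = []  # (embedding, response)
        self.threshold = threshold

    def lookup(self, query_emb: np.ndarray) -> str | None:
        best_sim, best_resp = -1.0, None
        for emb, resp in self.entries:
            sim = cosine(query_emb, emb)
            if sim > best_sim:
                best_sim, best_resp = sim, resp
        # Serve from cache only if the semantic mismatch is small enough.
        return best_resp if best_sim >= self.threshold else None

    def insert(self, query_emb: np.ndarray, response: str) -> None:
        self.entries.append((query_emb, response))
```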

Query arrival probabilities and serving costs are initially unknown: Effective cache management must therefore learn these key parameters over time, as they are not known a priori in real‑world deployments [1].

The problem is modeled as a combinatorial multi‑armed bandit: The authors develop both an offline optimization formulation and an online learning formulation within this framework, yielding algorithms with provable efficiency and state‑of‑the‑art performance guarantees [1].
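For intuition only, the following is a minimal sketch of a generic combinatorial UCB‑style policy over a size‑k cache. It illustrates the bandit framing with a toy arrival and savings model; it is an assumption, not the authors' algorithm.

```python
# Hedged sketch of a generic combinatorial UCB policy (not the paper's method):
# each round, keep in the size-k cache the items whose optimistic estimate of
# expected serving-cost savings is largest, then update estimates from feedback.
import math
import random

def select_cache(counts, mean_savings, t, k):
    """Pick k item indices by upper-confidence-bound score."""
    def ucb(i):
        if counts[i] == 0:
            return float("inf")                    # force initial exploration
        bonus = math.sqrt(2.0 * math.log(max(t, 2)) / counts[i])
        return mean_savings[i] + bonus
    return sorted(range(len(counts)), key=ucb, reverse=True)[:k]

def simulate(n_items=20, k=5, horizon=2000, seed=0):
    rng = random.Random(seed)
    true_savings = [rng.random() for _ in range(n_items)]  # unknown to learner
    counts = [0] * n_items
    mean_savings = [0.0] * n_items
    cached = []
    for t in range(1, horizon + 1):
        cached = select_cache(counts, mean_savings, t, k)
        for i in cached:                                    # observe noisy savings
            reward = true_savings[i] + rng.gauss(0, 0.1)
            counts[i] += 1
            mean_savings[i] += (reward - mean_savings[i]) / counts[i]
    return sorted(cached)

print(simulate())
```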

Synthetic evaluation shows competitive or superior results: Experiments on a synthetically generated dataset show that the proposed methods match or outperform existing baseline approaches [1].

Links