The KV cache is the hidden lever behind LLM inference cost and performance. It directly impacts GPU utilization, throughput, and Time to First Token.
This session explains how the KV cache works, where it resides across the memory hierarchy, how and when it spills, and how cache-aware routing reshapes system design. We will also explore disaggregated prefill, why separating prefill and decode changes fleet architecture, and how these choices influence utilization at scale.
We will walk through vLLM and LMCache and do live tuning to improve performance in real time.
You will leave with a concrete performance model for serving LLMs efficiently and at scale.
Speaker
Khawaja Shams
Co-Founder & CEO @Momento, previously @NASA and @Amazon
Khawaja Shams is a long-time QCon advocate and distributed systems engineer. He received the NASA Early Career Medal for his work on the Mars Rovers, contributing to everything from the onboard cameras to the image-processing pipeline.
At Amazon Web Services, he led DynamoDB and later served as VP of Engineering for AWS Elemental. Today, he is Co-Founder and CEO of Momento, a Series A startup focused on high-performance data infrastructure, and an active contributor to the Valkey open-source ecosystem.