The KV cache is the hidden lever behind LLM inference cost and performance. It directly impacts GPU utilization, throughput, and Time to First Token.
This session explains how the KV cache works, where it resides across the memory hierarchy, how and when it spills, and how cache-aware routing reshapes system design. We will also explore disaggregated prefill, why separating prefill and decode changes fleet architecture, and how these choices influence utilization at scale.
We will walk through vLLM and LMCache and do live tuning to improve performance in real time.
You will leave with a concrete performance model for serving LLMs efficiently and at scale.
Speaker
Khawaja Shams
Co-Founder & CEO @Momento, previously @NASA and @Amazon
Khawaja Shams is a long-time QCon advocate and distributed systems engineer. He received the NASA Early Career Medal for his work on the Mars Rovers, contributing to everything from the onboard cameras to the image-processing pipeline.
At Amazon Web Services, he led DynamoDB and later served as VP of Engineering for AWS Elemental. Today, he is Co-Founder and CEO of Momento, a Series A startup focused on high-performance data infrastructure, and an active contributor to the Valkey open-source ecosystem.