Large Language Models are powerful — but deploying them in latency-critical ranking systems is a fundamentally different problem from building chat applications.
At LinkedIn, we scaled LLM-based ranking systems to power high-throughput, low-latency search and recommendation workloads such as Job Search and People Search. These systems must score hundreds of candidates per query under strict latency budgets while serving millions of users globally.
This talk presents how we designed and scaled a prefill-only LLM ranking architecture using SGLang, optimized for throughput, cost efficiency, and predictable latency. I will cover key production lessons, inference optimizations, infrastructure decisions, and the trade-offs required to make LLM ranking viable at scale.
We will also discuss how we upstreamed our production optimizations to the open-source SGLang ecosystem.
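As a rough illustration of the prefill-only pattern the abstract describes — not LinkedIn's implementation — each (query, candidate) pair is scored by a single prefill forward pass with no autoregressive decoding, and candidates are sorted by score. The `Candidate`, `prefill_score`, and `rank` names below are hypothetical, and the token-overlap scorer is a self-contained stand-in for a real model call to an engine such as SGLang:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    id: str
    text: str

def prefill_score(query: str, candidate: Candidate) -> float:
    """Stand-in for a prefill-only forward pass that returns a relevance
    score (e.g. the log-probability of a 'relevant' token). Trivial token
    overlap is used here so the sketch runs without a model."""
    q_tokens = set(query.lower().split())
    c_tokens = set(candidate.text.lower().split())
    return len(q_tokens & c_tokens) / max(len(q_tokens), 1)

def rank(query: str, candidates: list[Candidate]) -> list[Candidate]:
    # Each candidate is scored independently; in a real serving stack these
    # scores would come from one batched prefill pass on the inference server.
    return sorted(candidates, key=lambda c: prefill_score(query, c), reverse=True)
```

Because no tokens are generated, latency is bounded by a single pass over the prompt, which is what makes scoring hundreds of candidates per query under a strict budget tractable.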
Speaker
Sundara Ramachandran
Lead Engineer @LinkedIn, LLM Inference Team; previously worked on Azure Identity & Authorization and Microsoft Office
Sundara Raman Ramachandran is a Lead Engineer on LinkedIn’s LLM Inference team, where he plays a key role in designing and scaling the infrastructure powering LLM-based ranking systems for search and recommendation. His work centers on building latency-critical, high-throughput LLM serving platforms that operate reliably at global scale.
He has driven production deployment of prefill-only LLM scoring systems and contributed extensively to the SGLang open-source ecosystem, including leading the development of the Prefill-Only Ranking API informed by real-world, high-QPS production constraints. Sundara has co-authored research accepted to MLSys 2026, with additional work currently under review at KDD 2026.
Prior to LinkedIn, he worked on Azure Identity & Authorization and Microsoft Office. He holds a Master’s degree from The University of Texas at Austin.