When we started adopting LLMs across DoorDash, every team was implementing the same infrastructure: retry logic, fallback mechanisms, cost tracking, prompt versioning, and batch processing pipelines. Engineering time was wasted on repetitive plumbing work instead of building features. Teams also made different decisions: some used OpenAI directly, others went through Bedrock, some built custom retry logic that didn't handle rate limits properly, and nobody had consistent observability.

We built a set of platform components to address this: an LLM Gateway for request routing, observability, and fallback handling; a Batch Inference platform for processing large-scale workloads; an Agentic Gateway for multi-step LLM workflows; and ADK (Agent Development Kit) templates to standardize common patterns.
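To make the retry and fallback plumbing mentioned above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not DoorDash's gateway: `call_with_fallback`, `RateLimitError`, and the provider callables are hypothetical names, and a real gateway would also handle timeouts, quotas, and observability.

```python
import random
import time
from typing import Callable, Sequence


class RateLimitError(Exception):
    """Hypothetical error a provider client raises when it is throttled."""


def call_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, str]:
    """Try providers in order, retrying rate-limit errors with backoff.

    Returns (provider_name, completion); raises RuntimeError if all fail.
    """
    errors: list[str] = []
    for name, call in providers:
        for attempt in range(max_retries):
            try:
                return name, call(prompt)
            except RateLimitError as exc:
                errors.append(f"{name} attempt {attempt + 1}: {exc}")
                if attempt < max_retries - 1:
                    # Exponential backoff with full jitter before the next retry.
                    sleep(random.uniform(0, base_delay * 2 ** attempt))
            except Exception as exc:
                # Treated as non-retryable here: fall through to the next provider.
                errors.append(f"{name}: {exc}")
                break
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Centralizing a loop like this in one place is what lets rate-limit handling, logging, and cost tracking stay consistent across teams instead of being reimplemented per service.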

This talk covers the technical details and trade-offs. For the LLM Gateway, we'll discuss how we handle rate limiting across providers with different quota models, cost attribution, and how we implemented prompt caching to improve performance. For Batch Inference, we'll explain our job scheduler design that balances cost optimization with SLA requirements. The Agentic Gateway section covers how we handled streaming protocols like MCP, auth for internal and external users, state management, and scaffolding, all of which improve the velocity of building agents at DoorDash. We'll also share our decision framework for when to build shared infrastructure versus letting teams own their solutions, and how we measured whether the platform was actually helping or just adding another layer teams had to learn.

Relevant for platform engineers, ML infrastructure teams, and engineering leaders building or evaluating GenAI infrastructure.

Interview:

What is your session about, and why is it important for senior software developers?

This session is about the messy middle between a GenAI demo and a production product.

At DoorDash, our experience has been that the model call quickly became the easy part. The hard part was everything around it: model access, routing, tools, identity, evals, observability, cost attribution, governance, and optimization. We'll walk through the bets we made as the landscape shifted from model calls, to workflows and evals, to agents and tools.

This matters for senior software developers because GenAI platform decisions are becoming architecture decisions. Teams need to decide what product teams should own, what should be centralized, when to buy, when to build, and how to move fast without turning the platform into a reliability or governance bottleneck.

Why is it critical for software leaders to focus on this topic right now?

GenAI is moving faster than traditional platform planning cycles.

Two years ago, most platform questions were about model access. Today, product teams need support across agents, tools, identity, memory, evals, observability, cost, and governance. If leaders don't make explicit bets, they either centralize too early and slow teams down, or decentralize too long and accumulate duplicated work, unowned spend, and security risk.

The leadership challenge is not "adopt AI faster." It is to preserve product-team velocity while building horizontal primitives reliable enough for production.

What are the common challenges developers and architects face in this area?

The first challenge is choosing the right abstraction level while vendors, models, protocols, and agent patterns are all changing at once.

The second is supporting experimentation without letting every team rebuild the same primitives: identity, tools, memory, evals, observability, and cost attribution.

The third is deciding which friction is worth keeping. Product teams want zero-friction onboarding, but platforms still need governance, budget ownership, and accountable cost attribution.

The fourth is knowing when a vendor product is good enough to learn from, but not durable enough to become the long-term platform surface.

What's one thing you hope attendees will implement immediately after your talk?

Create a one-page GenAI Platform Bets document: tenets, bets, failure modes, and signals to watch.

For each bet, write down: the target customer, the tenet behind the decision, the trade-off being optimized, the failure mode you are trying to avoid, and the signal that would tell you whether the bet is working or needs to change.

The point is not that every bet will be right. The point is to make your assumptions visible enough that when new pressure shows up, you can adapt without thrashing.
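One way to keep such a bets document consistent is to give each bet a fixed shape. The sketch below is only an illustration: the `PlatformBet` class and its field names are hypothetical, derived from the list of items above, and are not a format the speakers prescribe.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PlatformBet:
    """One entry in a hypothetical one-page GenAI Platform Bets document."""

    name: str
    target_customer: str  # who the bet is for
    tenet: str            # the principle behind the decision
    trade_off: str        # what is being optimized, and at what cost
    failure_mode: str     # the outcome the bet is meant to avoid
    signal: str           # what would show the bet is working or needs to change

    def missing_fields(self) -> list[str]:
        """Return the names of fields left blank, so gaps are easy to spot."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]
```

Calling `missing_fields()` on each entry during review makes unstated assumptions visible: a bet with no `signal` has no agreed way to tell whether it is working.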

What makes QCon stand out as a conference for senior software professionals?

QCon is one of the few places where the audience wants the trade-offs, mistakes, and scars, not just the polished architecture diagram.

That is the right audience for this talk. The useful lesson is not "use a gateway" or "use open-weights." It is how senior engineers and leaders make platform bets under uncertainty, keep product teams moving, and evolve the platform when reality changes. The point is not that every bet was right; explicit bets helped us navigate pressure without thrashing.

Speaker

Siddharth Kodwani

Tech Lead, AI Infrastructure @DoorDash

Siddharth Kodwani is a Software Engineer on the GenAI Platform team at DoorDash, building infrastructure for AI agents that accelerates development velocity and improves production reliability. He has spent the last 10 years building AI/ML platforms at Amazon Prime Video, Zoox, and DoorDash, specializing in the infrastructure that enables teams to ship AI-powered features faster.

Speaker

Swaroop Chitlur

Staff Engineer / Engineering Manager, Machine Learning Platform @DoorDash

Swaroop Chitlur leads the Generative AI Platform at DoorDash, building infrastructure for LLM inference, fine-tuning, evals, and AI Agents. He is an engineering leader with 20+ years of experience, including co-founding a hardware startup and being the first backend engineer at Automatic Labs (YC S11, acquired by Sirius XM). Swaroop holds a granted patent and has authored two books: "A Byte of Python" (10M+ downloads, translated into 10+ languages) and "A Byte of Vim".

Building GenAI Platform at DoorDash
