Abstract
ChatGPT has grown from a research preview into one of the largest and most demanding AI applications in production. At that scale, performance engineering is not just about making GPU inference faster. A single user request may involve client work, networking, authentication, CPU-side orchestration, conversation loading, context assembly, tokenization, truncation, model routing, inference, server-side token processing, streaming, and observability.
This keynote shares lessons from scaling ChatGPT’s performance under two simultaneous forces: rapid product growth and rapidly accelerating development velocity. As agentic development with tools like Codex helps teams ship more features, experiments, migrations, and product paths faster, the system also absorbs more logic, more data fetches, more resource consumption, and more opportunities for performance regressions.
The talk will cover how we make performance visible across the full user-facing critical path, why AI application latency is not just a GPU problem, and how performance engineering is evolving for the agentic era. The north star is a performance engineering process that is increasingly operated by agents: richer telemetry, agent-friendly tools, reusable performance skills, Slack-based workflows, automated alert investigation, deeper observability across layers, and agents that can analyze data, inspect regressions, summarize likely causes, and propose fixes.
The goal is not to slow down product development. The goal is to make the performance feedback loop as fast as the development loop.
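As a rough illustration of "making performance visible across the full user-facing critical path," each stage of a request can be timed as a named span so the slowest step surfaces first. This is a minimal sketch, not OpenAI's actual stack; `RequestTrace` and the stage names are hypothetical, and a real system would use a tracing library such as OpenTelemetry in this role.

```python
import time
from contextlib import contextmanager

# Hypothetical request-scoped span recorder. In production, a tracing
# library (e.g. OpenTelemetry) would collect these spans instead.
class RequestTrace:
    def __init__(self):
        self.spans = {}  # stage name -> duration in seconds

    @contextmanager
    def span(self, stage):
        start = time.monotonic()
        try:
            yield
        finally:
            self.spans[stage] = time.monotonic() - start

    def summary(self):
        # Sort stages by time spent so the dominant stage comes first.
        return sorted(self.spans.items(), key=lambda kv: -kv[1])

trace = RequestTrace()
with trace.span("load_conversation"):
    time.sleep(0.01)   # stand-in for a data fetch
with trace.span("tokenize"):
    time.sleep(0.002)  # stand-in for CPU-side work
with trace.span("inference"):
    time.sleep(0.05)   # stand-in for the model call

slowest_stage, _ = trace.summary()[0]
print(slowest_stage)
```

Even in this toy form, the summary makes the point of the talk concrete: the user waits on the whole path, and only instrumentation reveals which stage dominates.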
Main Takeaways:
- AI application performance is full critical-path performance. GPU inference matters, but users wait on the entire product path: client work, CPU orchestration, data fetching, context assembly, tokenization, routing, inference, and streaming.
- Agentic development changes the performance problem. When agents help teams ship faster, performance regressions can accumulate faster too. The issue is not just individual code quality; it is the increased throughput of product and infrastructure change.
- The future of performance engineering is increasingly agent-operated. Performance teams need to move from periodic, human-driven investigation to continuous, agent-assisted operations: automated alerts, richer observability, reusable analysis skills, Slack bots, agent-readable tools, and agents that help detect, explain, and fix regressions.
Interview:
What is the focus of your work these days?
My work is focused on leading ChatGPT Performance at OpenAI: making ChatGPT faster, more reliable, more efficient, and easier for engineering teams to build on. That sits at the intersection of AI infrastructure, cloud infrastructure, platform engineering, performance engineering, reliability, and developer productivity — a thread that runs through my current work at OpenAI, my prior AI inference work as VP of Engineering at Parasail, and my earlier years leading large-scale performance engineering at Netflix.
These days, that work has two connected layers. The first is the applied performance engineering of ChatGPT itself: understanding the full user-visible path across clients, networking, CPU-side orchestration, data fetching, conversation loading, context assembly, tokenization, inference, streaming, and observability. The second is evolving how performance engineering works in the agentic era. As teams ship faster with AI-assisted development, we need performance feedback loops that are faster, more automated, and easier for both humans and agents to use.
A major part of my focus is making performance engineering more agent-friendly: improving observability across layers, exposing tools through stable and machine-readable interfaces, creating reusable performance skills and playbooks, integrating workflows into Slack, collecting richer performance data, and building systems where agents can help investigate alerts, inspect dashboards, analyze profiles, summarize likely causes, and propose fixes. That connects directly to the areas I care about more broadly: agent-driven development, autonomous workflows, skill engineering, platform engineering, observability, profiling, flame graphs, developer experience, and making complex systems easier to operate and improve.
What was the motivation behind your talk?
The motivation is that AI has changed both sides of performance engineering.
On the product side, AI applications like ChatGPT have extremely complex critical paths. The user may think they are waiting for “the model,” but the system often has to reconstruct product state, load conversation context, prepare data, route the request, and manage streaming before the experience feels responsive.
On the engineering side, agentic development is increasing the rate of change. Teams can ship faster, but that also means performance regressions, resource growth, and system complexity can accumulate faster. The old model of performance engineering — humans manually noticing regressions, digging through dashboards, and running occasional cleanup efforts — does not scale to that pace.
This talk is about the performance engineering response: make performance visible, make tools agent-friendly, collect enough data for agents to reason over, and move toward an operating model where agents continuously help detect, investigate, explain, and fix regressions.
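One way to picture an "agent-friendly tool" is a function that returns structured, machine-readable output an agent can reason over, rather than a dashboard screenshot. The sketch below is entirely hypothetical: `RegressionReport`, `check_endpoint`, the endpoint name, and the 10% threshold are illustrative assumptions, not OpenAI's tooling.

```python
from dataclasses import dataclass, asdict
import json

# Hypothetical agent-callable tool: compares an endpoint's current p95
# latency against a baseline and emits a structured report an agent can
# parse, summarize, and act on.
@dataclass
class RegressionReport:
    endpoint: str
    baseline_p95_ms: float
    current_p95_ms: float

    @property
    def regression_pct(self):
        delta = self.current_p95_ms - self.baseline_p95_ms
        return 100.0 * delta / self.baseline_p95_ms

    def to_json(self):
        payload = asdict(self)
        payload["regression_pct"] = round(self.regression_pct, 1)
        return json.dumps(payload)

def check_endpoint(endpoint, baseline_p95_ms, current_p95_ms,
                   threshold_pct=10.0):
    """Return a report only when p95 latency regressed past the threshold."""
    report = RegressionReport(endpoint, baseline_p95_ms, current_p95_ms)
    return report if report.regression_pct > threshold_pct else None

report = check_endpoint("/conversation/load",
                        baseline_p95_ms=120.0, current_p95_ms=180.0)
print(report.to_json())
```

The design choice matters more than the details: a stable, typed interface with JSON output lets the same tool serve a human in Slack, an automated alert investigator, or an agent proposing a fix.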
Who is your session for?
This session is for senior engineers, staff-plus engineers, engineering managers, architects, SREs, performance engineers, observability teams, AI platform teams, and developer productivity leaders building or operating AI products at scale.
It will be especially relevant for teams that are moving AI systems from prototype to production, adopting coding agents internally, or trying to understand how performance engineering changes when both user demand and software delivery velocity accelerate. No deep ML background is required; the focus is on production systems, scalability, observability, performance engineering, and the operating model needed to keep AI applications fast.
Speaker
Martin Spier
ChatGPT Performance @OpenAI, Engineering Leader, AI and Cloud Infrastructure, Platform Engineering, Performance and Reliability, Ex-Netflix.
Martin Spier leads ChatGPT Performance at OpenAI, where he works on making ChatGPT faster, more reliable, and easier for engineering teams to operate at scale. His work spans AI and cloud infrastructure, performance engineering, observability, reliability, platform engineering, and developer productivity.
Before OpenAI, Martin was VP of Engineering at Parasail, working on AI inference infrastructure, and previously led large-scale performance engineering work at Netflix. He has also created and contributed to open-source performance tools including FlameScope, Vector, multiple performance visualizations, and real user monitoring systems for understanding and optimizing production systems at scale.