Multi-Agent Patterns from Spotify’s AI-Powered Advertising Platform

Spotify's advertising platform spans audience targeting, ad creative generation, campaign goal resolution, budget optimization, and ad campaign recommendations -- workflows that touch different data sources, APIs, and business rules. A single monolithic agent trying to handle all of this produced slow, inconsistent, and hard-to-debug results.

We solved this by decomposing the problem into a multi-agent pipeline: specialized agents for audience resolution, advertiser objective mapping, ad script generation, and campaign optimization that run in parallel, share session state, and compose into sequential stages -- built on Google's Agent Development Kit (ADK) for Java and Vertex AI. The result has opened the door for new advertising use cases to be built as composable agents on the same platform.

This talk is a pattern catalog drawn from that production system, covering the three architectural problems we had to solve -- and that every team decomposing a domain workflow into agents will face:

  1. Where to draw the agent boundary. Not everything should be an agent. We cover the decision framework we use: agents own reasoning (mapping an advertiser's brief to campaign objectives, interpreting audience descriptions into targeting parameters, generating ad scripts from brand guidelines), tools own data access (looking up geographic targets, searching ad categories, fetching historical performance), and application code owns deterministic logic (scoring recommendations, allocating budgets, validating constraints). We show how parallel and sequential composition primitives wire agents into a pipeline where one agent's output feeds the next agent's instruction context (a minimal composition sketch follows this list).

  2. Why tool design is a first-class architectural concern. Tools are not just API wrappers -- they are the interface between the LLM and your existing systems, and their design is prompt engineering. We walk through concrete examples: how the same upstream Ads API is exposed through different tool schemas by different agents because each agent has different reasoning needs; how @Schema descriptions and structured error messages enable LLM self-recovery without retries; and the decision rule for when a tool delegates directly to a shared API client vs. to a private use-case class that orchestrates multiple clients (see the tool sketch below). This grounding in real API data is what eliminates hallucination at the source.

  3. How to make agents reliable through tightly coupled evaluation and tracing. Traditional testing fails for probabilistic agents, and bolting on evaluation as an afterthought leaves blind spots. We show how we built evaluation directly into our agent tracing pipeline so that every production execution can be replayed, inspected, and scored. The eval model operates at three levels: tool trajectory scoring (did the agent call the right tools in the right order?), deterministic response assertions (do critical output fields match expectations?), and LLM-as-judge scoring for semantic quality dimensions that deterministic checks cannot capture. By tying eval tightly to production traces, we catch regressions from prompt changes, model updates, and tool modifications before they reach advertisers -- deterministic checks gate CI, while judge-based scoring runs in nightly experiments for longitudinal quality tracking (the deterministic levels are sketched in the eval example below).
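To make the composition in (1) concrete, here is a minimal sketch using ADK for Java's workflow agents. Treat it as an illustration under assumptions: the agent names, model, instructions, and output keys are invented for this example, and the builder calls follow the public ADK Java samples rather than our production code.

    import com.google.adk.agents.BaseAgent;
    import com.google.adk.agents.LlmAgent;
    import com.google.adk.agents.ParallelAgent;
    import com.google.adk.agents.SequentialAgent;

    public class CampaignPipeline {

      public static BaseAgent build() {
        // Each specialist owns one reasoning task and writes its result to
        // session state under outputKey for downstream agents to read.
        LlmAgent audienceResolver = LlmAgent.builder()
            .name("audience_resolver")
            .model("gemini-2.0-flash")
            .instruction("Interpret the advertiser's audience description "
                + "into concrete targeting parameters.")
            .outputKey("audience")
            .build();

        LlmAgent objectiveMapper = LlmAgent.builder()
            .name("objective_mapper")
            .model("gemini-2.0-flash")
            .instruction("Map the advertiser's brief to campaign objectives.")
            .outputKey("objectives")
            .build();

        // Independent reasoning steps run concurrently...
        ParallelAgent resolutionStage = ParallelAgent.builder()
            .name("resolution_stage")
            .subAgents(audienceResolver, objectiveMapper)
            .build();

        // ...and {audience}/{objectives} are injected from session state, so
        // one stage's output feeds the next agent's instruction context.
        LlmAgent scriptGenerator = LlmAgent.builder()
            .name("script_generator")
            .model("gemini-2.0-flash")
            .instruction("Write an ad script for audience {audience} "
                + "pursuing objectives {objectives}.")
            .build();

        return SequentialAgent.builder()
            .name("campaign_pipeline")
            .subAgents(resolutionStage, scriptGenerator)
            .build();
      }
    }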
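For (2), a sketch of one tool, assuming ADK for Java's FunctionTool and Annotations.Schema APIs. GeoClient is a hypothetical stand-in for a shared Ads API client, and the schema descriptions and error shape are illustrative, not our production schemas.

    import com.google.adk.tools.Annotations.Schema;
    import com.google.adk.tools.FunctionTool;
    import java.util.List;
    import java.util.Map;

    public class GeoTargetTools {

      /** Hypothetical stand-in for a shared Ads API client. */
      interface GeoClient {
        List<String> search(String query);
      }

      private static final GeoClient geoClient = query -> List.of(); // wire the real client here

      // The schema descriptions are prompt engineering: they are what the
      // model reads when deciding whether and how to call this tool.
      @Schema(description = "Look up geographic targets matching a free-text location query.")
      public static Map<String, Object> lookupGeoTargets(
          @Schema(name = "query",
                  description = "The location as the advertiser described it, "
                      + "e.g. 'the Nordics' or 'Austin, TX'")
              String query) {
        List<String> matches = geoClient.search(query);
        if (matches.isEmpty()) {
          // Structured error: name the failure and suggest a recovery path so
          // the model can self-correct instead of blindly retrying.
          return Map.of(
              "status", "not_found",
              "hint", "No geo target matched '" + query + "'. Try a broader "
                  + "region or the official place name.");
        }
        return Map.of("status", "ok", "targets", matches);
      }

      public static FunctionTool asTool() {
        return FunctionTool.create(GeoTargetTools.class, "lookupGeoTargets");
      }
    }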
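For (3), the two deterministic eval levels reduce to simple checks over a replayed trace. The tool names, fields, and trace shape below are illustrative; in the real pipeline they come from production traces.

    import java.util.List;
    import java.util.Map;
    import java.util.Objects;

    public class AgentEvalChecks {

      // Level 1: tool trajectory scoring -- did the agent call the right
      // tools in the right order? Strict equality; looser scoring is possible.
      static boolean toolTrajectoryMatches(List<String> expected, List<String> actual) {
        return expected.equals(actual);
      }

      // Level 2: deterministic response assertions -- do the critical output
      // fields match expectations exactly? Other fields are ignored.
      static boolean criticalFieldsMatch(Map<String, Object> expected, Map<String, Object> response) {
        return expected.entrySet().stream()
            .allMatch(e -> Objects.equals(response.get(e.getKey()), e.getValue()));
      }

      public static void main(String[] args) {
        // In practice these come from a replayed production execution.
        List<String> tracedTools = List.of("lookupGeoTargets", "searchAdCategories");
        Map<String, Object> agentResponse = Map.of("objective", "REACH", "script", "...");

        boolean pass =
            toolTrajectoryMatches(List.of("lookupGeoTargets", "searchAdCategories"), tracedTools)
                && criticalFieldsMatch(Map.of("objective", "REACH"), agentResponse);

        // Levels 1 and 2 gate CI; level 3 (LLM-as-judge) runs nightly instead.
        System.out.println(pass ? "PASS" : "FAIL");
      }
    }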

Attendees will leave with reusable decomposition patterns, tool design heuristics, and an evaluation strategy they can apply regardless of whether they use ADK, LangGraph, Semantic Kernel, or a hand-rolled orchestrator.


Speaker

Pratik Rasam

Senior Engineer @Spotify

Pratik Rasam is a Senior Engineer at Spotify, where he builds Ads AI -- Spotify's multi-agent platform for AI-powered advertising. He has co-authored a post on the Spotify Engineering blog about multi-agent architecture for smarter advertising.

Date

Tuesday, Jun 2 / 11:30 AM EDT (50 minutes)

Location

Metcalf Hall Small
