We started with the obvious stack: a prompt-chain framework, some glue code, a queue for the slow parts. It demoed beautifully. Then real workloads showed up: multi-agent delegation, tool calls that took hours, workflows that needed to survive restarts, users who wanted to ask follow-up questions to a job that was still running. The framework dissolved into a pile of locks, ad-hoc state machines, and recovery code we couldn't trust.
So we threw it out and rebuilt it on the actor model. This talk is the report from the other side: what broke, why, and what the actor model (supervision trees, message passing, isolated state) actually buys you when you take it seriously instead of treating it as a metaphor.
We'll walk through the patterns that survived contact with production: tasks as a universal execution primitive that collapses the split between "chat" and "workflow," write-through state so crash recovery is the same code path as normal resumption, indirect continuation so a chain of delegations doesn't pin a thread, mutable execution graphs that agents extend at runtime, and supervision boundaries drawn where failure actually happens.
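To make one of these patterns concrete, here is a minimal sketch of write-through state. It is an illustration under assumptions, not the implementation the talk describes: `TaskStore`, `run_task`, the `crash_at` parameter, and the one-JSON-file-per-task layout are all hypothetical names chosen for the example. The point it shows is that a fresh start and a post-crash resume both load state and continue from `next_step`, so there is no separate recovery branch.

```python
import json
import os
import tempfile


class TaskStore:
    """Hypothetical durable store: one JSON file per task."""

    def __init__(self, root):
        self.root = root

    def _path(self, task_id):
        return os.path.join(self.root, f"{task_id}.json")

    def load(self, task_id):
        # A task that was never saved looks like a task at step 0.
        try:
            with open(self._path(task_id)) as f:
                return json.load(f)
        except FileNotFoundError:
            return {"next_step": 0, "results": []}

    def save(self, task_id, state):
        # Write to a temp file, then rename, so a crash mid-write
        # never leaves a half-written state file behind.
        tmp = self._path(task_id) + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self._path(task_id))


def run_task(store, task_id, steps, crash_at=None):
    """Start or resume a task; both cases take the same path."""
    state = store.load(task_id)
    while state["next_step"] < len(steps):
        i = state["next_step"]
        if crash_at == i:  # test hook standing in for a process crash
            raise RuntimeError(f"simulated crash before step {i}")
        state["results"].append(steps[i]())
        state["next_step"] = i + 1
        store.save(task_id, state)  # write through before moving on
    return state["results"]


def demo():
    calls = []

    def make_step(name):
        def step():
            calls.append(name)
            return name.upper()
        return step

    steps = [make_step(n) for n in ("plan", "search", "summarize")]
    with tempfile.TemporaryDirectory() as root:
        store = TaskStore(root)
        try:
            run_task(store, "job-1", steps, crash_at=2)
        except RuntimeError:
            pass  # the "process" died; nothing else needs cleaning up
        results = run_task(store, "job-1", steps)  # same call resumes
    return results, calls
```

Calling `demo()` runs two steps, crashes, and then finishes the third step on the second call without repeating the first two. In this sketch a step that completes but crashes before `save` would run again on resume, so steps are assumed to be safe to retry.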
If you're somewhere on the curve between "the demo works" and "it's been up for a week," this talk is the shortcut.
Speaker
Manju Rajashekhar
Entrepreneur, Engineering Executive, Founder, Investor
Manju Rajashekhar is a technology leader whose career has been shaped by systems that can't go down. He was part of Twitter's early engineering team, where he led caching systems through their growth from 50M to 200M users. The systems his team open-sourced, twemproxy and twemcache, became industry standards for distributed caching. He co-founded Blackbird AI, a search and deep learning company acquired by Etsy, where he spent eight years leading Search, Ads, Recommendations, Personalization Engine, ML Infrastructure, and Experimentation. Manju is now building his second company, focused on infrastructure for long-horizon, durable, and reliable AI agents.