This talk covers methods for evaluating AI agents, with an example of how we built evaluation frameworks for a user-facing AI agent system that has been in production for almost two years. We share a centralized testing framework that can be reused across different GenAI products and quickly customized. We cover the tools and frameworks we use, including LLM-as-Judge, rules-based evaluations, and ML metrics, along with the tradeoffs we considered and how to choose among them. Attendees will leave with concrete patterns for test dataset creation, tracing and evaluation for live apps, offline evals, scoring and roll-up, and using evaluations to improve their software.
Speaker
Susan Chang
Principal Data Scientist @Elastic - Leading Work on Production-Grade Machine Learning
Susan Shu Chang is a Principal Data Scientist at Elastic, where she leads work on production-grade machine learning, including generative AI and agentic workflows. She is also an internationally recognized speaker, O'Reilly author, and keynote presenter at major global technical conferences.