Building Reusable Evaluation Frameworks for Agentic AI Products

This talk covers methods for evaluating AI agents, drawing on how we built evaluation frameworks for a user-facing AI agent system that has been in production for almost two years. We share a centralized testing framework that can be reused across different GenAI products and quickly customized. We cover the tools and frameworks we use, the tradeoffs we considered, and how to choose among approaches such as LLM-as-Judge, rules-based evaluations, and ML metrics. Attendees will leave with concrete patterns for test dataset creation, tracing and evaluation of live apps, offline evals, scoring and roll-up, and using evaluations to improve their software.
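As a rough illustration of the kind of reusable pattern the talk describes (this sketch is not from the talk itself), a centralized framework often boils down to a set of test cases plus pluggable scorers whose results are rolled up per scorer. The names below (`EvalCase`, `run_suite`, the placeholder judge call) are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    """One test case: the prompt sent to the agent and the agent's response."""
    prompt: str
    response: str

def rule_no_empty_response(case: EvalCase) -> float:
    """Rules-based check: score 0.0 if the agent returned nothing, else 1.0."""
    return 1.0 if case.response.strip() else 0.0

def llm_judge_relevance(case: EvalCase) -> float:
    """LLM-as-Judge check: ask a judge model to grade relevance on a 0-1 scale.
    The call below is a placeholder; swap in your own judge-model client."""
    # score = judge_client.grade(prompt=case.prompt, response=case.response)
    return 1.0  # placeholder score

def run_suite(cases: list[EvalCase],
              scorers: dict[str, Callable[[EvalCase], float]]) -> dict[str, float]:
    """Run every scorer over every case and roll up the mean score per scorer."""
    return {
        name: sum(scorer(c) for c in cases) / len(cases)
        for name, scorer in scorers.items()
    }

if __name__ == "__main__":
    cases = [EvalCase(prompt="How do I reset my password?",
                      response="Go to Settings > Security and click 'Reset password'.")]
    rollup = run_suite(cases, {
        "non_empty": rule_no_empty_response,
        "relevance_judge": llm_judge_relevance,
    })
    print(rollup)  # e.g. {'non_empty': 1.0, 'relevance_judge': 1.0}
```

The same suite can be reused across products by swapping in product-specific test datasets and scorer mixes, which is the kind of customization the abstract refers to.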


Speaker

Susan Shu Chang

Principal Data Scientist @Elastic - Leading Work on Production-Grade Machine Learning

Susan Shu Chang is a Principal Data Scientist at Elastic, where she leads work on production-grade machine learning, including generative AI and agentic workflows. She is also an internationally recognized speaker, O'Reilly author, and keynote presenter at major global technical conferences.
