This talk covers methods for evaluating AI Agents, with an example of how we built evaluation frameworks for a user-facing AI Agent system that has been in production for almost two years. We share a centralized testing framework that can be reused across different GenAI products and quickly customized. We cover the tools and frameworks we use (and the tradeoffs we considered), including LLM-as-Judge, rules-based evaluations, and ML metrics, and how to choose among them. Attendees will leave with concrete patterns for creating test datasets, tracing and evaluating live apps, running offline evals, scoring and rolling up results, and using evaluations to improve their software.
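As a rough illustration of the evaluator styles mentioned above (not the framework presented in the talk), the sketch below shows a rules-based check and an LLM-as-Judge check run over a tiny dataset, with a simple score roll-up. The `call_judge_llm` stub and the example case are hypothetical placeholders; a real setup would swap in an actual LLM client and production data.

```python
# Minimal, hypothetical sketch of two evaluator styles: a deterministic
# rules-based check and an LLM-as-Judge check, rolled up into dataset-level scores.

from dataclasses import dataclass


@dataclass
class EvalCase:
    question: str       # input sent to the agent
    agent_answer: str   # output captured from the agent (e.g. from traces)
    must_contain: str   # ground-truth fact the answer should mention


def rules_based_eval(case: EvalCase) -> float:
    """Deterministic check: does the answer contain the expected fact?"""
    return 1.0 if case.must_contain.lower() in case.agent_answer.lower() else 0.0


def call_judge_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; replace with your provider's client."""
    return "1"  # assume the judge replies "1" (pass) or "0" (fail)


def llm_as_judge_eval(case: EvalCase) -> float:
    """Subjective check: ask a judge model whether the answer addresses the question."""
    prompt = (
        "Score 1 if the answer correctly addresses the question, else 0.\n"
        f"Question: {case.question}\nAnswer: {case.agent_answer}\nScore:"
    )
    return 1.0 if call_judge_llm(prompt).strip().startswith("1") else 0.0


dataset = [
    EvalCase("What port does Elasticsearch listen on by default?",
             "By default, Elasticsearch listens on port 9200.", "9200"),
]

# Roll-up: average each metric across the dataset so runs can be compared over time.
rules_score = sum(rules_based_eval(c) for c in dataset) / len(dataset)
judge_score = sum(llm_as_judge_eval(c) for c in dataset) / len(dataset)
print(f"rules-based: {rules_score:.2f}  llm-as-judge: {judge_score:.2f}")
```

The usual tradeoff is that rules-based checks are cheap and reproducible but only catch what you can express as a rule, while LLM-as-Judge handles open-ended outputs at the cost of an extra model call and some judgment noise.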
Interview:
What is your session about, and why is it important for senior software developers?
Evaluating AI features and catching regressions before they reach users is a core part of building trust in your AI products. However, evaluations work differently from typical unit tests. This session covers the building blocks of agentic AI evaluations, as well as how Elastic has internally created reusable evals that can be applied to and customized for different AI products.
What's one thing you hope attendees will implement immediately after your talk?
If you haven't already, implement tracing and monitoring for your production AI workflow, and create or revisit your evaluation datasets.
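As a rough sketch of that recommendation (assumptions, not a specific Elastic API): wrap each production agent call so its input, output, and latency are recorded, then review those records and promote the useful ones into an evaluation dataset. The `run_agent` function and the trace file path below are hypothetical placeholders.

```python
# Minimal sketch: trace each agent call as a JSON line so production behavior
# can be monitored and later turned into evaluation cases.

import json
import time
from pathlib import Path

TRACE_FILE = Path("agent_traces.jsonl")


def run_agent(user_input: str) -> str:
    """Stand-in for your real agent/LLM workflow."""
    return f"(agent response to: {user_input})"


def traced_run(user_input: str) -> str:
    start = time.time()
    output = run_agent(user_input)
    record = {
        "timestamp": start,
        "input": user_input,
        "output": output,
        "latency_s": round(time.time() - start, 3),
    }
    # Append one trace record per call; a real deployment would ship these
    # to an observability backend rather than a local file.
    with TRACE_FILE.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return output


traced_run("How do I create an index lifecycle policy?")
```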
What makes QCon stand out as a conference for senior software professionals?
I have previously attended QCon SF as a Track Host and met so many amazing speakers and attendees. I like that the tracks are curated, so the topics are timely and high quality.
Speaker
Susan Shu Chang
Principal Data Scientist @Elastic
Susan Shu Chang is a Principal Data Scientist at Elastic, where she leads work on production-grade machine learning, including generative AI and agentic workflows. She is also an internationally recognized speaker, O'Reilly author, and keynote presenter at major global technical conferences.