Decision-Driven Evaluation for Generative AI

This talk starts with an example of an evaluation suite that looked good on paper: it was comprehensive and detailed, yet it did not provide the clarity we needed to make a launch decision. Using this experience as a starting point, the talk introduces a practical framework for decision-driven evaluation, both offline and online with real user testing.

The core idea is that evaluation should not just produce metrics, which may turn out to be “metrics of convenience”, but should directly inform product and engineering decisions. We will explore how to separate content quality from product impact, how to identify the riskiest uncertainties early, and how to design evaluation systems that scale beyond spot checks and avoid whack-a-mole prompt iteration.

The focus is on building confidence in when to ship, when to invest further, and when to pivot, not just improving scores.

Main Takeaways:

  1. Tie every evaluation to a decision.
  2. Separate content quality from product impact.
  3. Evaluate the riskiest uncertainty first.
  4. Design evaluation infrastructure, not just metrics.

Interview:

What is the focus of your work these days?

I work across a variety of ML projects, including propensity models, static-prompt generative AI, and agentic experiences. I provide science guidance to a team of 25 scientists and engineers who develop and deliver Zillow’s market-leading products to real estate agents and mortgage professionals. I also cohost gen AI eval office hours, where teams from across the company come for reviews of and advice on their generative AI evaluation plans.

What is the motivation behind your talk?

I have seen many people implement evaluations by copying "best practices" from a podcast or a peer project, only to discover late in the process that the system they have built does not adequately inform the decision they need to make next. I want to help people start with the decision and use that to derive the most important properties their eval system needs to have.

Who is your session for?

This session is intended for senior engineers, applied scientists, and leaders working on machine learning and generative AI systems that are in production, or that are intended to go to production soon.


Speaker

Terran Melconian

Principal Applied Scientist - AI @Zillow

Terran Melconian is a Principal Applied Scientist at Zillow, where they lead work on evaluation and quality for generative and agentic AI systems, as well as classical machine learning models. Their focus is building decision-driven evaluation frameworks that connect model behavior to real product outcomes and long-term user impact. Previously, they have held senior technical and leadership roles at Google, TripAdvisor, Meta, and early-stage startups, where they built data science and search teams from zero to production, owned high-availability distributed systems end-to-end, and drove measurable business impact through experimentation and optimization. They hold S.M. and S.B. degrees in Aeronautics and Astronautics from MIT, where their research focused on large-scale simulation of complex systems, and this continues to influence their work on evaluating AI systems operating under uncertainty.

Date

Monday Jun 1 / 03:40PM EDT ( 50 minutes )

Location

Metcalf Hall Small

Share