AI agents that autonomously select and invoke tools are becoming ubiquitous—yet their decision-making remains a black box. When an agent chooses to query a database instead of searching the web, what internal representations drive that choice? Current approaches rely on prompt engineering, behavioral testing, or post-hoc explanations, none of which reveal the model's actual computational mechanisms.
In this talk, I introduce Kiji Inspector, an open-source framework that applies mechanistic interpretability using Sparse Autoencoders (SAEs) to produce interpretable decision factors that explain why an agent chose a specific tool. The framework provides a complete pipeline spanning three key steps:
- Capturing model activations precisely at the moment of tool commitment
- Training SAEs to decompose these high-dimensional activations into monosemantic features
- Applying contrastive analysis to isolate the truly decision-relevant dimensions from incidental noise
A rigorous evaluation framework, including token-level fuzzing, verifies that our explanations pinpoint genuine causal mechanisms rather than surface-level correlations.
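To make the SAE decomposition step concrete, here is a minimal numpy sketch of how a sparse autoencoder maps a dense activation vector into an overcomplete, mostly-zero feature space. All dimensions, weights, and the tied-weight design are illustrative assumptions, not the framework's actual implementation:

```python
import numpy as np

# Illustrative SAE-style decomposition: a tied-weight sparse autoencoder
# that encodes a dense activation into sparse, interpretable features.
rng = np.random.default_rng(0)

d_model, d_features = 64, 256          # hypothetical sizes
W_enc = rng.normal(0, 0.1, (d_model, d_features))
b_enc = np.full(d_features, -0.5)      # negative bias encourages sparsity
W_dec = W_enc.T                        # tied decoder weights (a simplification)

def sae_encode(activation):
    # ReLU(x @ W + b): only features whose pre-activation clears the
    # bias threshold fire, yielding a sparse code.
    return np.maximum(0.0, activation @ W_enc + b_enc)

def sae_decode(features):
    # Reconstruct the original activation from the sparse features.
    return features @ W_dec

x = rng.normal(0, 1, d_model)          # stand-in for a captured activation
f = sae_encode(x)
x_hat = sae_decode(f)

print(f"active features: {(f > 0).sum()} / {d_features}")
```

In a real pipeline the encoder and decoder weights would be trained to minimize reconstruction error plus a sparsity penalty over many captured activations; this sketch only shows the forward pass that produces the monosemantic feature code.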
I will walk through the full methodology with worked examples, demonstrating the pipeline in action. I will also share critical lessons learned along the way: the surprising efficacy of contrastive pairs as post-hoc probes, and the necessity of token-level validation to avoid the common pitfall of treating feature labels as ground truth. Attendees will leave with both a conceptual framework and a practical, open-source toolkit for bringing transparency to agentic AI systems.
Interview:
What is the focus of your current work?
The goal of this session is to show attendees that we need to move beyond "what did the agent do?" toward "why did the agent make this decision?" My work at Dataiku is to build mechanisms that help answer that question. I have been building an end-to-end mechanistic interpretability pipeline that takes raw agent interactions through contrastive pair generation, activation extraction, SAE training, and automated feature labeling, culminating in validated explanations of tool-selection decisions. The system currently processes 500K+ contrastive pairs across multiple domains (tool selection, investment analysis, manufacturing, supply chain, customer support), using NVIDIA's Nemotron3 30B-parameter model as the subject model and Qwen's 235B-parameter models for data generation and evaluation.
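The contrastive analysis stage of such a pipeline can be sketched in a few lines: compare SAE feature vectors captured at the decision token for paired prompts, and the mean difference highlights which dimensions track the tool choice. This is a toy numpy example with fabricated data and hypothetical feature indices, not the production system:

```python
import numpy as np

# Toy contrastive activation analysis: given SAE features captured at the
# decision token for paired "search the web" vs. "query the database"
# prompts, the mean feature difference isolates decision-relevant dims.
rng = np.random.default_rng(1)
n_pairs, d_features = 100, 32

base = rng.normal(0, 0.1, (n_pairs, d_features))  # shared, incidental signal
web_feats = base.copy()
db_feats = base.copy()
web_feats[:, 3] += 1.0   # pretend feature 3 tracks "external web search"
db_feats[:, 7] += 1.0    # pretend feature 7 tracks "internal lookup"

# Incidental dimensions cancel across the pair; decision dims survive.
diff = (web_feats - db_feats).mean(axis=0)
top = np.argsort(-np.abs(diff))[:2]
print("most decision-relevant features:", sorted(top.tolist()))
```

The point of the contrast is visible in the `diff` line: anything the two prompts share cancels out, leaving only the dimensions that separate the two tool choices.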
What is the motivation behind your presentation?
Agents are being deployed in high-stakes contexts (financial analysis, manufacturing control, customer support), yet we have remarkably little insight into *how* they make decisions. I see teams debugging agent failures through trial-and-error prompt modifications, treating the model as an oracle rather than a computational system with inspectable internals.
The interpretability community has powerful tools for understanding language model internals, but most work focuses on next-token prediction in isolated contexts. Applying these techniques to agentic decision-making, where a model must integrate context, compare tool affordances, and commit to an action, requires new methodologies. I want to share a practical, reproducible approach that others can adapt, along with hard-won lessons about what works (contrastive pairs as post-hoc probes) and what doesn't (assuming feature labels are correct without token-level validation).
What are you hoping the audience walks away with after your talk?
- Contrastive dataset construction: Practical guidance on building training datasets of contrastive pairs that isolate the subtle request-level differences (e.g., "look this up internally" vs. "search the web") that drive an agent's tool choice.
- Decision token extraction: How to identify when and where a model commits to a tool choice, and why this matters for interpretability
- Contrastive activation analysis: Using contrastive pairs to determine what each SAE feature dimension represents (e.g., dimension 0 fires on requests for external web searches)
- Token-level fuzzing for explanation validation: A rigorous evaluation framework (adapted from EleutherAI's autointerp) that tests whether feature labels identify the correct tokens that activate features, catching explanations that are "right for the wrong reasons"
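The core idea behind token-level fuzzing can be illustrated with a toy check, loosely inspired by the autointerp approach: compare the tokens a feature label *claims* to fire on against the tokens where the feature *actually* activates. The function name, threshold, and data below are all hypothetical:

```python
# Toy token-level validation of a feature label: does the label's claimed
# vocabulary match the tokens whose activation actually clears a threshold?

def fuzz_check(tokens, activations, claimed_tokens, threshold=0.5):
    """Precision/recall of a label's claimed tokens against the tokens
    the feature actually fires on (hypothetical scoring scheme)."""
    fired = {t for t, a in zip(tokens, activations) if a > threshold}
    claimed = set(claimed_tokens)
    tp = len(fired & claimed)
    precision = tp / len(claimed) if claimed else 0.0
    recall = tp / len(fired) if fired else 0.0
    return precision, recall

tokens = ["please", "search", "the", "web", "for", "recent", "news"]
acts   = [0.0,      0.9,     0.1,  0.8,   0.0,  0.2,      0.1]

# A good label ("fires on web-search vocabulary") scores well...
p, r = fuzz_check(tokens, acts, ["search", "web"])
print(f"precision={p:.2f} recall={r:.2f}")

# ...while a plausible-sounding but wrong label scores zero, exposing an
# explanation that is "right for the wrong reasons".
bad_p, _ = fuzz_check(tokens, acts, ["news", "recent"])
print(f"bad-label precision={bad_p:.2f}")
```

A label that merely sounds plausible for the prompt as a whole fails this check unless the specific tokens it names are the ones driving the activation, which is exactly the failure mode the talk warns against.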
Who is your talk for?
- Primary audience: ML engineers and researchers building or evaluating AI agents, particularly those working on tool-use, function calling, or agentic systems, who need to understand or debug agent behavior beyond behavioral testing.
- Secondary audience: AI Safety practitioners concerned with understanding and auditing autonomous agent decisions, and infrastructure engineers managing multi-model inference pipelines.
Speaker
Hannes Hapke
Head of 575 Lab @Dataiku, Google Developer Expert for ML/AI, Member of the Google Developer Board, Previously Principal ML Engineer @Digits
Hannes Hapke leads 575 Lab, Dataiku's Open Source Office, specializing in responsible AI, privacy, and ML explainability. He's a Google Developer Expert for ML/AI and a member of the Google Developer Board. Previously a Principal ML Engineer at Digits for 5 years, he focused on applying ML to boost accountants' productivity. Hannes is the co-author of four ML books, including *Generative AI Design Patterns* and *Machine Learning Production Systems*.