Evaluation
How to actually know if your agent is good — regression suites, LLM-as-judge, trace-based eval, benchmark literacy.
Through 2023, “did you eval your LLM application?” was met with awkward silence. By 2026, it’s table stakes — and there’s a market category of eval tooling (RAGAS, Braintrust, Promptfoo, DeepEval, Phoenix) to prove it.
This module is being expanded with hands-on eval code.
Coming in the next revision:
- Regression tests. Known scenarios with expected outcomes. The first thing to build; it pays for itself the first time it catches a model-upgrade regression.
- LLM-as-judge. Use a strong model to grade outputs. Surprisingly well aligned with human judgment for many tasks, with known limits and biases (position bias, verbosity bias, self-preference).
- Trace-based eval. Score each agent step rather than only the final answer. Diagnose where in the loop regressions appear.
- Online eval. Sample production traces, grade them, build a quality dashboard. The continuous-evaluation pattern.
- Benchmarks. SWE-Bench Verified (code agents), GAIA (general assistant), WebArena (browser agents), AgentBench, BFCL (function calling), TAU-bench (customer support). What each measures, where they don’t transfer to your application.
- The eval framework landscape. RAGAS, DeepEval, Promptfoo, Braintrust, Phoenix, OpenAI Evals. When to pick each.
- A worked example: a 20-test regression suite for the module-01 weather agent, run on every change.
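Until the full revision lands, here is the shape of the first item, a regression suite, as a minimal sketch. `run_agent` is a canned stub standing in for a real agent entry point (an assumption so the example is self-contained); the scenario table with `must_contain` / `must_not_contain` checks is the part that carries over.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    prompt: str
    must_contain: list[str]                      # substrings the answer must include
    must_not_contain: list[str] = field(default_factory=list)  # hallucination tripwires

SCENARIOS = [
    Scenario("weather in Paris today", must_contain=["Paris"], must_not_contain=["London"]),
    Scenario("forecast in Tokyo", must_contain=["Tokyo"]),
]

def run_agent(prompt: str) -> str:
    # Stub: replace with your real agent call.
    city = prompt.split(" in ")[-1].split()[0]
    return f"Looking up the weather... {city}: 18°C, clear."

def run_suite() -> list[tuple[str, bool]]:
    """Run every scenario; return (prompt, passed) pairs."""
    results = []
    for s in SCENARIOS:
        out = run_agent(s.prompt)
        ok = all(t in out for t in s.must_contain) and not any(
            t in out for t in s.must_not_contain
        )
        results.append((s.prompt, ok))
    return results
```

Substring checks are deliberately crude; they catch the gross regressions cheaply, and anything subtler moves up to a judge.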
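LLM-as-judge reduces to a rubric prompt plus strict parsing of the verdict. A sketch, with `call_judge` stubbed where a strong-model API call would go (the prompt wording and JSON schema are assumptions, not a fixed standard):

```python
import json

JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}
Score correctness from 1 to 5 and reply as JSON: {{"score": <int>, "reason": "<short>"}}"""

def call_judge(prompt: str) -> str:
    # Stub: replace with a chat-completion call to a strong model.
    return '{"score": 4, "reason": "mostly correct, minor omission"}'

def judge(question: str, answer: str) -> dict:
    """Grade one answer; validate the verdict before trusting it."""
    raw = call_judge(JUDGE_PROMPT.format(question=question, answer=answer))
    verdict = json.loads(raw)
    if not (isinstance(verdict.get("score"), int) and 1 <= verdict["score"] <= 5):
        raise ValueError(f"malformed judge verdict: {raw!r}")
    return verdict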
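Trace-based eval means checking each step of the agent loop, not just the final answer, so a regression points at the step that broke. The trace format below (dicts with a `type` field) is a made-up assumption; adapt it to whatever your tracing layer emits.

```python
def check_step(step: dict) -> bool:
    """Per-step checks; unknown step types pass by default."""
    if step["type"] == "tool_call":
        # Right tool, and the call succeeded?
        return step.get("status") == "ok" and step.get("tool") == step.get("expected_tool")
    if step["type"] == "final_answer":
        # Answer contains the facts this scenario expects?
        return all(fact in step["text"] for fact in step.get("expected_facts", []))
    return True

def first_failing_step(trace: list[dict]) -> int | None:
    """Index of the first step that fails its check, or None if all pass."""
    for i, step in enumerate(trace):
        if not check_step(step):
            return i
    return None
```

The payoff is diagnostic: a failure at step 0 is a tool-selection bug, a failure at the last step is a synthesis bug, and the dashboard can aggregate failure indices across runs.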
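Online eval is the same grading applied to a sample of production traffic. A sketch under stated assumptions: `grade` is a placeholder for an LLM-judge call, and the trace dicts (with `id` and `user_feedback` keys) are hypothetical.

```python
import random

def grade(trace: dict) -> float:
    # Stub grader: in production this would be an LLM-as-judge call.
    return 0.0 if trace.get("user_feedback") == "thumbs_down" else 1.0

def sample_for_eval(traces: list[dict], rate: float = 0.05, seed: int = 42) -> list[tuple]:
    """Grade a random `rate` fraction of production traces.

    Seeded for reproducibility; the (id, score) pairs feed a quality dashboard.
    """
    rng = random.Random(seed)
    return [(t["id"], grade(t)) for t in traces if rng.random() < rate]
```

Sampling keeps judge costs bounded at production volume; the rate is a cost/latency knob, not a correctness one.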
Next: Module 10 — Production.