~60 min read · updated 2026-05-10

Evaluation

How to actually know if your agent is good — regression suites, LLM-as-judge, trace-based eval, benchmark literacy.

Through 2023, “did you eval your LLM application?” was met with awkward silence. By 2026, it’s table-stakes — and there’s a market category of eval tooling (RAGAS, Braintrust, Promptfoo, DeepEval, Phoenix) to prove it.

This module is being expanded with hands-on eval code.

Coming in the next revision:

  • Regression tests. Known scenarios with expected outcomes. The first thing to build; it pays for itself the first time it catches a model-upgrade regression (a pytest-style sketch follows this list).
  • LLM-as-judge. Use a strong model to grade outputs. Surprisingly aligned with human judgment for many tasks, but with known limits and biases: position bias, verbosity bias, self-preference (sketch below).
  • Trace-based eval. Score each agent step rather than only the final answer, so you can diagnose where in the loop regressions appear (sketch below).
  • Online eval. Sample production traces, grade them, build a quality dashboard. The continuous-evaluation pattern (sketch below).
  • Benchmarks. SWE-Bench Verified (code agents), GAIA (general assistant), WebArena (browser agents), AgentBench, BFCL (function calling), TAU-bench (customer support). What each measures, where they don’t transfer to your application.
  • The eval framework landscape. RAGAS, DeepEval, Promptfoo, Braintrust, Phoenix, OpenAI Evals. When to pick each.
  • A worked example: a 20-test regression suite for the module-01 weather agent, run on every change.
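
Until the full revision lands, here are preliminary sketches of the core patterns. First, a minimal regression suite in pytest style. `run_agent` is a hypothetical stand-in for your agent's entry point (e.g. the module-01 weather agent); the cases and predicates are illustrative, not the promised 20-test suite.

```python
# Regression-suite sketch (pytest). `run_agent` is a hypothetical
# placeholder; wire it to your real agent entry point.
import pytest

def run_agent(prompt: str) -> str:
    """Placeholder for the agent's entry point."""
    raise NotImplementedError("call your agent here")

# Each case pairs an input with a predicate the final answer must satisfy.
# Predicates beat exact string match: LLM wording drifts, facts shouldn't.
CASES = [
    ("What's the weather in Paris?", lambda a: "paris" in a.lower()),
    ("Is it raining in Tokyo right now?", lambda a: "tokyo" in a.lower()),
    ("Weather for a city that does not exist: Xyzzyville",
     lambda a: "unknown" in a.lower() or "couldn't find" in a.lower()),
]

@pytest.mark.parametrize("prompt,check", CASES)
def test_regression(prompt, check):
    answer = run_agent(prompt)
    assert check(answer), f"regression on {prompt!r}: {answer!r}"
```

Run it on every change; a failure points at a concrete scenario rather than a vibes-based "the agent feels worse."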
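Second, an LLM-as-judge sketch using the OpenAI Python SDK (assumed installed, v1+, with `OPENAI_API_KEY` set). The judge model and the 1–5 rubric are placeholder choices, not recommendations.

```python
# LLM-as-judge sketch. Model name and rubric are illustrative.
from openai import OpenAI

client = OpenAI()

RUBRIC = """You are grading an AI assistant's answer.

Question: {question}
Answer: {answer}

Score 1-5 for factual correctness and helpfulness.
Reply with only the integer score."""

def judge(question: str, answer: str, model: str = "gpt-4o") -> int:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # keep grading as deterministic as the API allows
        messages=[{"role": "user", "content": RUBRIC.format(
            question=question, answer=answer)}],
    )
    return int(resp.choices[0].message.content.strip())
```

The biases named above are real; common mitigations include pairwise comparison with order swapping and grading against a reference answer.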
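Third, a trace-based eval sketch. The trace shape (a list of step dicts with `tool`, `args`, and an optional `error` key) is an assumption; adapt it to however your agent framework records steps.

```python
# Trace-based eval sketch: grade each step, not just the final answer.
# The step schema below is assumed, not any particular framework's.
from dataclasses import dataclass

@dataclass
class StepScore:
    index: int
    passed: bool
    reason: str

def eval_trace(trace: list[dict]) -> list[StepScore]:
    scores = []
    for i, step in enumerate(trace):
        if step.get("error"):
            scores.append(StepScore(i, False, f"tool error: {step['error']}"))
        elif step["tool"] == "get_weather" and "city" not in step["args"]:
            scores.append(StepScore(i, False, "weather call missing 'city'"))
        else:
            scores.append(StepScore(i, True, "ok"))
    return scores

# Example: a two-step trace where the second call dropped its argument.
for s in eval_trace([
    {"tool": "get_weather", "args": {"city": "Paris"}},
    {"tool": "get_weather", "args": {}},
]):
    print(s)
```

Step-level scores tell you whether a regression came from tool selection, argument construction, or final-answer synthesis.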
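Finally, an online-eval sketch. `fetch_recent_traces` and `record_metric` are hypothetical hooks into your trace store and metrics backend, and the 5% sample rate is illustrative.

```python
# Online-eval sketch: sample production traces and grade them offline.
# Both callables are hypothetical hooks; `grade` is the judge above.
import random
from typing import Callable, Iterable

SAMPLE_RATE = 0.05  # grade roughly 5% of production traffic

def run_online_eval(
    fetch_recent_traces: Callable[[], Iterable[dict]],
    grade: Callable[[str, str], int],
    record_metric: Callable[..., None],
) -> None:
    for trace in fetch_recent_traces():
        if random.random() > SAMPLE_RATE:
            continue  # sampling keeps judge costs bounded
        score = grade(trace["question"], trace["final_answer"])
        record_metric("agent.judge_score", score, model=trace["model"])
```

Feed the scores into a dashboard and alert on drops; this closes the loop between the offline regression suite and what users actually see.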

Next: Module 10 — Production.