Advanced · 1 hour · 7 lessons

Evaluating AI Applications

Build eval pipelines that catch regressions before they ship — from code-graded tests and LLM-as-judge to Promptfoo, red-teaming, and production monitoring.

What you'll learn

Why evals matter
Defining success metrics
LLM-as-judge
Ground-truth comparison
Red-teaming and adversarial testing
Regression testing for prompt changes

Your instructor

Dr. Lena Schulz

AI Safety Researcher, Anthropic

Lena works on scalable oversight and Constitutional AI.

Syllabus

01

The Cost of Shipping Without Evals

Why manual spot-checking fails, the eval development lifecycle, and how to define success criteria that actually measure something.

02

Code-Graded Evals

Write deterministic eval functions for classification, extraction, and structured output tasks — the fastest and most reliable evaluation method.

03

LLM-as-Judge

Use a second model to evaluate outputs that require judgment — tone, helpfulness, reasoning quality, and safety.

04

Eval Tooling at Scale

Move from hand-built scripts to production eval infrastructure — Promptfoo, the Anthropic Console, and the Batch API.

05

Red-Teaming and Adversarial Testing

Systematically probe your AI applications for harmful outputs, jailbreaks, data leakage, and failure modes your standard evals won't catch.

06

Evals in Production

Wire evals into CI/CD, monitor live traffic, apply statistical rigor, and build an eval culture that scales with your team.

07

Write an Eval System

Design and implement an automated evaluation system for AI model outputs. You'll define evaluation criteria across multiple dimensions and build both LLM-graded and programmatic scorers.

This course includes

  • 7 self-paced lessons
  • 1 hour of content
  • Claude tutor on every lesson
  • Certificate of completion

Free to start. No credit card required.