What you'll learn
Your instructor
Dr. Lena Schulz
AI Safety Researcher, Anthropic
Lena works on scalable oversight and Constitutional AI.
Syllabus
The Cost of Shipping Without Evals
Why manual spot-checking fails, the eval development lifecycle, and how to define success criteria that actually measure something.
Code-Graded Evals
Write deterministic eval functions for classification, extraction, and structured output tasks — the fastest and most reliable evaluation method.
LLM-as-Judge
Use a second model to evaluate outputs that require judgment — tone, helpfulness, reasoning quality, and safety.
Eval Tooling at Scale
Move from hand-built scripts to production eval infrastructure — Promptfoo, the Anthropic Console, and the Batch API.
Red-Teaming and Adversarial Testing
Systematically probe your AI applications for harmful outputs, jailbreaks, data leakage, and failure modes your standard evals won't catch.
Evals in Production
Wire evals into CI/CD, monitor live traffic, apply statistical rigor, and build an eval culture that scales with your team.
Write an Eval System
Design and implement an automated evaluation system for AI model outputs. You'll define evaluation criteria across multiple dimensions and build both LLM-graded and programmatic scorers.
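As a rough illustration of what such a system might look like, here is a minimal sketch that scores outputs on several dimensions with both programmatic and LLM-graded scorers. It is not the course's reference solution. All names in it (`EvalCase`, `exact_match`, `valid_json`, `make_judge_scorer`, `run_evals`) are hypothetical, and the judge is assumed to be any callable that takes a grading prompt and returns text, so no particular model API is implied.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One test case (hypothetical structure, for illustration only)."""
    prompt: str    # input given to the model under test
    output: str    # what the model produced
    expected: str  # reference label or answer


def exact_match(case: EvalCase) -> float:
    """Programmatic scorer: 1.0 if the normalized output equals the expected label."""
    return float(case.output.strip().lower() == case.expected.strip().lower())


def valid_json(case: EvalCase) -> float:
    """Programmatic scorer: 1.0 if the output parses as a JSON object."""
    try:
        return float(isinstance(json.loads(case.output), dict))
    except json.JSONDecodeError:
        return 0.0


def make_judge_scorer(judge: Callable[[str], str]) -> Callable[[EvalCase], float]:
    """LLM-graded scorer: wraps an assumed judge callable (grading prompt in, text out)."""
    def score(case: EvalCase) -> float:
        verdict = judge(
            "Reply PASS or FAIL. Is this response helpful?\n"
            f"Question: {case.prompt}\nResponse: {case.output}"
        )
        return float(verdict.strip().upper().startswith("PASS"))
    return score


def run_evals(
    cases: List[EvalCase],
    scorers: Dict[str, Callable[[EvalCase], float]],
) -> Dict[str, float]:
    """Return the mean score per dimension across all cases."""
    return {
        name: sum(fn(case) for case in cases) / len(cases)
        for name, fn in scorers.items()
    }
```

To try it without a model, pass a stand-in judge such as `lambda grading_prompt: "PASS"` to `make_judge_scorer`, then call `run_evals(cases, {"accuracy": exact_match, "format": valid_json, "helpfulness": judge_scorer})`. Keeping the judge behind a plain callable means the programmatic scorers stay deterministic and testable on their own, and the model-backed judge can be swapped in later.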