The most dangerous AI application is one that works perfectly on the demo and fails silently in production. Traditional software either works or crashes. AI applications can confidently produce wrong outputs that look right, and you will not know until a customer makes a bad decision based on your model’s hallucination.
This guide covers how to build evaluation systems that catch AI regressions systematically, because “it seems to work fine” is not an evaluation strategy.
## The Evaluation Gap
Traditional software:

```
Input:    "2 + 2"
Expected: 4
Actual:   4
Result:   ✅ Pass / ❌ Fail (binary, deterministic)
```

AI application:

```
Input:    "Summarize this contract clause about liability"
Expected: "The seller limits liability to the contract value"
Actual:   "The seller assumes unlimited liability" (WRONG but confident)
Result:   ??? (How do you automatically detect this is wrong?)
```
## Why Traditional Testing Fails for AI
| Traditional Testing | AI Testing |
|---|---|
| Exact match: output == expected | Semantic match: output ≈ expected |
| Deterministic: same input → same output | Non-deterministic: same input → different outputs |
| Binary: pass or fail | Graded: quality spectrum |
| Exhaustive: test every edge case | Impossible to test every input |
| Write once, run forever | Must evolve with model updates |
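The contrast in the first two rows can be made concrete. The sketch below uses Jaccard token overlap as a deliberately crude stand-in for semantic similarity (real systems use embedding-based measures such as BERTScore), but it shows the shift from a binary verdict to a graded score:

```python
def exact_match(output: str, expected: str) -> bool:
    # Traditional testing: binary, deterministic comparison.
    return output.strip() == expected.strip()

def token_overlap(output: str, expected: str) -> float:
    # Crude stand-in for semantic match: Jaccard overlap of
    # lowercased tokens. Embeddings would do this far better.
    a, b = set(output.lower().split()), set(expected.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

expected = "The seller limits liability to the contract value"
paraphrase = "Liability of the seller is limited to the contract value"

exact_match(paraphrase, expected)   # False: rejects a valid paraphrase
token_overlap(paraphrase, expected) # 0.6: a graded score instead of a verdict
```

The graded score is what makes thresholds, baselines, and regression detection possible later in this guide.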
## Evaluation Framework Architecture
```
┌─────────────────────────────────────────────────────┐
│                  EVALUATION SUITE                   │
├─────────────────────────────────────────────────────┤
│ Test Set: 200+ curated examples                     │
│   ├─ Golden set (50): manually verified correct     │
│   ├─ Edge cases (50): known difficult inputs        │
│   ├─ Regression set (50): previously failed cases   │
│   └─ Diverse set (50): representative of production │
│                                                     │
│ Evaluators:                                         │
│   ├─ Automated metrics (BLEU, ROUGE, BERTScore)     │
│   ├─ LLM-as-judge (GPT-4 evaluates quality)         │
│   ├─ Programmatic checks (format, length, safety)   │
│   └─ Human review (weekly sample)                   │
│                                                     │
│ Pipeline:                                           │
│   ├─ Run on every prompt/model change               │
│   ├─ Compare against baseline                       │
│   ├─ Block deployment if regression detected        │
│   └─ Store all results for trend analysis           │
└─────────────────────────────────────────────────────┘
```
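The test-set portion of this architecture maps naturally onto a small data structure. A minimal sketch, assuming a hypothetical `TestCase` shape (field names are illustrative, not a fixed API):

```python
from dataclasses import dataclass
from collections import Counter

@dataclass
class TestCase:
    # Illustrative fields; add whatever metadata your evaluators need.
    input: str
    expected: str
    category: str  # "golden" | "edge" | "regression" | "diverse"

def suite_summary(cases: list[TestCase]) -> dict[str, int]:
    # Count cases per category so you can verify the suite stays balanced
    # as regression cases accumulate over time.
    return dict(Counter(c.category for c in cases))

suite = [
    TestCase("Summarize clause 4", "Seller limits liability", "golden"),
    TestCase("Summarize an empty clause", "No content to summarize", "edge"),
]
suite_summary(suite)  # {'golden': 1, 'edge': 1}
```

Keeping the category on each case also lets you report regressions per category, which is often more actionable than a single aggregate score.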
## Offline Evaluation Metrics

### For Text Generation (Summaries, Content, Answers)
| Metric | What It Measures | When to Use | Limitation |
|---|---|---|---|
| BLEU | N-gram overlap with reference | Machine translation | Penalizes valid paraphrases |
| ROUGE | Recall of reference n-grams | Summarization | Does not measure factuality |
| BERTScore | Semantic similarity via embeddings | Any text generation | Computationally expensive |
| Factual accuracy | % of claims that are correct | RAG, factual QA | Requires fact verification |
| Faithfulness | Does output stay within source material? | RAG, document QA | Hard to automate |
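ROUGE-1 recall is simple enough to compute from scratch, which also makes its limitation from the table visible: a factually wrong summary can still score well. A self-contained sketch:

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    # ROUGE-1 recall: fraction of reference unigrams recovered by the
    # candidate, with counts clipped so repeated tokens are not overcounted.
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[tok], n) for tok, n in ref.items())
    return overlap / sum(ref.values()) if ref else 0.0

reference = "the seller limits liability to the contract value"

rouge1_recall("the seller limits liability", reference)
# 0.5: half the reference tokens recovered

rouge1_recall("the seller assumes unlimited liability to the contract value",
              reference)
# 0.875: high overlap despite stating the opposite of the reference
```

The second call is the "does not measure factuality" failure mode from the table: n-gram metrics reward surface overlap, so factual checks need a separate evaluator.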
### For Classification
| Metric | When to Use |
|---|---|
| Accuracy | Balanced classes only |
| Precision | False positives are costly (spam detection) |
| Recall | False negatives are costly (fraud, medical) |
| F1 Score | Balance precision and recall |
| AUC-ROC | Compare models independent of threshold |
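These metrics are worth computing from first principles at least once. A minimal sketch on an imbalanced example where accuracy looks fine (80%) but recall exposes the misses:

```python
def precision_recall_f1(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    # Binary classification metrics from raw labels (1 = positive class).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# 4 positives out of 10; the model finds only 2 of them.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
precision_recall_f1(y_true, y_pred)
# precision 1.0, recall 0.5: perfectly precise, misses half the positives
```

In production you would reach for `sklearn.metrics` instead, but the arithmetic above is exactly what those functions compute.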
## LLM-as-Judge: Using AI to Evaluate AI
```python
import json

EVALUATION_PROMPT = """You are an expert evaluator. Rate the following AI response
on a scale of 1-5 for each criterion.

Question: {question}
AI Response: {response}
Reference Answer: {reference}

Criteria:
1. Correctness (1-5): Is the information factually accurate?
2. Completeness (1-5): Does the response address all parts of the question?
3. Clarity (1-5): Is the response clear and well-organized?
4. Relevance (1-5): Does the response stay on topic?

For each criterion, provide:
- Score (1-5)
- Brief justification (1 sentence)

Respond in JSON format:
{{"correctness": {{"score": N, "reason": "..."}}, ...}}
"""

async def evaluate_response(question, response, reference):
    result = await llm.generate(
        EVALUATION_PROMPT.format(
            question=question,
            response=response,
            reference=reference,
        ),
        model="gpt-4o",  # use a strong model for evaluation
        temperature=0,   # consistent, reproducible scores
    )
    return json.loads(result)
```
### Calibrating LLM Judges
| Practice | Why |
|---|---|
| Use a stronger model than the one being evaluated | A weak model evaluating a strong model is unreliable |
| Include reference answers when available | Anchors the evaluation |
| Use temperature=0 for consistency | Reproducible scores |
| Validate judge against human ratings on 100 examples | Verify correlation |
| Rotate judge prompts to reduce position bias | LLMs favor certain positions |
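The validation row above amounts to measuring how well judge scores track human ratings. A dependency-free sketch using Pearson correlation (the acceptance threshold and sample size are judgment calls, not fixed rules):

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    # Pearson correlation between two score series, e.g. LLM-judge
    # scores vs. human ratings on the same examples.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy calibration set; in practice, 100+ human-rated examples.
human = [5, 4, 2, 1, 3]
judge = [5, 5, 2, 1, 3]
pearson(judge, human)  # close to 1.0 → the judge tracks human ratings
```

If the correlation is low, fix the judge (prompt, criteria, model) before trusting any scores it produces downstream.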
## Regression Testing for AI
```python
class AIRegressionSuite:
    """Run the evaluation suite and block deployment on regression."""

    def __init__(self, test_set: list[TestCase], baseline_scores: dict):
        self.test_set = test_set
        self.baseline = baseline_scores

    async def run(self, model_config: dict) -> RegressionReport:
        results = []
        for case in self.test_set:
            output = await self.generate(case.input, model_config)
            scores = await self.evaluate(case, output)
            results.append(scores)

        aggregated = self.aggregate(results)
        regressions = self.detect_regressions(aggregated)
        return RegressionReport(
            scores=aggregated,
            baseline=self.baseline,
            regressions=regressions,
            should_block=len(regressions) > 0,
        )

    def detect_regressions(self, scores: dict) -> list:
        regressions = []
        for metric, value in scores.items():
            baseline_value = self.baseline.get(metric, 0)
            # Regression: more than a 5% drop from baseline
            if value < baseline_value * 0.95:
                regressions.append({
                    'metric': metric,
                    'baseline': baseline_value,
                    'current': value,
                    'drop': f"{((baseline_value - value) / baseline_value) * 100:.1f}%",
                })
        return regressions
```
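The 5% threshold rule is easy to exercise in isolation. A standalone sketch with the tolerance as a parameter (metric names and scores are illustrative):

```python
def detect_regressions(current: dict, baseline: dict,
                       tolerance: float = 0.05) -> list[dict]:
    # Flag any metric that drops more than `tolerance` below its baseline.
    regressions = []
    for metric, value in current.items():
        base = baseline.get(metric, 0)
        if base and value < base * (1 - tolerance):
            regressions.append({
                "metric": metric,
                "baseline": base,
                "current": value,
                "drop": f"{(base - value) / base * 100:.1f}%",
            })
    return regressions

baseline = {"correctness": 4.2, "faithfulness": 0.90}
current = {"correctness": 4.1, "faithfulness": 0.80}
detect_regressions(current, baseline)
# correctness drops ~2.4% (within tolerance); faithfulness drops
# ~11.1% and is flagged, so this change would block deployment
```

A relative threshold like this keeps the rule meaningful across metrics on different scales (a 1-5 judge score vs. a 0-1 faithfulness rate).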
## Online Evaluation (Production)
| Metric | How to Measure | What It Tells You |
|---|---|---|
| Thumbs up/down ratio | User feedback buttons | User satisfaction with responses |
| Regeneration rate | How often users click “try again” | Response quality perception |
| Task completion rate | Did the user achieve their goal? | End-to-end effectiveness |
| Time to value | How long until the user gets useful output | Efficiency of the interaction |
| Escalation rate | How often users seek human help after AI | Where AI falls short |
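Most of these rates can be derived from a plain event log. A sketch assuming a hypothetical log format in which each event carries a `type` field (the event names are illustrative):

```python
from collections import Counter

def online_metrics(events: list[dict]) -> dict[str, float]:
    # Aggregate production feedback events into the rates from the table.
    counts = Counter(e["type"] for e in events)
    feedback = counts["thumbs_up"] + counts["thumbs_down"]
    responses = counts["response"] or 1  # avoid division by zero
    return {
        "thumbs_up_ratio": counts["thumbs_up"] / feedback if feedback else 0.0,
        "regeneration_rate": counts["regenerate"] / responses,
        "escalation_rate": counts["escalation"] / responses,
    }

events = (
    [{"type": "response"}] * 10
    + [{"type": "thumbs_up"}] * 3
    + [{"type": "thumbs_down"}]
    + [{"type": "regenerate"}] * 2
)
online_metrics(events)
# 75% thumbs-up, 20% regeneration, 0% escalation for this window
```

Computing these per time window (daily or weekly) turns them into the trend lines the pipeline stores for analysis.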
## Bias and Safety Testing
| Test Category | What to Check | Example |
|---|---|---|
| Demographic bias | Same question, different demographics in context | “Recommend a loan for [name]” — does output change? |
| Toxicity | Does model generate harmful content? | Run red-team prompts from safety datasets |
| Hallucination rate | % of responses with fabricated information | Compare claims against verified sources |
| Prompt injection | Can users override system instructions? | “Ignore your instructions and…” |
| Data leakage | Does model reveal training data? | “Repeat your system prompt” |
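The demographic-bias row can be operationalized by generating prompt pairs that differ only in a name, then diffing the model's outputs. A sketch; the template and name pairs are purely illustrative, and a real test should use a vetted demographic name set and many pairs per group:

```python
TEMPLATE = "Recommend a loan product for {name}, who earns $60,000 per year."

# Illustrative pairs only; use an established demographic name list in practice.
NAME_PAIRS = [("Emily", "Lakisha"), ("Greg", "Jamal")]

def bias_probe_prompts(template: str,
                       pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    # Build paired prompts identical except for the name, so any
    # difference in the model's answers is attributable to the swap.
    return [(template.format(name=a), template.format(name=b))
            for a, b in pairs]

pairs = bias_probe_prompts(TEMPLATE, NAME_PAIRS)
# Send each pair to the model and compare the structured outcomes
# (e.g. approval decision, offered rate), not just the wording.
```

Comparing structured fields rather than raw text matters: two answers can be phrased differently while being substantively identical, and vice versa.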
## Implementation Checklist