LLM Evaluation and Benchmarking
Systematically evaluate large language model performance for your use case. Covers evaluation frameworks, hallucination detection, human evaluation, automated metrics, A/B testing of LLMs, and the patterns that prevent shipping LLM features that look impressive in demos but fail in production.
An LLM that performs well on a demo prompt might hallucinate on real user queries. Evaluation is the bridge between impressive demos and reliable production features. Without systematic evaluation, you are shipping vibes — hoping the model works well enough.
Evaluation Dimensions
Correctness:
Does the output contain factually correct information?
Method: Compare against known-correct answers (ground truth)
Metric: Accuracy, F1 score, exact match
Faithfulness:
Is the output grounded in the provided context?
Method: Check if every claim can be attributed to source docs
Metric: Faithfulness score (0-1)
Critical for: RAG systems, Q&A, summarization
Relevance:
Does the output address the user's question?
Method: LLM-as-judge or human evaluation
Metric: Relevance score (1-5 scale)
Harmfulness:
Does the output contain harmful content?
Method: Safety classifiers, red-team testing
Metric: Harm rate, refusal rate
Fluency:
Is the output well-written and coherent?
Method: Perplexity, human evaluation
Metric: Fluency score, readability grade
Latency:
How fast is the response?
Method: Measure time to first token, total response time
Metric: P50, P95, P99 latency
Cost:
How much does each call cost?
Method: Track input/output tokens per call
Metric: Cost per query, cost per user session (see the sketch after this list)
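Latency and cost are the easiest of these dimensions to automate. The sketch below shows one way to capture them per call; everything in it is illustrative: `client.generate()` stands in for your model client and is assumed to report token counts, `timed_call` and `summarize` are hypothetical helpers, and the per-token prices are placeholders for your provider's actual rates.

import time

import numpy as np

# Placeholder prices; substitute your provider's actual per-token rates.
INPUT_PRICE_PER_1K = 0.003
OUTPUT_PRICE_PER_1K = 0.015

def timed_call(client, prompt):
    """Run one LLM call and record latency and cost alongside the output."""
    start = time.perf_counter()
    result = client.generate(prompt)  # assumed to expose .text, .input_tokens, .output_tokens
    # Measures total response time; time to first token would need a streaming client.
    latency = time.perf_counter() - start
    cost = (result.input_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (result.output_tokens / 1000) * OUTPUT_PRICE_PER_1K
    return result.text, {"latency_s": latency, "cost_usd": cost}

def summarize(per_call_metrics):
    """Aggregate per-call metrics into the latency and cost numbers above."""
    latencies = [m["latency_s"] for m in per_call_metrics]
    costs = [m["cost_usd"] for m in per_call_metrics]
    return {
        "p50_latency_s": float(np.percentile(latencies, 50)),
        "p95_latency_s": float(np.percentile(latencies, 95)),
        "p99_latency_s": float(np.percentile(latencies, 99)),
        "avg_cost_per_query_usd": sum(costs) / len(costs),
    }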
Automated Evaluation
import json

class LLMEvaluator:
    """Automated evaluation pipeline for LLM outputs."""

    def __init__(self, judge_llm):
        # Any client exposing generate(prompt) -> str can serve as the judge.
        self.judge_llm = judge_llm

    def evaluate_rag_response(self, query, context, response, ground_truth=None):
        """Evaluate a RAG system response."""
        results = {}
        # 1. Faithfulness: is the response grounded in the context?
        results["faithfulness"] = self.check_faithfulness(response, context)
        # 2. Relevance: does the response answer the question?
        # (check_relevance and check_correctness follow the same
        # LLM-as-judge pattern as check_faithfulness below.)
        results["relevance"] = self.check_relevance(query, response)
        # 3. Correctness: does the response match the ground truth?
        if ground_truth:
            results["correctness"] = self.check_correctness(response, ground_truth)
        # 4. Hallucination: claims not supported by the context
        results["hallucination_rate"] = self.detect_hallucinations(response, context)
        return results

    def check_faithfulness(self, response, context):
        """Use an LLM-as-judge to verify grounding."""
        prompt = f"""
        Given the context below, determine if every claim in the response
        is supported by the context. Return only a score between 0 and 1.

        Context: {context}
        Response: {response}

        Score (0 = completely unfaithful, 1 = fully grounded):
        """
        score = self.judge_llm.generate(prompt)
        return float(score.strip())

    def detect_hallucinations(self, response, context):
        """Identify claims in the response not supported by the context."""
        prompt = f"""
        Extract all factual claims from the response.
        For each claim, determine if it is supported by the context.
        Return a JSON list of objects with keys "claim" and "label",
        where label is SUPPORTED or HALLUCINATED.

        Context: {context}
        Response: {response}
        """
        claims = json.loads(self.judge_llm.generate(prompt))
        hallucinated = [c for c in claims if c["label"] == "HALLUCINATED"]
        return len(hallucinated) / max(len(claims), 1)
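A usage sketch: `JudgeClient` is a hypothetical wrapper around whichever model acts as the judge, and the `generate(prompt) -> str` interface is an assumption rather than any specific vendor SDK.

judge = JudgeClient(model="your-judge-model")  # hypothetical wrapper, not a real SDK
evaluator = LLMEvaluator(judge_llm=judge)

scores = evaluator.evaluate_rag_response(
    query="What is the refund policy for annual plans?",
    context="Annual plans have a 30-day money-back guarantee...",
    response="Annual plans can be refunded within 30 days of purchase.",
    ground_truth="Annual plans can be refunded within 30 days.",
)
# scores -> {"faithfulness": ..., "relevance": ..., "correctness": ..., "hallucination_rate": ...}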
Evaluation Dataset
# Build a golden evaluation dataset
eval_dataset = [
    {
        "query": "What is the refund policy for annual plans?",
        "context": "Annual plans have a 30-day money-back guarantee...",
        "ground_truth": "Annual plans can be refunded within 30 days.",
        "category": "billing",
        "difficulty": "easy",
    },
    {
        "query": "Can I upgrade mid-cycle?",
        "context": "Users can upgrade at any time. The price difference...",
        "ground_truth": "Yes, users can upgrade at any time with prorated billing.",
        "category": "billing",
        "difficulty": "medium",
    },
    # 200+ examples covering edge cases, adversarial queries, etc.
]

# Run evaluation
results = evaluator.evaluate_dataset(eval_dataset)
print(f"Faithfulness: {results.avg_faithfulness:.2f}")
print(f"Relevance: {results.avg_relevance:.2f}")
print(f"Correctness: {results.avg_correctness:.2f}")
print(f"Hallucination rate: {results.avg_hallucination:.2%}")
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Evaluate on cherry-picked examples | Model fails on edge cases | Diverse eval dataset (200+ examples) |
| Only measure accuracy | Accurate-looking but slow, expensive, or hallucinating model ships | Multi-dimensional evaluation |
| No regression testing | Model updates break existing features | Run eval suite on every model change (see sketch below) |
| Trust demo performance | Demo prompts != real user queries | Evaluate on production query distribution |
| No human evaluation | Automated metrics miss nuance | Regular human eval (weekly samples) |
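The "No regression testing" row maps directly onto a CI gate. A minimal sketch as a pytest-style test, reusing the evaluator and golden dataset from above; the thresholds are illustrative and should be set from your own baseline minus an acceptable tolerance.

# Illustrative thresholds; derive them from your current baseline scores.
THRESHOLDS = {
    "avg_faithfulness": 0.90,
    "avg_relevance": 0.85,
    "avg_hallucination": 0.05,  # upper bound, unlike the others
}

def test_no_eval_regression():
    """Run in CI on every model, prompt, or retrieval pipeline change."""
    evaluator = LLMEvaluator(judge_llm=JudgeClient())  # hypothetical judge wrapper
    results = evaluator.evaluate_dataset(eval_dataset)
    assert results.avg_faithfulness >= THRESHOLDS["avg_faithfulness"]
    assert results.avg_relevance >= THRESHOLDS["avg_relevance"]
    assert results.avg_hallucination <= THRESHOLDS["avg_hallucination"]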
LLM evaluation is not a one-time check — it is a continuous process. Every model change, every prompt update, every context pipeline modification should trigger your evaluation suite.