LLM Evaluation and Benchmarking
Systematically evaluate large language model performance for your use case. Covers evaluation frameworks, hallucination detection, human evaluation, automated metrics, A/B testing of LLMs, and the patterns that prevent shipping LLM features that look impressive in demos but fail in production.
An LLM that performs well on a demo prompt might hallucinate on real user queries. Evaluation is the bridge between impressive demos and reliable production features. Without systematic evaluation, you are shipping vibes — hoping the model works well enough.
Evaluation Dimensions
Correctness:
Does the output contain factually correct information?
Method: Compare against known-correct answers (ground truth)
Metric: Accuracy, F1 score, exact match
Faithfulness:
Is the output grounded in the provided context?
Method: Check if every claim can be attributed to source docs
Metric: Faithfulness score (0-1)
Critical for: RAG systems, Q&A, summarization
Relevance:
Does the output address the user's question?
Method: LLM-as-judge or human evaluation
Metric: Relevance score (1-5 scale)
Harmfulness:
Does the output contain harmful content?
Method: Safety classifiers, red-team testing
Metric: Harm rate, refusal rate
Fluency:
Is the output well-written and coherent?
Method: Perplexity, human evaluation
Metric: Fluency score, readability grade
Latency:
How fast is the response?
Method: Measure time to first token, total response time
Metric: P50, P95, P99 latency
Cost:
How much does each call cost?
Method: Track input/output tokens per call
Metric: Cost per query, cost per user session (see the sketch after this list)
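Latency and cost are the easiest of these dimensions to automate. The sketch below shows one way to capture them per call; everything in it is illustrative: `client.generate()` stands in for your model client and is assumed to report token counts, `timed_call` and `summarize` are hypothetical helpers, and the per-token prices are placeholders for your provider's actual rates.

import time

import numpy as np

# Placeholder prices; substitute your provider's actual per-token rates.
INPUT_PRICE_PER_1K = 0.003
OUTPUT_PRICE_PER_1K = 0.015

def timed_call(client, prompt):
    """Run one LLM call and record latency and cost alongside the output."""
    start = time.perf_counter()
    result = client.generate(prompt)  # assumed to expose .text, .input_tokens, .output_tokens
    # Measures total response time; time to first token would need a streaming client.
    latency = time.perf_counter() - start
    cost = (result.input_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (result.output_tokens / 1000) * OUTPUT_PRICE_PER_1K
    return result.text, {"latency_s": latency, "cost_usd": cost}

def summarize(per_call_metrics):
    """Aggregate per-call metrics into the latency and cost numbers above."""
    latencies = [m["latency_s"] for m in per_call_metrics]
    costs = [m["cost_usd"] for m in per_call_metrics]
    return {
        "p50_latency_s": float(np.percentile(latencies, 50)),
        "p95_latency_s": float(np.percentile(latencies, 95)),
        "p99_latency_s": float(np.percentile(latencies, 99)),
        "avg_cost_per_query_usd": sum(costs) / len(costs),
    }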
Automated Evaluation
import json

class LLMEvaluator:
    """Automated evaluation pipeline for LLM outputs."""

    def __init__(self, judge_llm):
        # Any client exposing generate(prompt) -> str can serve as the judge.
        self.judge_llm = judge_llm

    def evaluate_rag_response(self, query, context, response, ground_truth=None):
        """Evaluate a RAG system response."""
        results = {}
        # 1. Faithfulness: is the response grounded in the context?
        results["faithfulness"] = self.check_faithfulness(response, context)
        # 2. Relevance: does the response answer the question?
        # (check_relevance and check_correctness follow the same
        # LLM-as-judge pattern as check_faithfulness below.)
        results["relevance"] = self.check_relevance(query, response)
        # 3. Correctness: does the response match the ground truth?
        if ground_truth:
            results["correctness"] = self.check_correctness(response, ground_truth)
        # 4. Hallucination: claims not supported by the context
        results["hallucination_rate"] = self.detect_hallucinations(response, context)
        return results

    def check_faithfulness(self, response, context):
        """Use an LLM-as-judge to verify grounding."""
        prompt = f"""
        Given the context below, determine if every claim in the response
        is supported by the context. Return only a score between 0 and 1.

        Context: {context}
        Response: {response}

        Score (0 = completely unfaithful, 1 = fully grounded):
        """
        score = self.judge_llm.generate(prompt)
        return float(score.strip())

    def detect_hallucinations(self, response, context):
        """Identify claims in the response not supported by the context."""
        prompt = f"""
        Extract all factual claims from the response.
        For each claim, determine if it is supported by the context.
        Return a JSON list of objects with keys "claim" and "label",
        where label is SUPPORTED or HALLUCINATED.

        Context: {context}
        Response: {response}
        """
        claims = json.loads(self.judge_llm.generate(prompt))
        hallucinated = [c for c in claims if c["label"] == "HALLUCINATED"]
        return len(hallucinated) / max(len(claims), 1)
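A usage sketch: `JudgeClient` is a hypothetical wrapper around whichever model acts as the judge, and the `generate(prompt) -> str` interface is an assumption rather than any specific vendor SDK.

judge = JudgeClient(model="your-judge-model")  # hypothetical wrapper, not a real SDK
evaluator = LLMEvaluator(judge_llm=judge)

scores = evaluator.evaluate_rag_response(
    query="What is the refund policy for annual plans?",
    context="Annual plans have a 30-day money-back guarantee...",
    response="Annual plans can be refunded within 30 days of purchase.",
    ground_truth="Annual plans can be refunded within 30 days.",
)
# scores -> {"faithfulness": ..., "relevance": ..., "correctness": ..., "hallucination_rate": ...}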
Evaluation Dataset
# Build a golden evaluation dataset
eval_dataset = [
    {
        "query": "What is the refund policy for annual plans?",
        "context": "Annual plans have a 30-day money-back guarantee...",
        "ground_truth": "Annual plans can be refunded within 30 days.",
        "category": "billing",
        "difficulty": "easy",
    },
    {
        "query": "Can I upgrade mid-cycle?",
        "context": "Users can upgrade at any time. The price difference...",
        "ground_truth": "Yes, users can upgrade at any time with prorated billing.",
        "category": "billing",
        "difficulty": "medium",
    },
    # 200+ examples covering edge cases, adversarial queries, etc.
]

# Run evaluation
results = evaluator.evaluate_dataset(eval_dataset)
print(f"Faithfulness: {results.avg_faithfulness:.2f}")
print(f"Relevance: {results.avg_relevance:.2f}")
print(f"Correctness: {results.avg_correctness:.2f}")
print(f"Hallucination rate: {results.avg_hallucination:.2%}")
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Evaluate on cherry-picked examples | Model fails on edge cases | Diverse eval dataset (200+ examples) |
| Only measure accuracy | Accurate-looking but slow, expensive, or hallucinating model ships | Multi-dimensional evaluation |
| No regression testing | Model updates break existing features | Run eval suite on every model change (see sketch below) |
| Trust demo performance | Demo prompts != real user queries | Evaluate on production query distribution |
| No human evaluation | Automated metrics miss nuance | Regular human eval (weekly samples) |
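The "No regression testing" row maps directly onto a CI gate. A minimal sketch as a pytest-style test, reusing the evaluator and golden dataset from above; the thresholds are illustrative and should be set from your own baseline minus an acceptable tolerance.

# Illustrative thresholds; derive them from your current baseline scores.
THRESHOLDS = {
    "avg_faithfulness": 0.90,
    "avg_relevance": 0.85,
    "avg_hallucination": 0.05,  # upper bound, unlike the others
}

def test_no_eval_regression():
    """Run in CI on every model, prompt, or retrieval pipeline change."""
    evaluator = LLMEvaluator(judge_llm=JudgeClient())  # hypothetical judge wrapper
    results = evaluator.evaluate_dataset(eval_dataset)
    assert results.avg_faithfulness >= THRESHOLDS["avg_faithfulness"]
    assert results.avg_relevance >= THRESHOLDS["avg_relevance"]
    assert results.avg_hallucination <= THRESHOLDS["avg_hallucination"]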
LLM evaluation is not a one-time check — it is a continuous process. Every model change, every prompt update, every context pipeline modification should trigger your evaluation suite.