
LLM Evaluation and Benchmarking

Systematically evaluate large language model performance for your use case. Covers evaluation frameworks, hallucination detection, human evaluation, automated metrics, A/B testing LLMs, and the patterns that prevent shipping LLM features that look impressive in demos but fail in production.

An LLM that performs well on a demo prompt might hallucinate on real user queries. Evaluation is the bridge between impressive demos and reliable production features. Without systematic evaluation, you are shipping vibes — hoping the model works well enough.


Evaluation Dimensions

Correctness:
  Is the output factually correct?
  Method: Compare against known-correct answers (ground truth)
  Metric: Accuracy, F1 score, exact match

Faithfulness:
  Is the output grounded in the provided context?
  Method: Check if every claim can be attributed to source docs
  Metric: Faithfulness score (0-1)
  Critical for: RAG systems, Q&A, summarization

Relevance:
  Does the output address the user's question?
  Method: LLM-as-judge or human evaluation
  Metric: Relevance score (1-5 scale)

Harmfulness:
  Does the output contain harmful content?
  Method: Safety classifiers, red-team testing
  Metric: Harm rate, refusal rate

Fluency:
  Is the output well-written and coherent?
  Method: Perplexity, human evaluation
  Metric: Fluency score, readability grade

Latency:
  How fast is the response?
  Method: Measure time to first token, total response time
  Metric: P50, P95, P99 latency
  
Cost:
  How much does each call cost?
  Method: Track input/output tokens per call
  Metric: Cost per query, cost per user session
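
Latency and cost are the two dimensions you can measure without a judge model. Below is a minimal sketch of per-call measurement and percentile reporting; the streaming client (`client.stream`, `client.last_usage`) and the per-token prices are illustrative placeholders, so swap in your provider's actual API and rates:

import time
from statistics import quantiles

# Illustrative prices in USD per 1M tokens; use your provider's real rates.
PRICE_PER_M_INPUT = 3.00
PRICE_PER_M_OUTPUT = 15.00

def measure_call(client, prompt):
    """Time-to-first-token, total latency, and cost for one LLM call.

    `client.stream` and `client.last_usage` are hypothetical stand-ins for
    any streaming LLM API that yields chunks and reports token usage.
    """
    start = time.perf_counter()
    ttft = None
    for chunk in client.stream(prompt):
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
    total = time.perf_counter() - start
    usage = client.last_usage()
    cost = (usage.input_tokens * PRICE_PER_M_INPUT
            + usage.output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000
    return {"ttft_s": ttft, "total_s": total, "cost_usd": cost}

def latency_percentiles(samples):
    """P50/P95/P99 from a list of per-call latencies in seconds."""
    cuts = quantiles(samples, n=100)  # 99 cut points between percentiles
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}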

Automated Evaluation

import json


class LLMEvaluator:
    """Automated evaluation pipeline for LLM outputs."""

    def __init__(self, judge_llm):
        # judge_llm: any client exposing generate(prompt) -> str
        self.judge_llm = judge_llm

    def evaluate_rag_response(self, query, context, response, ground_truth=None):
        """Evaluate a RAG system response."""
        results = {}

        # 1. Faithfulness: Is the response grounded in the context?
        results["faithfulness"] = self.check_faithfulness(response, context)

        # 2. Relevance: Does the response answer the question?
        results["relevance"] = self.check_relevance(query, response)

        # 3. Correctness: Does the response match the ground truth?
        if ground_truth:
            results["correctness"] = self.check_correctness(response, ground_truth)

        # 4. Hallucination: fraction of claims not supported by the context
        results["hallucination_rate"] = self.detect_hallucinations(response, context)

        return results

    def check_faithfulness(self, response, context):
        """Use LLM-as-judge to verify grounding."""
        prompt = f"""
        Given the context below, determine if every claim in the response
        is supported by the context. Return only a number between 0 and 1.

        Context: {context}
        Response: {response}

        Score (0 = completely unfaithful, 1 = fully grounded):
        """
        # Assumes the judge returns a bare number; harden this parsing in production.
        return float(self.judge_llm.generate(prompt).strip())

    def check_relevance(self, query, response):
        """LLM-as-judge relevance rating on a 1-5 scale."""
        prompt = f"""
        Rate how well the response addresses the question on a 1-5 scale.
        Return only the number.

        Question: {query}
        Response: {response}
        """
        return int(self.judge_llm.generate(prompt).strip())

    def check_correctness(self, response, ground_truth):
        """Exact-match correctness; swap in F1 or semantic similarity as needed."""
        return float(response.strip().lower() == ground_truth.strip().lower())

    def detect_hallucinations(self, response, context):
        """Identify claims in the response not supported by the context."""
        prompt = f"""
        Extract all factual claims from the response.
        For each claim, determine if it is supported by the context.
        Return only a JSON array like [{{"claim": "...", "label": "SUPPORTED"}}],
        where each label is SUPPORTED or HALLUCINATED.

        Context: {context}
        Response: {response}
        """
        # The judge returns text, so parse it into structured claims.
        claims = json.loads(self.judge_llm.generate(prompt))
        hallucinated = [c for c in claims if c["label"] == "HALLUCINATED"]
        return len(hallucinated) / max(len(claims), 1)
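
To run the evaluator, wrap whichever model you use as the judge in a small client object. A hypothetical wiring; `my_judge_client` is a placeholder for any object exposing `generate(prompt) -> str`:

# `my_judge_client` is a placeholder judge client, not a real library object.
evaluator = LLMEvaluator(judge_llm=my_judge_client)

scores = evaluator.evaluate_rag_response(
    query="What is the refund policy for annual plans?",
    context="Annual plans have a 30-day money-back guarantee...",
    response="Annual plans can be refunded within 30 days of purchase.",
)
print(scores)  # per-dimension scores: faithfulness, relevance, hallucination_rate

Using a separate, stronger model as the judge avoids a model grading its own output, but judge scores drift too; spot-check them against human labels on a regular cadence.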

Evaluation Dataset

# Build a golden evaluation dataset
eval_dataset = [
    {
        "query": "What is the refund policy for annual plans?",
        "context": "Annual plans have a 30-day money-back guarantee...",
        "ground_truth": "Annual plans can be refunded within 30 days.",
        "category": "billing",
        "difficulty": "easy",
    },
    {
        "query": "Can I upgrade mid-cycle?",
        "context": "Users can upgrade at any time. The price difference...",
        "ground_truth": "Yes, users can upgrade at any time with prorated billing.",
        "category": "billing",
        "difficulty": "medium",
    },
    # 200+ examples covering edge cases, adversarial queries, etc.
]

# Run evaluation (evaluate_dataset, not shown, averages the per-response scores)
results = evaluator.evaluate_dataset(eval_dataset)
print(f"Faithfulness: {results.avg_faithfulness:.2f}")
print(f"Relevance:    {results.avg_relevance:.2f}")
print(f"Correctness:  {results.avg_correctness:.2f}")
print(f"Hallucination rate: {results.avg_hallucination:.2%}")

Anti-Patterns

Evaluate on cherry-picked examples:
  Consequence: Model fails on edge cases
  Fix: Diverse eval dataset (200+ examples)

Only measure accuracy:
  Consequence: An accurate but slow, expensive, or hallucinating model ships
  Fix: Multi-dimensional evaluation

No regression testing:
  Consequence: Model updates break existing features
  Fix: Run the eval suite on every model change

Trust demo performance:
  Consequence: Demo prompts != real user queries
  Fix: Evaluate on the production query distribution

No human evaluation:
  Consequence: Automated metrics miss nuance
  Fix: Regular human eval (weekly samples)

LLM evaluation is not a one-time check — it is a continuous process. Every model change, every prompt update, every context pipeline modification should trigger your evaluation suite.
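
One way to make that trigger concrete is a regression gate in CI: the eval suite runs on every change, and the build fails if any score falls below its baseline. A sketch using pytest; the module path and thresholds are illustrative:

# test_llm_regression.py (run in CI on every model, prompt, or pipeline change)
from my_eval_module import evaluator, eval_dataset  # placeholder module holding the objects above

# Illustrative floors/ceilings; set them just below your current baseline scores.
FAITHFULNESS_FLOOR = 0.90
HALLUCINATION_CEILING = 0.05

def test_no_quality_regression():
    results = evaluator.evaluate_dataset(eval_dataset)
    assert results.avg_faithfulness >= FAITHFULNESS_FLOOR, (
        f"Faithfulness regressed to {results.avg_faithfulness:.2f}"
    )
    assert results.avg_hallucination <= HALLUCINATION_CEILING, (
        f"Hallucination rate rose to {results.avg_hallucination:.2%}"
    )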

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
