LLM Guardrails and Safety
Implement safety guardrails for LLM-powered applications. Covers input validation, output filtering, content policies, jailbreak prevention, PII redaction, and the patterns that make LLM applications safe for production use.
LLMs are powerful but unpredictable. They can generate harmful content, leak training data, follow injected instructions, and produce plausible but dangerous misinformation. Guardrails are the engineering controls that constrain LLM behavior to keep applications safe, compliant, and trustworthy in production.
Threat Model
Input Threats:
Prompt injection: User input tricks the LLM into ignoring its system prompt
Jailbreaking: User crafts prompts that bypass the model's safety training
PII in prompts: User submits sensitive personal data to the LLM
Excessive context: User sends massive input to exhaust the context window or run up token costs
Output Threats:
Harmful content: Violence, hate, illegal advice
Hallucination: Confident but incorrect information
PII leakage: Model reveals personal data memorized from its training data
Code execution: Model generates dangerous code
Brand risk: Inappropriate or off-brand responses
System Threats:
Cost abuse: Adversary generates expensive API calls at scale
Data exfiltration: Model is manipulated into revealing the system prompt or other internal context
Recursive agents: Uncontrolled tool-use loops burn tokens indefinitely (see the resource-limit sketch below)
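Content filters do little against these system-level threats; they call for hard resource limits instead. Below is a rough sketch of a per-user spend cap and a tool-loop ceiling. SpendTracker, run_agent, the llm.next_step call, and the per-token prices are illustrative placeholders, not any framework's API.

MAX_DAILY_SPEND_USD = 10.0   # illustrative per-user budget
MAX_TOOL_ITERATIONS = 8      # hard ceiling on agent tool-use loops


class SpendTracker:
    """Tracks approximate per-user LLM spend; reset it on whatever window you enforce."""

    def __init__(self):
        self._spend: dict[str, float] = {}

    def add(self, user_id: str, prompt_tokens: int, completion_tokens: int) -> None:
        # Placeholder prices; substitute your provider's actual per-token rates.
        cost = prompt_tokens * 3e-6 + completion_tokens * 15e-6
        self._spend[user_id] = self._spend.get(user_id, 0.0) + cost

    def over_budget(self, user_id: str) -> bool:
        return self._spend.get(user_id, 0.0) >= MAX_DAILY_SPEND_USD


def run_agent(llm, tools, task, tracker: SpendTracker, user_id: str):
    """Agent loop with hard caps so tool use cannot recurse or spend without bound."""
    for _ in range(MAX_TOOL_ITERATIONS):
        if tracker.over_budget(user_id):
            raise RuntimeError("Per-user spend cap reached")
        step = llm.next_step(task)  # hypothetical agent-step API
        tracker.add(user_id, step.prompt_tokens, step.completion_tokens)
        if step.is_final:
            return step.answer
        task = tools[step.tool](step.arguments)  # feed the tool result back in
    raise RuntimeError("Tool loop exceeded iteration limit")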
Input Guardrails
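The code in this section and the next returns a GuardrailResult with ALLOW, WARN, and BLOCK outcomes. That type is not from any particular guardrails library; a minimal sketch that supports the calls used below could look like this:

from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"


@dataclass
class GuardrailResult:
    action: Action
    reason: str = ""
    data: dict = field(default_factory=dict)  # cleaned text, safety category, etc.

    # Uppercase constructors simply mirror how the examples below call them.
    @classmethod
    def ALLOW(cls, **data):
        return cls(Action.ALLOW, data=data)

    @classmethod
    def WARN(cls, reason: str, **data):
        return cls(Action.WARN, reason, data=data)

    @classmethod
    def BLOCK(cls, reason: str, **data):
        return cls(Action.BLOCK, reason, data=data)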
class InputGuardrails:
def validate_input(self, user_input: str) -> GuardrailResult:
"""Run all input checks before sending to LLM."""
# 1. Length check
if len(user_input) > 10_000:
return GuardrailResult.BLOCK("Input too long")
# 2. PII detection and redaction
pii_result = self.detect_pii(user_input)
if pii_result.has_pii:
user_input = self.redact_pii(user_input, pii_result)
# 3. Prompt injection detection
injection_score = self.detect_injection(user_input)
if injection_score > 0.8:
return GuardrailResult.BLOCK("Potential prompt injection detected")
# 4. Topic classification
topic = self.classify_topic(user_input)
if topic in self.blocked_topics:
return GuardrailResult.BLOCK(f"Topic not supported: {topic}")
return GuardrailResult.ALLOW(cleaned_input=user_input)
def detect_injection(self, text: str) -> float:
"""Score how likely the input contains a prompt injection."""
injection_patterns = [
"ignore previous instructions",
"ignore all instructions",
"you are now",
"forget everything",
"disregard the system prompt",
"act as if you are",
"pretend you are",
]
text_lower = text.lower()
        matches = sum(1 for p in injection_patterns if p in text_lower)
        # Any phrase-list hit is already a strong signal on its own; dividing
        # by len(injection_patterns) would almost never clear the 0.8 threshold.
        pattern_score = 1.0 if matches else 0.0
        # A trained classifier catches paraphrased and more sophisticated attacks.
        classifier_score = self.injection_classifier.predict(text)
        return max(pattern_score, classifier_score)
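Both guardrail classes call detect_pii and redact_pii, which the snippets leave undefined. Production systems usually delegate this to a dedicated detector such as Microsoft Presidio or a cloud DLP service; as a naive stand-in, a regex-based sketch covering a few common PII types:

import re
from dataclasses import dataclass

# Deliberately simple patterns; real detectors handle far more types and locales.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


@dataclass
class PIIResult:
    has_pii: bool
    spans: list  # (pii_type, start, end) tuples


def detect_pii(text: str) -> PIIResult:
    spans = [
        (pii_type, m.start(), m.end())
        for pii_type, pattern in PII_PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return PIIResult(has_pii=bool(spans), spans=spans)


def redact_pii(text: str, result: PIIResult) -> str:
    # Replace right-to-left so earlier character offsets stay valid.
    for pii_type, start, end in sorted(result.spans, key=lambda s: s[1], reverse=True):
        text = text[:start] + f"[{pii_type.upper()}_REDACTED]" + text[end:]
    return text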
Output Guardrails
class OutputGuardrails:
def validate_output(self, llm_response: str, context: dict) -> GuardrailResult:
"""Run all output checks before returning to user."""
# 1. Content safety classification
safety = self.content_classifier.classify(llm_response)
if safety.is_harmful:
return GuardrailResult.BLOCK(
"Response flagged as potentially harmful",
category=safety.category,
)
# 2. PII in output (model might leak training data)
pii_result = self.detect_pii(llm_response)
if pii_result.has_pii:
llm_response = self.redact_pii(llm_response, pii_result)
# 3. Factual grounding check (if RAG)
if context.get("source_documents"):
grounding = self.check_grounding(
response=llm_response,
sources=context["source_documents"],
)
if grounding.score < 0.5:
return GuardrailResult.WARN(
"Low confidence in factual accuracy",
response=llm_response + "\n\n⚠️ This response may not be fully accurate.",
)
# 4. Brand voice check
if not self.matches_brand_voice(llm_response, context.get("brand_guidelines")):
llm_response = self.adjust_tone(llm_response)
return GuardrailResult.ALLOW(response=llm_response)
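Wiring the two classes together: input guardrails run before the model call, output guardrails after, and the caller branches on the result's action. Here call_llm, the guardrail instances, and log_guardrail_event are stand-ins for your own client and logging (a logging sketch appears under Anti-Patterns below).

def handle_request(user_input: str, context: dict) -> str:
    input_result = input_guardrails.validate_input(user_input)
    if input_result.action is Action.BLOCK:
        log_guardrail_event("input_block", input_result.reason, user_input)
        return "Sorry, I can't help with that request."

    # The cleaned (possibly PII-redacted) input is what reaches the model.
    llm_response = call_llm(input_result.data["cleaned_input"], context)

    output_result = output_guardrails.validate_output(llm_response, context)
    if output_result.action is Action.BLOCK:
        log_guardrail_event("output_block", output_result.reason, llm_response)
        return "Sorry, I can't share that response."

    # ALLOW and WARN both carry a response; WARN's version includes the accuracy notice.
    return output_result.data["response"]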
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Rely only on LLM’s built-in safety | Jailbreaks bypass it | Layer external guardrails |
| Block without logging | Cannot improve, cannot audit | Log all blocks with reason and input (sketch below) |
| Same guardrails for all use cases | Over-blocking or under-blocking | Context-specific policies |
| No cost limits | Single user consumes $10K in API calls | Per-user rate limits and spend caps |
| No human review pipeline | Edge cases never improve | Flag uncertain cases for review |
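The block-without-logging row deserves its own sketch: every guardrail decision should produce a structured record with enough context to audit behavior and tune thresholds later. The field names here are illustrative:

import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("guardrails")


def log_guardrail_event(event: str, reason: str, text: str, **extra) -> None:
    """Emit one structured record per guardrail decision for audit and tuning."""
    logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": event,             # e.g. "input_block", "output_block", "warn"
        "reason": reason,
        "text_sample": text[:500],  # truncate; consider redacting PII before logging
        **extra,
    }))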
Guardrails are not about making LLMs perfect — they are about making them predictably imperfect. Constrain the failure modes, log the edge cases, and improve continuously.