LLM Guardrails and Safety

Implement safety guardrails for LLM-powered applications. Covers input validation, output filtering, content policies, jailbreak prevention, PII redaction, and the patterns that make LLM applications safe for production use.

LLMs are powerful but unpredictable. They can generate harmful content, leak training data, follow injected instructions, and produce plausible but dangerous misinformation. Guardrails are the engineering controls that constrain LLM behavior to keep applications safe, compliant, and trustworthy in production.


Threat Model

Input Threats:
  Prompt injection: User tricks LLM into ignoring system prompt
  Jailbreaking: User bypasses safety training
  PII in prompts: User submits sensitive data to LLM
  Excessive context: User sends massive input to exhaust tokens
  
Output Threats:
  Harmful content: Violence, hate, illegal advice
  Hallucination: Confident but incorrect information
  PII leakage: Model reveals training data
  Code execution: Model generates dangerous code
  Brand risk: Inappropriate or off-brand responses
  
System Threats:
  Cost abuse: Adversary generates expensive API calls
  Data exfiltration: Model used to extract system prompt
  Recursive agents: Uncontrolled tool use loops

Input Guardrails
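
The guardrail classes below all return a small result object that the original leaves undefined. Here is a minimal sketch of what such a GuardrailResult could look like; the field names and the ALLOW/WARN/BLOCK helpers are assumptions chosen to match the call sites in the code, not a fixed API.

from dataclasses import dataclass
from typing import Optional

@dataclass
class GuardrailResult:
    """Outcome of a guardrail check: allow, warn, or block (assumed shape)."""
    action: str                           # "allow", "warn", or "block"
    reason: Optional[str] = None          # why the request was warned or blocked
    cleaned_input: Optional[str] = None   # redacted/normalized user input
    response: Optional[str] = None        # possibly modified LLM response
    category: Optional[str] = None        # e.g. the harm category behind a block

    @classmethod
    def ALLOW(cls, cleaned_input=None, response=None):
        return cls(action="allow", cleaned_input=cleaned_input, response=response)

    @classmethod
    def WARN(cls, reason, response=None):
        return cls(action="warn", reason=reason, response=response)

    @classmethod
    def BLOCK(cls, reason, category=None):
        return cls(action="block", reason=reason, category=category)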

class InputGuardrails:
    def validate_input(self, user_input: str) -> GuardrailResult:
        """Run all input checks before sending to LLM."""
        
        # 1. Length check
        if len(user_input) > 10_000:
            return GuardrailResult.BLOCK("Input too long")
        
        # 2. PII detection and redaction
        pii_result = self.detect_pii(user_input)
        if pii_result.has_pii:
            user_input = self.redact_pii(user_input, pii_result)
        
        # 3. Prompt injection detection
        injection_score = self.detect_injection(user_input)
        if injection_score > 0.8:
            return GuardrailResult.BLOCK("Potential prompt injection detected")
        
        # 4. Topic classification
        topic = self.classify_topic(user_input)
        if topic in self.blocked_topics:
            return GuardrailResult.BLOCK(f"Topic not supported: {topic}")
        
        return GuardrailResult.ALLOW(cleaned_input=user_input)
    
    def detect_injection(self, text: str) -> float:
        """Score how likely the input contains a prompt injection."""
        injection_patterns = [
            "ignore previous instructions",
            "ignore all instructions",
            "you are now",
            "forget everything",
            "disregard the system prompt",
            "act as if you are",
            "pretend you are",
        ]
        
        text_lower = text.lower()
        pattern_hit = any(p in text_lower for p in injection_patterns)
        
        # Also use a trained classifier for sophisticated attacks
        classifier_score = self.injection_classifier.predict(text)
        
        # A literal pattern hit is a strong signal on its own; otherwise
        # fall back to the classifier score.
        return max(1.0 if pattern_hit else 0.0, classifier_score)
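
The detect_pii and redact_pii helpers are referenced above but never shown. Below is a minimal, regex-based sketch; the email, US phone, and SSN patterns are illustrative assumptions only, and production systems typically use a dedicated PII detection model or service with far broader coverage. In the class above these would be methods rather than module-level functions.

import re
from dataclasses import dataclass

# Illustrative patterns only -- real deployments need broader, locale-aware coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

@dataclass
class PIIResult:
    has_pii: bool
    spans: list  # list of (kind, start, end) tuples

def detect_pii(text: str) -> PIIResult:
    """Find PII spans with simple regexes (sketch, not production-grade)."""
    spans = []
    for kind, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((kind, match.start(), match.end()))
    return PIIResult(has_pii=bool(spans), spans=spans)

def redact_pii(text: str, result: PIIResult) -> str:
    """Replace each detected span with a typed placeholder like [EMAIL]."""
    # Redact right-to-left so earlier offsets stay valid after substitution.
    for kind, start, end in sorted(result.spans, key=lambda s: s[1], reverse=True):
        text = text[:start] + f"[{kind.upper()}]" + text[end:]
    return text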

Output Guardrails

class OutputGuardrails:
    def validate_output(self, llm_response: str, context: dict) -> GuardrailResult:
        """Run all output checks before returning to user."""
        
        # 1. Content safety classification
        safety = self.content_classifier.classify(llm_response)
        if safety.is_harmful:
            return GuardrailResult.BLOCK(
                "Response flagged as potentially harmful",
                category=safety.category,
            )
        
        # 2. PII in output (model might leak training data)
        pii_result = self.detect_pii(llm_response)
        if pii_result.has_pii:
            llm_response = self.redact_pii(llm_response, pii_result)
        
        # 3. Factual grounding check (if RAG)
        if context.get("source_documents"):
            grounding = self.check_grounding(
                response=llm_response,
                sources=context["source_documents"],
            )
            if grounding.score < 0.5:
                return GuardrailResult.WARN(
                    "Low confidence in factual accuracy",
                    response=llm_response + "\n\n⚠️ This response may not be fully accurate.",
                )
        
        # 4. Brand voice check
        if not self.matches_brand_voice(llm_response, context.get("brand_guidelines")):
            llm_response = self.adjust_tone(llm_response)
        
        return GuardrailResult.ALLOW(response=llm_response)
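
The check_grounding call above is likewise left undefined. A crude but self-contained sketch is a lexical-overlap heuristic: score what fraction of the response's content terms appear anywhere in the retrieved sources. Real systems usually use an NLI model or an LLM-as-judge instead; this overlap metric, the stopword list, and the 0.5 threshold used above are assumptions for illustration.

import re
from dataclasses import dataclass

@dataclass
class GroundingResult:
    score: float  # fraction of response content terms found in the sources

# Tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "that", "for", "on", "with"}

def _content_terms(text: str) -> set:
    """Lowercased alphanumeric tokens minus stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}

def check_grounding(response: str, sources: list) -> GroundingResult:
    """Naive lexical grounding: share of response terms that appear in the sources."""
    response_terms = _content_terms(response)
    if not response_terms:
        return GroundingResult(score=1.0)  # nothing substantive to verify
    source_terms = set()
    for doc in sources:
        source_terms |= _content_terms(doc)
    overlap = len(response_terms & source_terms) / len(response_terms)
    return GroundingResult(score=overlap)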

Anti-Patterns

Anti-Pattern                            | Consequence                             | Fix
Rely only on the LLM’s built-in safety  | Jailbreaks bypass it                    | Layer external guardrails
Block without logging                   | Cannot improve, cannot audit            | Log all blocks with reason and input
Same guardrails for all use cases       | Over-blocking or under-blocking         | Context-specific policies
No cost limits                          | Single user consumes $10K in API calls  | Per-user rate limits and spend caps (see the sketch below)
No human review pipeline                | Edge cases never improve                | Flag uncertain cases for review
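
The cost-limits row is straightforward to operationalize. Below is a minimal in-memory sketch of a per-user rate limit and daily spend cap; the limits, the SpendGuard name, and the omission of a daily reset are all simplifying assumptions, and a real deployment would back this with Redis or the billing system.

import time
from collections import defaultdict

# Illustrative limits -- tune per product and per user tier.
MAX_REQUESTS_PER_MINUTE = 20
MAX_DAILY_SPEND_USD = 25.00

class SpendGuard:
    """In-memory per-user rate limit and spend cap (sketch only)."""

    def __init__(self):
        self.request_times = defaultdict(list)   # user_id -> recent request timestamps
        self.daily_spend = defaultdict(float)    # user_id -> USD spent (daily reset omitted)

    def allow_request(self, user_id: str) -> bool:
        now = time.time()
        # Keep only requests from the last 60 seconds.
        recent = [t for t in self.request_times[user_id] if now - t < 60]
        self.request_times[user_id] = recent
        if len(recent) >= MAX_REQUESTS_PER_MINUTE:
            return False
        if self.daily_spend[user_id] >= MAX_DAILY_SPEND_USD:
            return False
        self.request_times[user_id].append(now)
        return True

    def record_cost(self, user_id: str, usd: float) -> None:
        """Call after each LLM response with the actual (or estimated) cost."""
        self.daily_spend[user_id] += usd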

Guardrails are not about making LLMs perfect — they are about making them predictably imperfect. Constrain the failure modes, log the edge cases, and improve continuously.
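
To make that concrete, the pieces above compose into a simple request pipeline: check the input, call the model, check the output, and log every block so the edge cases feed back into improvement. The call_llm function, the guardrail instances, and the logger setup below are placeholders consistent with the GuardrailResult sketch earlier, not part of the original.

import logging

logger = logging.getLogger("guardrails")

def handle_request(user_id: str, user_input: str, context: dict) -> str:
    """End-to-end sketch: input guardrails -> LLM -> output guardrails, with block logging."""
    input_check = input_guardrails.validate_input(user_input)
    if input_check.action == "block":
        logger.warning("input blocked", extra={"user": user_id, "why": input_check.reason})
        return "Sorry, I can't help with that request."

    llm_response = call_llm(input_check.cleaned_input)   # placeholder LLM call

    output_check = output_guardrails.validate_output(llm_response, context)
    if output_check.action == "block":
        logger.warning("output blocked", extra={"user": user_id, "why": output_check.reason})
        return "Sorry, I couldn't produce a safe answer for that."

    return output_check.response or llm_response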

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
