LLM Guardrails and Safety
Implement safety guardrails for LLM-powered applications. Covers input validation, output filtering, content policies, jailbreak prevention, PII redaction, and the patterns that make LLM applications safe for production use.
LLMs are powerful but unpredictable. They can generate harmful content, leak training data, follow injected instructions, and produce plausible but dangerous misinformation. Guardrails are the engineering controls that constrain LLM behavior to keep applications safe, compliant, and trustworthy in production.
Threat Model
Input Threats:
Prompt injection: User input tricks the LLM into ignoring its system prompt
Jailbreaking: User crafts prompts that bypass the model's safety training
PII in prompts: User submits sensitive personal data to the LLM
Excessive context: User sends massive input to exhaust the context window or run up token costs
Output Threats:
Harmful content: Violence, hate, illegal advice
Hallucination: Confident but incorrect information
PII leakage: Model reveals personal data memorized from its training data
Code execution: Model generates dangerous code
Brand risk: Inappropriate or off-brand responses
System Threats:
Cost abuse: Adversary generates expensive API calls at scale
Data exfiltration: Model is manipulated into revealing the system prompt or other internal context
Recursive agents: Uncontrolled tool-use loops burn tokens indefinitely (see the resource-limit sketch below)
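Content filters do little against these system-level threats; they call for hard resource limits instead. Below is a rough sketch of a per-user spend cap and a tool-loop ceiling. SpendTracker, run_agent, the llm.next_step call, and the per-token prices are illustrative placeholders, not any framework's API.

MAX_DAILY_SPEND_USD = 10.0   # illustrative per-user budget
MAX_TOOL_ITERATIONS = 8      # hard ceiling on agent tool-use loops


class SpendTracker:
    """Tracks approximate per-user LLM spend; reset it on whatever window you enforce."""

    def __init__(self):
        self._spend: dict[str, float] = {}

    def add(self, user_id: str, prompt_tokens: int, completion_tokens: int) -> None:
        # Placeholder prices; substitute your provider's actual per-token rates.
        cost = prompt_tokens * 3e-6 + completion_tokens * 15e-6
        self._spend[user_id] = self._spend.get(user_id, 0.0) + cost

    def over_budget(self, user_id: str) -> bool:
        return self._spend.get(user_id, 0.0) >= MAX_DAILY_SPEND_USD


def run_agent(llm, tools, task, tracker: SpendTracker, user_id: str):
    """Agent loop with hard caps so tool use cannot recurse or spend without bound."""
    for _ in range(MAX_TOOL_ITERATIONS):
        if tracker.over_budget(user_id):
            raise RuntimeError("Per-user spend cap reached")
        step = llm.next_step(task)  # hypothetical agent-step API
        tracker.add(user_id, step.prompt_tokens, step.completion_tokens)
        if step.is_final:
            return step.answer
        task = tools[step.tool](step.arguments)  # feed the tool result back in
    raise RuntimeError("Tool loop exceeded iteration limit")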
Input Guardrails
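The code in this section and the next returns a GuardrailResult with ALLOW, WARN, and BLOCK outcomes. That type is not from any particular guardrails library; a minimal sketch that supports the calls used below could look like this:

from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"


@dataclass
class GuardrailResult:
    action: Action
    reason: str = ""
    data: dict = field(default_factory=dict)  # cleaned text, safety category, etc.

    # Uppercase constructors simply mirror how the examples below call them.
    @classmethod
    def ALLOW(cls, **data):
        return cls(Action.ALLOW, data=data)

    @classmethod
    def WARN(cls, reason: str, **data):
        return cls(Action.WARN, reason, data=data)

    @classmethod
    def BLOCK(cls, reason: str, **data):
        return cls(Action.BLOCK, reason, data=data)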
class InputGuardrails:
def validate_input(self, user_input: str) -> GuardrailResult:
"""Run all input checks before sending to LLM."""
# 1. Length check
if len(user_input) > 10_000:
return GuardrailResult.BLOCK("Input too long")
# 2. PII detection and redaction
pii_result = self.detect_pii(user_input)
if pii_result.has_pii:
user_input = self.redact_pii(user_input, pii_result)
# 3. Prompt injection detection
injection_score = self.detect_injection(user_input)
if injection_score > 0.8:
return GuardrailResult.BLOCK("Potential prompt injection detected")
# 4. Topic classification
topic = self.classify_topic(user_input)
if topic in self.blocked_topics:
return GuardrailResult.BLOCK(f"Topic not supported: {topic}")
return GuardrailResult.ALLOW(cleaned_input=user_input)
def detect_injection(self, text: str) -> float:
"""Score how likely the input contains a prompt injection."""
injection_patterns = [
"ignore previous instructions",
"ignore all instructions",
"you are now",
"forget everything",
"disregard the system prompt",
"act as if you are",
"pretend you are",
]
text_lower = text.lower()
        matches = sum(1 for p in injection_patterns if p in text_lower)
        # Any phrase-list hit is already a strong signal on its own; dividing
        # by len(injection_patterns) would almost never clear the 0.8 threshold.
        pattern_score = 1.0 if matches else 0.0
        # A trained classifier catches paraphrased and more sophisticated attacks.
        classifier_score = self.injection_classifier.predict(text)
        return max(pattern_score, classifier_score)
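Both guardrail classes call detect_pii and redact_pii, which the snippets leave undefined. Production systems usually delegate this to a dedicated detector such as Microsoft Presidio or a cloud DLP service; as a naive stand-in, a regex-based sketch covering a few common PII types:

import re
from dataclasses import dataclass

# Deliberately simple patterns; real detectors handle far more types and locales.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


@dataclass
class PIIResult:
    has_pii: bool
    spans: list  # (pii_type, start, end) tuples


def detect_pii(text: str) -> PIIResult:
    spans = [
        (pii_type, m.start(), m.end())
        for pii_type, pattern in PII_PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return PIIResult(has_pii=bool(spans), spans=spans)


def redact_pii(text: str, result: PIIResult) -> str:
    # Replace right-to-left so earlier character offsets stay valid.
    for pii_type, start, end in sorted(result.spans, key=lambda s: s[1], reverse=True):
        text = text[:start] + f"[{pii_type.upper()}_REDACTED]" + text[end:]
    return text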
Output Guardrails
class OutputGuardrails:
def validate_output(self, llm_response: str, context: dict) -> GuardrailResult:
"""Run all output checks before returning to user."""
# 1. Content safety classification
safety = self.content_classifier.classify(llm_response)
if safety.is_harmful:
return GuardrailResult.BLOCK(
"Response flagged as potentially harmful",
category=safety.category,
)
# 2. PII in output (model might leak training data)
pii_result = self.detect_pii(llm_response)
if pii_result.has_pii:
llm_response = self.redact_pii(llm_response, pii_result)
# 3. Factual grounding check (if RAG)
if context.get("source_documents"):
grounding = self.check_grounding(
response=llm_response,
sources=context["source_documents"],
)
if grounding.score < 0.5:
return GuardrailResult.WARN(
"Low confidence in factual accuracy",
response=llm_response + "\n\n⚠️ This response may not be fully accurate.",
)
# 4. Brand voice check
if not self.matches_brand_voice(llm_response, context.get("brand_guidelines")):
llm_response = self.adjust_tone(llm_response)
return GuardrailResult.ALLOW(response=llm_response)
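Wiring the two classes together: input guardrails run before the model call, output guardrails after, and the caller branches on the result's action. Here call_llm, the guardrail instances, and log_guardrail_event are stand-ins for your own client and logging (a logging sketch appears under Anti-Patterns below).

def handle_request(user_input: str, context: dict) -> str:
    input_result = input_guardrails.validate_input(user_input)
    if input_result.action is Action.BLOCK:
        log_guardrail_event("input_block", input_result.reason, user_input)
        return "Sorry, I can't help with that request."

    # The cleaned (possibly PII-redacted) input is what reaches the model.
    llm_response = call_llm(input_result.data["cleaned_input"], context)

    output_result = output_guardrails.validate_output(llm_response, context)
    if output_result.action is Action.BLOCK:
        log_guardrail_event("output_block", output_result.reason, llm_response)
        return "Sorry, I can't share that response."

    # ALLOW and WARN both carry a response; WARN's version includes the accuracy notice.
    return output_result.data["response"]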
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Rely only on LLM’s built-in safety | Jailbreaks bypass it | Layer external guardrails |
| Block without logging | Cannot improve, cannot audit | Log all blocks with reason and input (sketch below) |
| Same guardrails for all use cases | Over-blocking or under-blocking | Context-specific policies |
| No cost limits | Single user consumes $10K in API calls | Per-user rate limits and spend caps |
| No human review pipeline | Edge cases never improve | Flag uncertain cases for review |
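The block-without-logging row deserves its own sketch: every guardrail decision should produce a structured record with enough context to audit behavior and tune thresholds later. The field names here are illustrative:

import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("guardrails")


def log_guardrail_event(event: str, reason: str, text: str, **extra) -> None:
    """Emit one structured record per guardrail decision for audit and tuning."""
    logger.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": event,             # e.g. "input_block", "output_block", "warn"
        "reason": reason,
        "text_sample": text[:500],  # truncate; consider redacting PII before logging
        **extra,
    }))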
Guardrails are not about making LLMs perfect — they are about making them predictably imperfect. Constrain the failure modes, log the edge cases, and improve continuously.