LLM Fine-Tuning vs RAG vs Prompt Engineering: Decision Guide
Choose the right approach for customizing large language models. Covers when to use fine-tuning, RAG, or prompt engineering, with cost analysis, implementation complexity, and decision framework.
You have proprietary data and you want an LLM to use it. Three approaches exist: prompt engineering (cheapest), RAG (most flexible), and fine-tuning (most powerful). Most teams default to fine-tuning because it sounds impressive. Most teams should start with RAG because it actually works for their use case. This guide helps you choose.
The Decision Matrix
| Factor | Prompt Engineering | RAG | Fine-Tuning |
|---|---|---|---|
| Implementation time | Hours | Days–Weeks | Weeks–Months |
| Cost | $0 (API usage only) | Low–Medium | High |
| Data required | 0 examples | Documents/knowledge base | 1,000–100,000+ examples |
| Customization depth | Behavior & format | Knowledge & context | Behavior, tone, & knowledge |
| Latency | Lowest | Medium (retrieval + generation) | Low (no retrieval step) |
| Maintenance | Minimal | Index updates | Retraining on new data |
| Hallucination risk | High | Low (grounded in sources) | Medium |
Start Here: Prompt Engineering
The simplest and most underestimated approach. Before building infrastructure, try these techniques:
System Prompts
system_prompt = """
You are a senior financial analyst at Garnet Grid Consulting.
You specialize in enterprise technology cost analysis.
Rules:
- Always cite specific numbers when discussing costs
- Use bullet points for comparisons
- If you don't know something, say "I don't have enough data"
- Never provide legal or compliance advice
- Format currency as USD with commas (e.g., $1,250,000)
"""
Few-Shot Examples
few_shot_prompt = """
Given a cloud infrastructure description, estimate monthly costs.
Example 1:
Input: "3 EC2 m5.xlarge instances running 24/7 in us-east-1"
Output: Estimated monthly cost: $462 ($154/instance × 3)
- On-demand: $462/mo
- 1-year reserved (no upfront): $296/mo (36% savings)
- 3-year reserved (all upfront): $190/mo (59% savings)
Example 2:
Input: "500GB S3 storage with 1M GET requests/month"
Output: Estimated monthly cost: $15.50
- Storage: $11.50 (500GB × $0.023/GB)
- Requests: $0.40 (1M × $0.0004/1000)
- Data transfer: ~$4 (estimate)
Now analyze:
Input: "{user_input}"
"""
Chain-of-Thought
cot_prompt = """
Analyze this architecture decision step by step:
Step 1: Identify the core requirements
Step 2: List the constraints (budget, timeline, team skills)
Step 3: Evaluate each option against requirements
Step 4: Identify risks for each option
Step 5: Make a recommendation with justification
Architecture decision: {user_input}
"""
When prompt engineering is enough:
- You need the LLM to follow a specific format or tone
- The task requires general knowledge (not proprietary data)
- You have < 20 “rules” the model needs to follow
- Response quality is acceptable with the right prompt structure
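These techniques stack. Below is a minimal sketch (the `build_messages` helper and the abbreviated prompt strings are illustrative, not a library API) of composing a system prompt, few-shot examples, and the live query into one chat-completion `messages` list:

```python
def build_messages(system_prompt, few_shot_pairs, user_input):
    """Assemble a chat request: system rules, worked examples, then the live query."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_input, example_output in few_shot_pairs:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_input})
    return messages

msgs = build_messages(
    "You are a senior financial analyst. Always cite specific numbers.",
    [("3 EC2 m5.xlarge instances running 24/7", "Estimated monthly cost: $462")],
    "500GB S3 storage with 1M GET requests/month",
)
# msgs is what you would pass as messages= to client.chat.completions.create(...)
```

Few-shot examples sent as real user/assistant turns often steer the model more reliably than the same examples pasted into one long prompt string.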
Level Up: RAG (Retrieval-Augmented Generation)
RAG connects your LLM to external knowledge. Instead of training the model on your data, you retrieve relevant documents at query time and include them in the context.
Architecture
```
User Query → Embedding Model → Vector Search → Top-K Documents
                                                      ↓
                                  LLM Prompt + Retrieved Context
                                                      ↓
                                               Generated Answer
```
Implementation
```python
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")

# 1. Embed the query
query = "What is our SLA for enterprise customers?"
query_embedding = client.embeddings.create(
    input=query, model="text-embedding-3-small"
).data[0].embedding

# 2. Search vector database for relevant documents
index = pc.Index("company-docs")
results = index.query(vector=query_embedding, top_k=5, include_metadata=True)

# 3. Build context from retrieved documents
context = "\n\n".join([
    f"Source: {r.metadata['source']}\n{r.metadata['text']}"
    for r in results.matches
])

# 4. Generate answer grounded in retrieved context
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": """Answer based ONLY on the provided context.
If the context doesn't contain the answer, say 'Not found in documentation.'
Always cite the source document."""},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ],
)
```
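The query example assumes documents were already embedded and indexed. Here is a dependency-free sketch of the ingestion side, where `to_upsert_payload` is a hypothetical helper that shapes records the way Pinecone's `upsert` expects; in production the placeholder embedding would come from `client.embeddings.create` and the payload would go to `index.upsert`:

```python
def to_upsert_payload(chunks, embeddings):
    """Pair each chunk with its embedding in Pinecone's upsert record shape."""
    return [
        {"id": c["id"], "values": vec,
         "metadata": {"source": c["source"], "text": c["text"]}}
        for c, vec in zip(chunks, embeddings)
    ]

chunks = [{"id": "doc1-0", "source": "sla.md",
           "text": "Enterprise SLA is 99.9% uptime."}]
fake_embeddings = [[0.1, 0.2, 0.3]]  # stand-in for text-embedding-3-small output
payload = to_upsert_payload(chunks, fake_embeddings)
# index.upsert(vectors=payload) would complete the pipeline
```

Storing the chunk text and source in metadata is what lets the query side build a cited context without a second lookup.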
Chunking Strategies
| Strategy | Chunk Size | Best For |
|---|---|---|
| Fixed-size | 500–1000 tokens | General documents |
| Paragraph-based | Variable | Well-structured docs |
| Semantic | Variable | Mixed-format content |
| Recursive character | 500–1500 chars | Code + docs |
| Sentence window | 3–5 sentences | Precise Q&A |
```python
# Recursive character splitting with overlap
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document_text)
```
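For comparison, the table's first strategy, fixed-size chunking with overlap, can be sketched without any library. This character-based version (`chunk_text` is illustrative; a production splitter would count tokens and respect separators) shows how `chunk_size` and `overlap` interact:

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Slide a chunk_size window over the text, stepping by chunk_size - overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

text = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(text)
# 2500 chars with step 800 → 3 chunks; each chunk repeats the previous
# chunk's last 200 chars, so facts near a boundary survive in one piece
```

The overlap deliberately duplicates content: without it, a sentence split across two chunks may never be retrieved whole.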
When to Use RAG
- Your knowledge base changes frequently (docs, policies, FAQs)
- You need answers grounded in specific source documents
- You need citation/attribution for compliance
- Your dataset is too large for the context window
- You want to avoid fine-tuning costs and complexity
Full Power: Fine-Tuning
Fine-tuning modifies the model’s weights by training on your domain-specific data. This changes what the model knows and how it behaves.
When Fine-Tuning Is Worth It
| Scenario | Why Fine-Tuning Helps |
|---|---|
| Consistent output format | Train on 1000+ examples of desired format |
| Domain-specific terminology | Model learns jargon without prompt explanation |
| Classification tasks | Model learns your specific taxonomy |
| Style/tone matching | Replicate a specific writing voice |
| Reducing prompt length | Knowledge is in weights, not the prompt |
| Edge cases | Handle nuanced scenarios prompt engineering can’t |
Data Preparation
{"messages": [
{"role": "system", "content": "You are Garnet Grid's architecture assessment AI."},
{"role": "user", "content": "Assess this architecture: monolithic Java app, Oracle DB, on-prem VMware"},
{"role": "assistant", "content": "**Architecture Risk Score: 7/10 (High)**\n\n**Risks:**\n- Vendor lock-in: Oracle licensing + VMware dependency\n- Scalability: Monolithic deployment limits horizontal scaling\n- Talent: Shrinking Java/Oracle talent pool\n\n**Recommendations:**\n1. Strangler fig migration to microservices (6-12 months)\n2. Database: Migrate to PostgreSQL (3-month effort)\n3. Infrastructure: Containerize with Docker, deploy to AKS\n\n**Estimated migration cost:** $350K-$500K over 12 months"}
]}
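Malformed records fail the upload or, worse, quietly degrade training, so it pays to validate the JSONL before submitting it. A stdlib-only sketch (`validate_jsonl` is a hypothetical helper; the checks are common-sense ones, not an exhaustive spec):

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_jsonl(lines):
    """Return human-readable problems; an empty list means the data looks OK."""
    problems = []
    for n, line in enumerate(lines, 1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {n}: not valid JSON")
            continue
        messages = record.get("messages")
        if not isinstance(messages, list) or not messages:
            problems.append(f"line {n}: missing 'messages' list")
            continue
        for m in messages:
            if m.get("role") not in VALID_ROLES:
                problems.append(f"line {n}: bad role {m.get('role')!r}")
        if messages[-1].get("role") != "assistant":
            problems.append(f"line {n}: last message must be from the assistant")
    return problems

good = ('{"messages": [{"role": "user", "content": "hi"}, '
        '{"role": "assistant", "content": "hello"}]}')
assert validate_jsonl([good]) == []
```

Each example must end with an assistant message, since that final turn is the target the model is trained to reproduce.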
Fine-Tuning Process
```python
from openai import OpenAI

client = OpenAI()

# 1. Upload training data
file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune"
)

# 2. Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 4,
        "learning_rate_multiplier": 1.0
    }
)

# 3. Monitor progress
status = client.fine_tuning.jobs.retrieve(job.id)
print(f"Status: {status.status}")  # queued → running → succeeded

# 4. Use fine-tuned model
response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:garnet-grid:arch-assess:abc123",
    messages=[{"role": "user", "content": "Assess this architecture: ..."}]
)
```
Fine-Tuning Cost Analysis
| Model | Training Cost | Inference Cost (input / output) | Minimum Examples |
|---|---|---|---|
| GPT-4o mini | $0.30/M tokens | $0.60/$2.40 per M tokens | 10 (50+ recommended) |
| GPT-4o | $2.50/M tokens | $3.75/$15 per M tokens | 10 (50+ recommended) |
| Claude (via AWS) | Varies | Varies | Custom |
| Open-source (Llama 3) | GPU cost only | Self-hosted cost | 1,000+ |
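Training cost scales with dataset tokens and epochs. A back-of-the-envelope calculation using the table's GPT-4o mini training rate and the 3 epochs configured in the job above (the dataset size is an assumed example):

```python
def training_cost_usd(dataset_tokens, price_per_m_tokens, epochs):
    """Each epoch reprocesses the full dataset, so billed tokens scale with epochs."""
    return dataset_tokens * epochs * price_per_m_tokens / 1_000_000

# 500 examples at ~800 tokens each = 400K tokens per epoch
cost = training_cost_usd(500 * 800, price_per_m_tokens=0.30, epochs=3)
# 1.2M billed tokens at $0.30/M = $0.36
```

At these rates, training a small model is cheap; the real costs are data preparation, evaluation, and the ongoing inference premium.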
The Hybrid Approach
Production systems often combine all three:
```
┌──────────────────────────────────────────┐
│ User Query                               │
├──────────────────────────────────────────┤
│ 1. Prompt Engineering                    │
│    - System prompt sets behavior/format  │
│    - Few-shot examples for edge cases    │
├──────────────────────────────────────────┤
│ 2. RAG                                   │
│    - Retrieve relevant documents         │
│    - Inject into context                 │
├──────────────────────────────────────────┤
│ 3. Fine-Tuned Model                      │
│    - Domain-specific weights             │
│    - Consistent output style             │
├──────────────────────────────────────────┤
│ Response                                 │
└──────────────────────────────────────────┘
```
Decision Flowchart
```
Do you need the model to use proprietary/current data?
├── No → Prompt Engineering
└── Yes
    ├── Does the data change frequently?
    │   ├── Yes → RAG
    │   └── No → Fine-Tuning OR RAG
    ├── Do you need citations/sources?
    │   └── Yes → RAG
    ├── Do you need a specific output style/format?
    │   └── Yes → Fine-Tuning + RAG
    └── Budget < $5K?
        └── Yes → RAG (fine-tuning training costs add up)
```
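The flowchart can be transcribed as a function for illustration (a sketch only; real decisions weigh several branches at once rather than returning at the first match):

```python
def choose_approach(needs_proprietary_data, data_changes_often,
                    needs_citations, needs_custom_style, budget_usd):
    """Follow the flowchart's branches in order and return the suggested approach."""
    if not needs_proprietary_data:
        return "prompt engineering"
    if needs_custom_style and budget_usd >= 5_000:
        return "fine-tuning + RAG"
    if data_changes_often or needs_citations or budget_usd < 5_000:
        return "RAG"
    return "fine-tuning or RAG"

assert choose_approach(False, False, False, False, 1_000) == "prompt engineering"
assert choose_approach(True, True, False, False, 2_000) == "RAG"
```

Encoding the decision as code also gives you something to revisit when constraints change, e.g. when budget or data freshness requirements shift.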
Cost Comparison: Fine-Tuning vs RAG
| Factor | Fine-Tuning | RAG |
|---|---|---|
| Upfront cost | $500-$50K (training compute) | $100-$1K (embedding + vectorDB setup) |
| Per-query cost | Lower (smaller model possible) | Higher (embedding + retrieval + generation) |
| Data update cost | Full retrain ($$$) | Re-embed changed documents ($) |
| Infrastructure | GPU for training + serving | Vector DB + embedding API + LLM API |
| Time to first result | Days-weeks (data prep + training + eval) | Hours-days (chunk + embed + prompt) |
| Maintenance | Periodic retraining | Index refresh pipeline |
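The upfront-vs-per-query trade-off in the table implies a break-even query volume. A sketch with illustrative dollar figures (all assumptions for the arithmetic, not quoted prices):

```python
def breakeven_queries(ft_upfront, ft_per_query, rag_upfront, rag_per_query):
    """Query volume beyond which fine-tuning's total cost undercuts RAG's."""
    return (ft_upfront - rag_upfront) / (rag_per_query - ft_per_query)

# Assumed: $5,000 training vs $500 RAG setup; $0.002/query vs $0.005/query
q = breakeven_queries(5_000, 0.002, 500, 0.005)
# $4,500 extra upfront / $0.003 saved per query = 1.5M queries
```

Unless your traffic clears that volume before the next retraining cycle, RAG's lower upfront cost usually wins on economics alone.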
Hybrid Approach: When to Combine Both
The most sophisticated deployments combine fine-tuning AND RAG:
- Fine-tune for style and domain vocabulary — Train the model to speak in your industry language, use your formatting conventions, and match your organization tone
- RAG for factual grounding — Retrieve specific facts, policies, and data from your knowledge base at query time
- Result: A model that sounds like your organization AND references accurate, current information
This hybrid approach is ideal for large enterprises with specific terminology (legal, medical, financial) where both accuracy and tone matter.
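A minimal sketch of what the hybrid request looks like in code: retrieved documents are injected as context into a prompt sent to the fine-tuned model (`hybrid_request` is an illustrative helper; the model id is the placeholder from the fine-tuning example above):

```python
def hybrid_request(fine_tuned_model, retrieved_docs, question):
    """Shape a chat.completions.create payload: fine-tuned weights + RAG grounding."""
    context = "\n\n".join(retrieved_docs)
    return {
        "model": fine_tuned_model,  # style and vocabulary come from fine-tuning
        "messages": [
            {"role": "system", "content": "Answer using ONLY the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    }

req = hybrid_request(
    "ft:gpt-4o-mini-2024-07-18:garnet-grid:arch-assess:abc123",
    ["SLA: 99.9% uptime for enterprise tier."],
    "What is our enterprise SLA?",
)
# client.chat.completions.create(**req) would execute it
```

The only change from plain RAG is the `model` field, which is why teams often add fine-tuning last, after the retrieval pipeline is already proven.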
Implementation Checklist
- Started with prompt engineering to establish baseline quality
- Evaluated whether RAG solves the problem before fine-tuning
- Chunking strategy tested (size, overlap, metadata)
- Vector database selected and indexed
- Retrieval quality measured (precision@k, recall@k)
- Fine-tuning data prepared (if needed) — minimum 50 high-quality examples
- Evaluation pipeline built (automated + human review)
- Cost per query calculated for production volume
- Monitoring for answer quality degradation
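Measuring the retrieval-quality item above can start as simply as precision@k: the fraction of the top-k retrieved chunks that a human judged relevant (the document ids below are made-up examples):

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    return sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids) / k

p = precision_at_k(["d1", "d7", "d3", "d9", "d2"],
                   relevant_ids={"d1", "d2", "d3"}, k=5)
# 3 of the top 5 are relevant → 0.6
```

Run this over a few dozen labeled queries whenever you change the chunking or embedding model; a drop here shows up as hallucinations downstream.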
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For AI strategy consulting, visit garnetgrid.com.
:::