LLM Fine-Tuning vs RAG vs Prompt Engineering: Decision Guide
Choose the right approach for customizing large language models. Covers when to use fine-tuning, RAG, or prompt engineering, with cost analysis, implementation complexity, and decision framework.
You have proprietary data and you want an LLM to use it. Three approaches exist: prompt engineering (cheapest), RAG (most flexible), and fine-tuning (most powerful). Most teams default to fine-tuning because it sounds impressive. Most teams should start with RAG because it actually works for their use case. This guide helps you choose.
The Decision Matrix
| Factor | Prompt Engineering | RAG | Fine-Tuning |
|---|---|---|---|
| Implementation time | Hours | Days–Weeks | Weeks–Months |
| Cost | $0 (API usage only) | Low–Medium | High |
| Data required | 0 examples | Documents/knowledge base | 1,000–100,000+ examples |
| Customization depth | Behavior & format | Knowledge & context | Behavior, tone, & knowledge |
| Latency | Lowest | Medium (retrieval + generation) | Low (no retrieval step) |
| Maintenance | Minimal | Index updates | Retraining on new data |
| Hallucination risk | High | Low (grounded in sources) | Medium |
Start Here: Prompt Engineering
The simplest and most underestimated approach. Before building infrastructure, try these techniques:
System Prompts
system_prompt = """
You are a senior financial analyst at Garnet Grid Consulting.
You specialize in enterprise technology cost analysis.
Rules:
- Always cite specific numbers when discussing costs
- Use bullet points for comparisons
- If you don't know something, say "I don't have enough data"
- Never provide legal or compliance advice
- Format currency as USD with commas (e.g., $1,250,000)
"""
Few-Shot Examples
few_shot_prompt = """
Given a cloud infrastructure description, estimate monthly costs.
Example 1:
Input: "3 EC2 m5.xlarge instances running 24/7 in us-east-1"
Output: Estimated monthly cost: $462 ($154/instance × 3)
- On-demand: $462/mo
- 1-year reserved (no upfront): $296/mo (36% savings)
- 3-year reserved (all upfront): $190/mo (59% savings)
Example 2:
Input: "500GB S3 storage with 1M GET requests/month"
Output: Estimated monthly cost: $15.50
- Storage: $11.50 (500GB × $0.023/GB)
- Requests: $0.40 (1M × $0.0004/1000)
- Data transfer: ~$4 (estimate)
Now analyze:
Input: "{user_input}"
"""
Chain-of-Thought
cot_prompt = """
Analyze this architecture decision step by step:
Step 1: Identify the core requirements
Step 2: List the constraints (budget, timeline, team skills)
Step 3: Evaluate each option against requirements
Step 4: Identify risks for each option
Step 5: Make a recommendation with justification
Architecture decision: {user_input}
"""
When prompt engineering is enough:
- You need the LLM to follow a specific format or tone
- The task requires general knowledge (not proprietary data)
- You have < 20 “rules” the model needs to follow
- Response quality is acceptable with the right prompt structure
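These techniques stack. Below is a minimal sketch (the `build_messages` helper and the abbreviated prompt strings are illustrative, not a library API) of composing a system prompt, few-shot examples, and the live query into one chat-completion `messages` list:

```python
def build_messages(system_prompt, few_shot_pairs, user_input):
    """Assemble a chat request: system rules, worked examples, then the live query."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_input, example_output in few_shot_pairs:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_input})
    return messages

msgs = build_messages(
    "You are a senior financial analyst. Always cite specific numbers.",
    [("3 EC2 m5.xlarge instances running 24/7", "Estimated monthly cost: $462")],
    "500GB S3 storage with 1M GET requests/month",
)
# msgs is what you would pass as messages= to client.chat.completions.create(...)
```

Few-shot examples sent as real user/assistant turns often steer the model more reliably than the same examples pasted into one long prompt string.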
Level Up: RAG (Retrieval-Augmented Generation)
RAG connects your LLM to external knowledge. Instead of training the model on your data, you retrieve relevant documents at query time and include them in the context.
Architecture
```
User Query → Embedding Model → Vector Search → Top-K Documents
                                                      ↓
                                  LLM Prompt + Retrieved Context
                                                      ↓
                                               Generated Answer
```
Implementation
```python
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")

# 1. Embed the query
query = "What is our SLA for enterprise customers?"
query_embedding = client.embeddings.create(
    input=query, model="text-embedding-3-small"
).data[0].embedding

# 2. Search vector database for relevant documents
index = pc.Index("company-docs")
results = index.query(vector=query_embedding, top_k=5, include_metadata=True)

# 3. Build context from retrieved documents
context = "\n\n".join([
    f"Source: {r.metadata['source']}\n{r.metadata['text']}"
    for r in results.matches
])

# 4. Generate answer grounded in retrieved context
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": """Answer based ONLY on the provided context.
If the context doesn't contain the answer, say 'Not found in documentation.'
Always cite the source document."""},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ],
)
```
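The query example assumes documents were already embedded and indexed. Here is a dependency-free sketch of the ingestion side, where `to_upsert_payload` is a hypothetical helper that shapes records the way Pinecone's `upsert` expects; in production the placeholder embedding would come from `client.embeddings.create` and the payload would go to `index.upsert`:

```python
def to_upsert_payload(chunks, embeddings):
    """Pair each chunk with its embedding in Pinecone's upsert record shape."""
    return [
        {"id": c["id"], "values": vec,
         "metadata": {"source": c["source"], "text": c["text"]}}
        for c, vec in zip(chunks, embeddings)
    ]

chunks = [{"id": "doc1-0", "source": "sla.md",
           "text": "Enterprise SLA is 99.9% uptime."}]
fake_embeddings = [[0.1, 0.2, 0.3]]  # stand-in for text-embedding-3-small output
payload = to_upsert_payload(chunks, fake_embeddings)
# index.upsert(vectors=payload) would complete the pipeline
```

Storing the chunk text and source in metadata is what lets the query side build a cited context without a second lookup.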
Chunking Strategies
| Strategy | Chunk Size | Best For |
|---|---|---|
| Fixed-size | 500–1000 tokens | General documents |
| Paragraph-based | Variable | Well-structured docs |
| Semantic | Variable | Mixed-format content |
| Recursive character | 500–1500 chars | Code + docs |
| Sentence window | 3–5 sentences | Precise Q&A |
```python
# Recursive character splitting with overlap
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document_text)
```
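For comparison, the table's first strategy, fixed-size chunking with overlap, can be sketched without any library. This character-based version (`chunk_text` is illustrative; a production splitter would count tokens and respect separators) shows how `chunk_size` and `overlap` interact:

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Slide a chunk_size window over the text, stepping by chunk_size - overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

text = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(text)
# 2500 chars with step 800 → 3 chunks; each chunk repeats the previous
# chunk's last 200 chars, so facts near a boundary survive in one piece
```

The overlap deliberately duplicates content: without it, a sentence split across two chunks may never be retrieved whole.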
When to Use RAG
- Your knowledge base changes frequently (docs, policies, FAQs)
- You need answers grounded in specific source documents
- You need citation/attribution for compliance
- Your dataset is too large for the context window
- You want to avoid fine-tuning costs and complexity
Full Power: Fine-Tuning
Fine-tuning modifies the model’s weights by training on your domain-specific data. This changes what the model knows and how it behaves.
When Fine-Tuning Is Worth It
| Scenario | Why Fine-Tuning Helps |
|---|---|
| Consistent output format | Train on 1000+ examples of desired format |
| Domain-specific terminology | Model learns jargon without prompt explanation |
| Classification tasks | Model learns your specific taxonomy |
| Style/tone matching | Replicate a specific writing voice |
| Reducing prompt length | Knowledge is in weights, not the prompt |
| Edge cases | Handle nuanced scenarios prompt engineering can’t |
Data Preparation
{"messages": [
{"role": "system", "content": "You are Garnet Grid's architecture assessment AI."},
{"role": "user", "content": "Assess this architecture: monolithic Java app, Oracle DB, on-prem VMware"},
{"role": "assistant", "content": "**Architecture Risk Score: 7/10 (High)**\n\n**Risks:**\n- Vendor lock-in: Oracle licensing + VMware dependency\n- Scalability: Monolithic deployment limits horizontal scaling\n- Talent: Shrinking Java/Oracle talent pool\n\n**Recommendations:**\n1. Strangler fig migration to microservices (6-12 months)\n2. Database: Migrate to PostgreSQL (3-month effort)\n3. Infrastructure: Containerize with Docker, deploy to AKS\n\n**Estimated migration cost:** $350K-$500K over 12 months"}
]}
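Malformed records fail the upload or, worse, quietly degrade training, so it pays to validate the JSONL before submitting it. A stdlib-only sketch (`validate_jsonl` is a hypothetical helper; the checks are common-sense ones, not an exhaustive spec):

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_jsonl(lines):
    """Return human-readable problems; an empty list means the data looks OK."""
    problems = []
    for n, line in enumerate(lines, 1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {n}: not valid JSON")
            continue
        messages = record.get("messages")
        if not isinstance(messages, list) or not messages:
            problems.append(f"line {n}: missing 'messages' list")
            continue
        for m in messages:
            if m.get("role") not in VALID_ROLES:
                problems.append(f"line {n}: bad role {m.get('role')!r}")
        if messages[-1].get("role") != "assistant":
            problems.append(f"line {n}: last message must be from the assistant")
    return problems

good = ('{"messages": [{"role": "user", "content": "hi"}, '
        '{"role": "assistant", "content": "hello"}]}')
assert validate_jsonl([good]) == []
```

Each example must end with an assistant message, since that final turn is the target the model is trained to reproduce.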
Fine-Tuning Process
```python
from openai import OpenAI

client = OpenAI()

# 1. Upload training data
file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune"
)

# 2. Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 4,
        "learning_rate_multiplier": 1.0
    }
)

# 3. Monitor progress
status = client.fine_tuning.jobs.retrieve(job.id)
print(f"Status: {status.status}")  # queued → running → succeeded

# 4. Use fine-tuned model
response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:garnet-grid:arch-assess:abc123",
    messages=[{"role": "user", "content": "Assess this architecture: ..."}]
)
```
Fine-Tuning Cost Analysis
| Model | Training Cost | Inference Cost (input / output) | Minimum Examples |
|---|---|---|---|
| GPT-4o mini | $0.30/M tokens | $0.60/$2.40 per M tokens | 10 (50+ recommended) |
| GPT-4o | $2.50/M tokens | $3.75/$15 per M tokens | 10 (50+ recommended) |
| Claude (via AWS) | Varies | Varies | Custom |
| Open-source (Llama 3) | GPU cost only | Self-hosted cost | 1,000+ |
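Training cost scales with dataset tokens and epochs. A back-of-the-envelope calculation using the table's GPT-4o mini training rate and the 3 epochs configured in the job above (the dataset size is an assumed example):

```python
def training_cost_usd(dataset_tokens, price_per_m_tokens, epochs):
    """Each epoch reprocesses the full dataset, so billed tokens scale with epochs."""
    return dataset_tokens * epochs * price_per_m_tokens / 1_000_000

# 500 examples at ~800 tokens each = 400K tokens per epoch
cost = training_cost_usd(500 * 800, price_per_m_tokens=0.30, epochs=3)
# 1.2M billed tokens at $0.30/M = $0.36
```

At these rates, training a small model is cheap; the real costs are data preparation, evaluation, and the ongoing inference premium.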
The Hybrid Approach
Production systems often combine all three:
```
┌──────────────────────────────────────────┐
│ User Query                               │
├──────────────────────────────────────────┤
│ 1. Prompt Engineering                    │
│    - System prompt sets behavior/format  │
│    - Few-shot examples for edge cases    │
├──────────────────────────────────────────┤
│ 2. RAG                                   │
│    - Retrieve relevant documents         │
│    - Inject into context                 │
├──────────────────────────────────────────┤
│ 3. Fine-Tuned Model                      │
│    - Domain-specific weights             │
│    - Consistent output style             │
├──────────────────────────────────────────┤
│ Response                                 │
└──────────────────────────────────────────┘
```
Decision Flowchart
```
Do you need the model to use proprietary/current data?
├── No → Prompt Engineering
└── Yes
    ├── Does the data change frequently?
    │   ├── Yes → RAG
    │   └── No → Fine-Tuning OR RAG
    ├── Do you need citations/sources?
    │   └── Yes → RAG
    ├── Do you need a specific output style/format?
    │   └── Yes → Fine-Tuning + RAG
    └── Budget < $5K?
        └── Yes → RAG (fine-tuning training costs add up)
```
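The flowchart can be transcribed as a function for illustration (a sketch only; real decisions weigh several branches at once rather than returning at the first match):

```python
def choose_approach(needs_proprietary_data, data_changes_often,
                    needs_citations, needs_custom_style, budget_usd):
    """Follow the flowchart's branches in order and return the suggested approach."""
    if not needs_proprietary_data:
        return "prompt engineering"
    if needs_custom_style and budget_usd >= 5_000:
        return "fine-tuning + RAG"
    if data_changes_often or needs_citations or budget_usd < 5_000:
        return "RAG"
    return "fine-tuning or RAG"

assert choose_approach(False, False, False, False, 1_000) == "prompt engineering"
assert choose_approach(True, True, False, False, 2_000) == "RAG"
```

Encoding the decision as code also gives you something to revisit when constraints change, e.g. when budget or data freshness requirements shift.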
Cost Comparison: Fine-Tuning vs RAG
| Factor | Fine-Tuning | RAG |
|---|---|---|
| Upfront cost | $500-$50K (training compute) | $100-$1K (embedding + vectorDB setup) |
| Per-query cost | Lower (smaller model possible) | Higher (embedding + retrieval + generation) |
| Data update cost | Full retrain ($$$) | Re-embed changed documents ($) |
| Infrastructure | GPU for training + serving | Vector DB + embedding API + LLM API |
| Time to first result | Days-weeks (data prep + training + eval) | Hours-days (chunk + embed + prompt) |
| Maintenance | Periodic retraining | Index refresh pipeline |
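The upfront-vs-per-query trade-off in the table implies a break-even query volume. A sketch with illustrative dollar figures (all assumptions for the arithmetic, not quoted prices):

```python
def breakeven_queries(ft_upfront, ft_per_query, rag_upfront, rag_per_query):
    """Query volume beyond which fine-tuning's total cost undercuts RAG's."""
    return (ft_upfront - rag_upfront) / (rag_per_query - ft_per_query)

# Assumed: $5,000 training vs $500 RAG setup; $0.002/query vs $0.005/query
q = breakeven_queries(5_000, 0.002, 500, 0.005)
# $4,500 extra upfront / $0.003 saved per query = 1.5M queries
```

Unless your traffic clears that volume before the next retraining cycle, RAG's lower upfront cost usually wins on economics alone.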
Hybrid Approach: When to Combine Both
The most sophisticated deployments combine fine-tuning AND RAG:
- Fine-tune for style and domain vocabulary — Train the model to speak in your industry language, use your formatting conventions, and match your organization tone
- RAG for factual grounding — Retrieve specific facts, policies, and data from your knowledge base at query time
- Result: A model that sounds like your organization AND references accurate, current information
This hybrid approach is ideal for large enterprises with specific terminology (legal, medical, financial) where both accuracy and tone matter.
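A minimal sketch of what the hybrid request looks like in code: retrieved documents are injected as context into a prompt sent to the fine-tuned model (`hybrid_request` is an illustrative helper; the model id is the placeholder from the fine-tuning example above):

```python
def hybrid_request(fine_tuned_model, retrieved_docs, question):
    """Shape a chat.completions.create payload: fine-tuned weights + RAG grounding."""
    context = "\n\n".join(retrieved_docs)
    return {
        "model": fine_tuned_model,  # style and vocabulary come from fine-tuning
        "messages": [
            {"role": "system", "content": "Answer using ONLY the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    }

req = hybrid_request(
    "ft:gpt-4o-mini-2024-07-18:garnet-grid:arch-assess:abc123",
    ["SLA: 99.9% uptime for enterprise tier."],
    "What is our enterprise SLA?",
)
# client.chat.completions.create(**req) would execute it
```

The only change from plain RAG is the `model` field, which is why teams often add fine-tuning last, after the retrieval pipeline is already proven.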
Implementation Checklist
- Started with prompt engineering to establish baseline quality
- Evaluated whether RAG solves the problem before fine-tuning
- Chunking strategy tested (size, overlap, metadata)
- Vector database selected and indexed
- Retrieval quality measured (precision@k, recall@k)
- Fine-tuning data prepared (if needed) — minimum 50 high-quality examples
- Evaluation pipeline built (automated + human review)
- Cost per query calculated for production volume
- Monitoring for answer quality degradation
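Measuring the retrieval-quality item above can start as simply as precision@k: the fraction of the top-k retrieved chunks that a human judged relevant (the document ids below are made-up examples):

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    return sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids) / k

p = precision_at_k(["d1", "d7", "d3", "d9", "d2"],
                   relevant_ids={"d1", "d2", "d3"}, k=5)
# 3 of the top 5 are relevant → 0.6
```

Run this over a few dozen labeled queries whenever you change the chunking or embedding model; a drop here shows up as hallucinations downstream.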
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For AI strategy consulting, visit garnetgrid.com.
:::