
# LLM Fine-Tuning vs RAG vs Prompt Engineering: Decision Guide

Choose the right approach for customizing large language models. Covers when to use fine-tuning, RAG, or prompt engineering, with cost analysis, implementation complexity, and decision framework.

You have proprietary data and you want an LLM to use it. Three approaches exist: prompt engineering (cheapest), RAG (most flexible), and fine-tuning (most powerful). Most teams default to fine-tuning because it sounds impressive. Most teams should start with RAG because it actually works for their use case. This guide helps you choose.


## The Decision Matrix

| Factor | Prompt Engineering | RAG | Fine-Tuning |
|---|---|---|---|
| Implementation time | Hours | Days–Weeks | Weeks–Months |
| Cost | $0 (API usage only) | Low–Medium | High |
| Data required | 0 examples | Documents/knowledge base | 1,000–100,000+ examples |
| Customization depth | Behavior & format | Knowledge & context | Behavior, tone, & knowledge |
| Latency | Lowest | Medium (retrieval + generation) | Low (no retrieval step) |
| Maintenance | Minimal | Index updates | Retraining on new data |
| Hallucination risk | High | Low (grounded in sources) | Medium |

## Start Here: Prompt Engineering

The simplest and most underestimated approach. Before building infrastructure, try these techniques:

### System Prompts

```python
system_prompt = """
You are a senior financial analyst at Garnet Grid Consulting.
You specialize in enterprise technology cost analysis.

Rules:
- Always cite specific numbers when discussing costs
- Use bullet points for comparisons
- If you don't know something, say "I don't have enough data"
- Never provide legal or compliance advice
- Format currency as USD with commas (e.g., $1,250,000)
"""
```

### Few-Shot Examples

```python
few_shot_prompt = """
Given a cloud infrastructure description, estimate monthly costs.

Example 1:
Input: "3 EC2 m5.xlarge instances running 24/7 in us-east-1"
Output: Estimated monthly cost: $462 ($154/instance × 3)
- On-demand: $462/mo
- 1-year reserved (no upfront): $296/mo (36% savings)
- 3-year reserved (all upfront): $190/mo (59% savings)

Example 2:
Input: "500GB S3 storage with 1M GET requests/month"
Output: Estimated monthly cost: $15.50
- Storage: $11.50 (500GB × $0.023/GB)
- Requests: $0.40 (1M × $0.0004/1000)
- Data transfer: ~$4 (estimate)

Now analyze:
Input: "{user_input}"
"""
```

### Chain-of-Thought

```python
cot_prompt = """
Analyze this architecture decision step by step:

Step 1: Identify the core requirements
Step 2: List the constraints (budget, timeline, team skills)
Step 3: Evaluate each option against requirements
Step 4: Identify risks for each option
Step 5: Make a recommendation with justification

Architecture decision: {user_input}
"""
```

When prompt engineering is enough:

- You need the LLM to follow a specific format or tone
- The task requires general knowledge (not proprietary data)
- You have < 20 “rules” the model needs to follow
- Response quality is acceptable with the right prompt structure

## Level Up: RAG (Retrieval-Augmented Generation)

RAG connects your LLM to external knowledge. Instead of training the model on your data, you retrieve relevant documents at query time and include them in the context.

### Architecture

```
User Query → Embedding Model → Vector Search → Top-K Documents
                            ↓
              LLM Prompt + Retrieved Context
                            ↓
                    Generated Answer
```

### Implementation

```python
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()
pc = Pinecone()  # reads PINECONE_API_KEY from the environment

# 1. Embed the query
query = "What is our SLA for enterprise customers?"
query_embedding = client.embeddings.create(
    input=query, model="text-embedding-3-small"
).data[0].embedding

# 2. Search the vector database for relevant documents
index = pc.Index("company-docs")
results = index.query(vector=query_embedding, top_k=5, include_metadata=True)

# 3. Build context from the retrieved documents
context = "\n\n".join(
    f"Source: {match.metadata['source']}\n{match.metadata['text']}"
    for match in results.matches
)

# 4. Generate an answer grounded in the retrieved context
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": (
            "Answer based ONLY on the provided context. "
            "If the context doesn't contain the answer, say 'Not found in documentation.' "
            "Always cite the source document."
        )},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ],
)
```

### Chunking Strategies

| Strategy | Chunk Size | Best For |
|---|---|---|
| Fixed-size | 500–1000 tokens | General documents |
| Paragraph-based | Variable | Well-structured docs |
| Semantic | Variable | Mixed-format content |
| Recursive character | 500–1500 chars | Code + docs |
| Sentence window | 3–5 sentences | Precise Q&A |

```python
# Recursive character splitting with overlap
# (in newer LangChain versions, import from langchain_text_splitters instead)
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(document_text)
```
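The retrieval code earlier assumes the documents were already embedded and upserted. A sketch of that indexing side, with the embedding call abstracted behind `embed_fn` so the record-building logic stands alone; the `chunks_to_records` helper, the SHA-1 id scheme, and the commented Pinecone calls are our assumptions:

```python
import hashlib

def chunks_to_records(chunks: list[str], source: str, embed_fn) -> list[dict]:
    """Turn text chunks into Pinecone-style records: id, vector, metadata.

    embed_fn maps a list of texts to a list of vectors -- in production it
    would wrap client.embeddings.create(model="text-embedding-3-small", ...).
    """
    vectors = embed_fn(chunks)
    return [
        {
            # Deterministic id so re-indexing the same chunk overwrites, not duplicates
            "id": hashlib.sha1(f"{source}:{i}".encode()).hexdigest(),
            "values": vec,
            "metadata": {"source": source, "text": chunk},
        }
        for i, (chunk, vec) in enumerate(zip(chunks, vectors))
    ]

# Assumed production usage, matching the query code above:
# index = pc.Index("company-docs")
# index.upsert(vectors=chunks_to_records(chunks, "sla_policy.md", embed_batch))
```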

### When to Use RAG

- Your knowledge base changes frequently (docs, policies, FAQs)
- You need answers grounded in specific source documents
- You need citation/attribution for compliance
- Your dataset is too large for the context window
- You want to avoid fine-tuning costs and complexity

## Full Power: Fine-Tuning

Fine-tuning modifies the model’s weights by training on your domain-specific data. This changes what the model knows and how it behaves.

### When Fine-Tuning Is Worth It

| Scenario | Why Fine-Tuning Helps |
|---|---|
| Consistent output format | Train on 1,000+ examples of the desired format |
| Domain-specific terminology | Model learns jargon without prompt explanation |
| Classification tasks | Model learns your specific taxonomy |
| Style/tone matching | Replicate a specific writing voice |
| Reducing prompt length | Knowledge lives in the weights, not the prompt |
| Edge cases | Handle nuanced scenarios prompt engineering can’t |

### Data Preparation

{"messages": [
  {"role": "system", "content": "You are Garnet Grid's architecture assessment AI."},
  {"role": "user", "content": "Assess this architecture: monolithic Java app, Oracle DB, on-prem VMware"},
  {"role": "assistant", "content": "**Architecture Risk Score: 7/10 (High)**\n\n**Risks:**\n- Vendor lock-in: Oracle licensing + VMware dependency\n- Scalability: Monolithic deployment limits horizontal scaling\n- Talent: Shrinking Java/Oracle talent pool\n\n**Recommendations:**\n1. Strangler fig migration to microservices (6-12 months)\n2. Database: Migrate to PostgreSQL (3-month effort)\n3. Infrastructure: Containerize with Docker, deploy to AKS\n\n**Estimated migration cost:** $350K-$500K over 12 months"}
]}
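Note that in the actual `.jsonl` file each record must sit on a single line; the example above is pretty-printed for readability. Malformed rows fail fine-tuning jobs late and expensively, so a pre-upload check is cheap insurance. A minimal validator sketch (the function name and error messages are ours):

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_training_file(path: str) -> list[str]:
    """Return a list of problems found in a chat-format fine-tuning JSONL file."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                errors.append(f"line {n}: not valid JSON")
                continue
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                errors.append(f"line {n}: missing non-empty 'messages' list")
                continue
            roles = [m.get("role") for m in messages if isinstance(m, dict)]
            if len(roles) != len(messages) or not set(roles) <= VALID_ROLES:
                errors.append(f"line {n}: message with missing or unknown role")
            if "assistant" not in roles:
                errors.append(f"line {n}: no assistant message to learn from")
    return errors
```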

### Fine-Tuning Process

```python
from openai import OpenAI
client = OpenAI()

# 1. Upload training data
file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune"
)

# 2. Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 4,
        "learning_rate_multiplier": 1.0
    }
)

# 3. Monitor progress
status = client.fine_tuning.jobs.retrieve(job.id)
print(f"Status: {status.status}")  # queued → running → succeeded

# 4. Use fine-tuned model
response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:garnet-grid:arch-assess:abc123",
    messages=[{"role": "user", "content": "Assess this architecture: ..."}]
)
```

### Fine-Tuning Cost Analysis

| Model | Training Cost | Inference Cost (input/output) | Minimum Examples |
|---|---|---|---|
| GPT-4o mini | $0.30/M tokens | $0.60/$2.40 per M tokens | 10 (50+ recommended) |
| GPT-4o | $2.50/M tokens | $3.75/$15 per M tokens | 10 (50+ recommended) |
| Claude (via AWS) | Varies | Varies | Custom |
| Open-source (Llama 3) | GPU cost only | Self-hosted cost | 1,000+ |

## The Hybrid Approach

Production systems often combine all three:

```
┌─────────────────────────────────────────┐
│            User Query                    │
├─────────────────────────────────────────┤
│  1. Prompt Engineering                   │
│     - System prompt sets behavior/format │
│     - Few-shot examples for edge cases   │
├─────────────────────────────────────────┤
│  2. RAG                                  │
│     - Retrieve relevant documents        │
│     - Inject into context                │
├─────────────────────────────────────────┤
│  3. Fine-Tuned Model                     │
│     - Domain-specific weights            │
│     - Consistent output style            │
├─────────────────────────────────────────┤
│            Response                      │
└─────────────────────────────────────────┘
```

## Decision Flowchart

```
Do you need the model to use proprietary/current data?
├── No → Prompt Engineering
└── Yes
    ├── Does the data change frequently?
    │   ├── Yes → RAG
    │   └── No → Fine-Tuning OR RAG
    ├── Do you need citations/sources?
    │   └── Yes → RAG
    ├── Do you need a specific output style/format?
    │   └── Yes → Fine-Tuning + RAG
    └── Budget < $5K?
        └── Yes → RAG (fine-tuning training costs add up)
```
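The flowchart collapses into a small helper for a first-pass recommendation. A sketch: the $5K threshold comes from the chart, but the tie-breaking order between branches is our judgment call:

```python
def recommend_approach(
    needs_proprietary_data: bool,
    data_changes_frequently: bool = False,
    needs_citations: bool = False,
    needs_specific_style: bool = False,
    budget_usd: float = 100_000.0,
) -> str:
    """Rough first-pass recommendation following the decision flowchart."""
    if not needs_proprietary_data:
        return "Prompt Engineering"
    if needs_specific_style and budget_usd >= 5_000:
        return "Fine-Tuning + RAG"
    if data_changes_frequently or needs_citations or budget_usd < 5_000:
        return "RAG"
    return "Fine-Tuning or RAG"
```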

## Cost Comparison: Fine-Tuning vs RAG

| Factor | Fine-Tuning | RAG |
|---|---|---|
| Upfront cost | $500–$50K (training compute) | $100–$1K (embedding + vector DB setup) |
| Per-query cost | Lower (smaller model possible) | Higher (embedding + retrieval + generation) |
| Data update cost | Full retrain ($$$) | Re-embed changed documents ($) |
| Infrastructure | GPU for training + serving | Vector DB + embedding API + LLM API |
| Time to first result | Days–weeks (data prep + training + eval) | Hours–days (chunk + embed + prompt) |
| Maintenance | Periodic retraining | Index refresh pipeline |

## Hybrid Approach: When to Combine Both

The most sophisticated deployments combine fine-tuning AND RAG:

  1. Fine-tune for style and domain vocabulary — Train the model to speak in your industry's language, use your formatting conventions, and match your organization's tone
  2. RAG for factual grounding — Retrieve specific facts, policies, and data from your knowledge base at query time
  3. Result: A model that sounds like your organization AND references accurate, current information

This hybrid approach is ideal for large enterprises with specific terminology (legal, medical, financial) where both accuracy and tone matter.
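A sketch of what the combined call looks like: the fine-tuned model id carries the style, the retrieved context carries the facts. The `hybrid_request` helper, system message, and placeholder model id are illustrative assumptions:

```python
def hybrid_request(ft_model: str, retrieved_context: str, question: str) -> dict:
    """Assemble chat-completion arguments pairing a fine-tuned model with RAG context."""
    return {
        "model": ft_model,  # e.g. "ft:gpt-4o-mini-2024-07-18:org:suffix:id" (placeholder)
        "messages": [
            {"role": "system", "content": "Ground every claim in the provided context."},
            {"role": "user", "content": f"Context:\n{retrieved_context}\n\nQuestion: {question}"},
        ],
    }

# Assumed usage:
# response = client.chat.completions.create(**hybrid_request(ft_model_id, context, query))
```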


## Implementation Checklist

- Started with prompt engineering to establish baseline quality
- Evaluated whether RAG solves the problem before fine-tuning
- Chunking strategy tested (size, overlap, metadata)
- Vector database selected and indexed
- Retrieval quality measured (precision@k, recall@k)
- Fine-tuning data prepared (if needed) — minimum 50 high-quality examples
- Evaluation pipeline built (automated + human review)
- Cost per query calculated for production volume
- Monitoring in place for answer-quality degradation
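The retrieval-quality item above can be measured with two small functions over labeled query/document pairs. A minimal sketch; production evaluation harnesses typically add ranking-aware metrics such as MRR and nDCG:

```python
def precision_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of the top-k retrieved documents that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant_ids) / len(top_k)

def recall_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc in retrieved_ids[:k] if doc in relevant_ids)
    return hits / len(relevant_ids)
```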

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For AI strategy consulting, visit garnetgrid.com.
:::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
