Fine-tuning is the most misunderstood concept in applied AI. Teams rush to fine-tune when prompt engineering would work better, or they fine-tune on irrelevant data and get worse results than the base model. This guide covers when fine-tuning actually makes sense and how to do it well.
## When to Fine-Tune vs Prompt Engineer
| Scenario | Approach | Why |
|---|---|---|
| Follow specific output format | Prompt engineer | Faster, cheaper, adjustable |
| Use domain knowledge | RAG (Retrieval Augmented Generation) | Knowledge can be updated without retraining |
| Match specific writing style/tone | Fine-tune | Style is hard to specify in prompts |
| Reduce token usage (shorter prompts) | Fine-tune | Model learns task, needs less instruction |
| Handle proprietary vocabulary | Fine-tune | Domain terms not in base model |
| Improve speed (fewer prompt tokens) | Fine-tune | Shorter prompts = faster inference |
| General question answering | Don’t fine-tune | Base models are already excellent |
## Fine-Tuning Decision Tree

```text
Does the task require specific domain knowledge?
├── Yes → Can the knowledge be retrieved at inference time?
│   ├── Yes → Use RAG (retrieval augmented generation)
│   └── No  → Consider fine-tuning (or RAG + fine-tune)
│
└── No → Is the issue output format or style?
    ├── Yes → Can it be solved with better prompts?
    │   ├── Yes → Prompt engineering (cheaper, faster)
    │   └── No  → Fine-tune for style/format
    │
    └── No → Is it a latency/cost issue?
        ├── Yes → Fine-tune a smaller model to replace the large one
        └── No  → Base model is probably sufficient
```
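The decision tree above can be encoded as a small helper function for team checklists or tooling. This is an illustrative sketch; the function name and flags are not from any library:

```python
def choose_approach(needs_domain_knowledge: bool,
                    knowledge_retrievable: bool = False,
                    format_or_style_issue: bool = False,
                    solvable_with_prompts: bool = False,
                    latency_or_cost_issue: bool = False) -> str:
    """Mirror the decision tree: return a recommended approach."""
    if needs_domain_knowledge:
        # Retrievable knowledge belongs in RAG, not in model weights
        return "RAG" if knowledge_retrievable else "fine-tune (or RAG + fine-tune)"
    if format_or_style_issue:
        # Prompting is cheaper and adjustable; fine-tune only when prompts fail
        return "prompt engineering" if solvable_with_prompts else "fine-tune"
    if latency_or_cost_issue:
        return "fine-tune a smaller model"
    return "base model"

print(choose_approach(True, knowledge_retrievable=True))   # → RAG
print(choose_approach(False, latency_or_cost_issue=True))  # → fine-tune a smaller model
```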
## LoRA / QLoRA
| Method | GPU Memory | Training Time | Quality | Cost |
|---|---|---|---|---|
| Full fine-tuning | Very high (80GB+) | Long | Highest | $$$$ |
| LoRA (Low-Rank Adaptation) | Medium (24GB) | Medium | Near-full | $$ |
| QLoRA (Quantized LoRA) | Low (16GB) | Medium | Good | $ |
| Prompt tuning | Very low | Fast | Moderate | $ |
```python
from peft import LoraConfig, get_peft_model

# LoRA configuration
lora_config = LoraConfig(
    r=16,                     # Rank (lower = fewer params, faster)
    lora_alpha=32,            # Scaling factor
    target_modules=[          # Which layers to adapt
        "q_proj", "v_proj", "k_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA to the base model (base weights stay frozen)
model = get_peft_model(base_model, lora_config)
print(f"Trainable params: {model.num_parameters(only_trainable=True):,}")
# Typically well under 1% of total parameters
```
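The parameter savings come from LoRA's low-rank factorization: instead of updating a full d×d weight matrix, it trains two thin matrices A (r×d) and B (d×r) and adds their product to the frozen weights. A back-of-the-envelope check (the hidden size below is illustrative, roughly that of a 7B-class model):

```python
d = 4096   # hidden size of one projection layer (illustrative)
r = 16     # LoRA rank, matching the config above

full_params = d * d       # updating one full d x d projection
lora_params = 2 * d * r   # A (r x d) plus B (d x r)

print(f"full: {full_params:,}, LoRA: {lora_params:,}, "
      f"ratio: {lora_params / full_params:.2%}")
# → full: 16,777,216, LoRA: 131,072, ratio: 0.78%
```

This is why LoRA fits in ~24GB where full fine-tuning needs 80GB+: optimizer state and gradients are only kept for the small adapter matrices.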
## Training Data Preparation
| Quality Factor | Bad | Good |
|---|---|---|
| Volume | 10 examples | 500-5,000 examples |
| Diversity | Same type of example repeated | Varied examples covering edge cases |
| Quality | Unreviewed, noisy data | Human-reviewed, consistent quality |
| Format | Inconsistent formatting | Standardized instruction-response pairs |
| Balance | 90% one category | Proportional representation |
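Standardized instruction-response pairs are usually stored as JSONL, one example per line. A minimal validation pass catches the formatting problems above before training; the `instruction`/`response` field names follow a common convention and should be adjusted to your trainer's expected schema:

```python
import json

REQUIRED_KEYS = {"instruction", "response"}

def validate_examples(jsonl_text: str) -> list[dict]:
    """Parse JSONL training data, rejecting malformed or incomplete rows."""
    examples = []
    for i, line in enumerate(jsonl_text.strip().splitlines(), start=1):
        record = json.loads(line)  # raises on invalid JSON
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {i}: missing fields {missing}")
        if not all(record[k].strip() for k in REQUIRED_KEYS):
            raise ValueError(f"line {i}: empty field")
        examples.append(record)
    return examples

data = "\n".join([
    json.dumps({"instruction": "Summarize: ...", "response": "A short summary."}),
    json.dumps({"instruction": "Translate: ...", "response": "Une traduction."}),
])
print(len(validate_examples(data)))  # → 2
```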
## Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Fine-tune for general knowledge | Model still hallucinates, waste of compute | RAG for factual knowledge, fine-tune for behavior |
| Tiny training set (< 100 examples) | Overfitting, no improvement | Minimum 500 high-quality examples |
| No evaluation methodology | Can’t tell if fine-tuned model is better | Eval set, automated metrics, human eval |
| Fine-tune largest model | Expensive, diminishing returns | Start with smallest model that works |
| No comparison to base + prompt | Don’t know if fine-tuning was even needed | Always benchmark: base + prompt vs fine-tuned |
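The last anti-pattern is worth operationalizing: score both variants on the same held-out eval set before shipping anything. A sketch with stubbed model calls; the placeholder models and the exact-match metric stand in for your real inference calls and eval harness:

```python
def exact_match_score(model_fn, eval_set) -> float:
    """Fraction of eval examples the model answers exactly right."""
    hits = sum(model_fn(ex["input"]) == ex["expected"] for ex in eval_set)
    return hits / len(eval_set)

# Placeholder models: swap in real base-model-with-prompt and fine-tuned calls.
def base_plus_prompt(x): return x.upper()
def fine_tuned(x):       return x.upper() if x != "b" else "?"

eval_set = [{"input": s, "expected": s.upper()} for s in "abc"]

base_score = exact_match_score(base_plus_prompt, eval_set)
ft_score = exact_match_score(fine_tuned, eval_set)
print(f"base+prompt: {base_score:.2f}  fine-tuned: {ft_score:.2f}")
# Only ship the fine-tune if it beats base+prompt by a meaningful margin.
```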
## Checklist

- Benchmarked base model + prompt engineering before committing to fine-tuning
- Ruled out RAG for knowledge that can be retrieved at inference time
- Collected 500+ diverse, human-reviewed examples in a consistent format
- Defined an eval set and metrics before training
- Started with the smallest model that could plausibly work
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For AI/ML consulting, visit garnetgrid.com.
:::
Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting
Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.