LLM Fine-Tuning
Fine-tune large language models for domain-specific tasks. Covers full fine-tuning, LoRA, QLoRA, dataset preparation, evaluation, deployment, and the patterns that produce specialized models without the cost of training from scratch.
Fine-tuning adapts a pre-trained LLM to a specific task or domain by training on your data. Instead of building a model from scratch (millions of dollars, months of compute), you take an existing model and teach it your domain in hours with hundreds of examples. The result is a model that speaks your language, follows your conventions, and handles your edge cases.
When to Fine-Tune
Don't fine-tune (use prompting instead):
☐ Few-shot examples solve the problem
☐ Task is general-purpose (translation, summarization)
☐ You have < 100 training examples
☐ Requirements change frequently
Fine-tune when:
☐ Prompting consistently fails on your domain
☐ You need specific output format/style
☐ Latency matters (fine-tuned = shorter prompts)
☐ Cost matters (shorter prompts = cheaper inference)
☐ You have 500+ high-quality training examples
☐ Domain-specific terminology and patterns
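The latency and cost items above come down to prompt length: a fine-tuned model bakes the instructions and few-shot examples into its weights, so each request sends far fewer input tokens. A rough illustration of the arithmetic (the token counts and per-token price below are made-up assumptions, not real API rates):

```python
# Illustrative only: token counts and price are assumed, not real API rates.
PRICE_PER_1K_INPUT_TOKENS = 0.01  # hypothetical rate

def monthly_input_cost(prompt_tokens, requests_per_month):
    """Input-token cost for a month of requests at the assumed rate."""
    return prompt_tokens * requests_per_month / 1000 * PRICE_PER_1K_INPUT_TOKENS

# Prompted baseline: system prompt + few-shot examples + the actual query
prompted = monthly_input_cost(prompt_tokens=2500, requests_per_month=100_000)
# Fine-tuned: instructions live in the weights; only the query is sent
fine_tuned = monthly_input_cost(prompt_tokens=300, requests_per_month=100_000)

print(f"prompted:   ${prompted:,.2f}/month")
print(f"fine-tuned: ${fine_tuned:,.2f}/month")
```

The same shorter prompt also cuts time-to-first-token, which is where the latency benefit comes from.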
Fine-Tuning Methods
Full Fine-Tuning:
Update all model parameters
Cost: Very high (weights, gradients, and optimizer states must all fit in GPU memory)
Quality: Best results
When: You have the compute budget and large dataset
LoRA (Low-Rank Adaptation):
Freeze base model, add small trainable adapters
Cost: Low (< 1% of parameters trained)
Quality: 95%+ of full fine-tuning quality
When: Limited GPU memory, quick iteration
QLoRA (Quantized LoRA):
Quantize model to 4-bit, apply LoRA adapters
Cost: Very low (fits 70B model on single GPU)
Quality: 90-95% of full fine-tuning quality
When: Consumer GPU, largest models
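The cost differences above follow from how many parameters each method trains. In LoRA, each adapted weight matrix of shape (d_out, d_in) gains two small matrices A (r × d_in) and B (d_out × r), so the adapter adds r·(d_in + d_out) trainable parameters per matrix. A back-of-envelope helper (the layer shapes below are illustrative, not any specific model's exact dimensions):

```python
def lora_params(r, layer_shapes):
    """Trainable parameters added by LoRA adapters.

    Each target matrix of shape (d_out, d_in) gets A (r x d_in) and
    B (d_out x r), i.e. r * (d_in + d_out) extra trainable params.
    """
    return sum(r * (d_in + d_out) for d_in, d_out in layer_shapes)

# Toy transformer: 4 attention projections of 4096x4096 over 32 layers
shapes = [(4096, 4096)] * 4 * 32
full = sum(d_in * d_out for d_in, d_out in shapes)

added = lora_params(r=16, layer_shapes=shapes)
print(f"LoRA adds {added:,} params vs {full:,} in the targeted matrices "
      f"({added / full:.2%})")
```

Because the adapter size scales with r·(d_in + d_out) rather than d_in·d_out, the trainable fraction stays under 1% even for large models, which is what keeps LoRA's memory footprint low.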
Dataset Preparation
# Training data format (instruction fine-tuning)
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a medical coding assistant."},
            {"role": "user", "content": "Patient presents with acute bronchitis, prescribed azithromycin."},
            {"role": "assistant", "content": "ICD-10: J20.9 (Acute bronchitis, unspecified)\nCPT: 99213 (Office visit, established patient)\nNDC: 0069-3150-75 (Azithromycin 250mg)"}
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a medical coding assistant."},
            {"role": "user", "content": "Follow-up for type 2 diabetes, HbA1c 7.2%, adjusted metformin."},
            {"role": "assistant", "content": "ICD-10: E11.65 (Type 2 diabetes with hyperglycemia)\nCPT: 99214 (Office visit, moderate complexity)\nLab: 83036 (HbA1c)\nNDC: 0093-7214-01 (Metformin 500mg)"}
        ]
    },
    # ... 500+ examples
]
# Data quality checklist:
# ☐ Diverse examples covering edge cases
# ☐ Consistent format in all outputs
# ☐ Verified by domain experts
# ☐ No PII or sensitive data
# ☐ Balanced across categories
# ☐ 80/10/10 train/validation/test split
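The last checklist item can be implemented with a seeded shuffle so the split is reproducible across runs. A minimal sketch (the placeholder example list and JSONL output format are assumptions; adapt the serialization to whatever your trainer ingests):

```python
import json
import random

def split_dataset(examples, train=0.8, val=0.1, seed=42):
    """Shuffle once with a fixed seed, then slice into train/val/test."""
    data = list(examples)
    random.Random(seed).shuffle(data)
    n_train = int(len(data) * train)
    n_val = int(len(data) * val)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

def write_jsonl(path, examples):
    """One JSON object per line -- a common format for SFT trainers."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")

examples = [{"id": i} for i in range(500)]  # stand-in for training_data above
train, val, test = split_dataset(examples)  # 80/10/10 by default
write_jsonl("train.jsonl", train)
```

Seeding the shuffle matters: if the split changes between runs, validation metrics from different experiments are not comparable.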
LoRA Fine-Tuning
import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
# LoRA configuration
lora_config = LoraConfig(
    r=16,              # Rank (higher = more capacity, more memory)
    lora_alpha=32,     # Scaling factor
    lora_dropout=0.05, # Regularization
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Trainable params are a small fraction (well under 1%) of the 8B total
# Training
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=TrainingArguments(
        output_dir="./fine-tuned-model",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
        warmup_steps=100,
        logging_steps=10,
        eval_strategy="steps",
        eval_steps=50,
        save_strategy="steps",
        save_steps=100,
    ),
)
trainer.train()
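After training, the adapters can be folded into the base weights for deployment, which removes the per-request overhead of routing through separate adapter layers. A sketch using PEFT's merge API (the checkpoint paths are placeholders; this assumes the adapter was saved to the training output directory above):

```python
from peft import AutoPeftModelForCausalLM

# Load the saved adapter checkpoint; merge_and_unload() folds the LoRA
# update (B @ A, scaled by alpha/r) into each target matrix and returns
# a plain transformers model with no PEFT wrapper.
model = AutoPeftModelForCausalLM.from_pretrained("./fine-tuned-model")  # placeholder path
merged = model.merge_and_unload()

merged.save_pretrained("./fine-tuned-model/merged")  # placeholder path
```

The merged checkpoint can then be served like any ordinary model, e.g. loaded with `AutoModelForCausalLM.from_pretrained`.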
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Fine-tune before trying prompting | Wasted time and compute | Start with few-shot prompting |
| Low-quality training data | Model learns bad patterns | Expert-verified, diverse examples |
| No evaluation dataset | Cannot measure improvement | Hold out 10% for evaluation |
| Overfitting on small dataset | Works on training data, fails on new data | LoRA dropout, early stopping |
| Not merging adapters for production | Inference overhead from adapter loading | Merge LoRA into base model |
Fine-tuning is powerful but not magic. It works best when you have high-quality training data, a clear task definition, and have already tried prompting first.