Fine-tuning LLMs Locally: Ollama + Unsloth in Practice
When Fine-tuning Is Worth It
Fine-tuning is expensive. Ask yourself before starting:
Question 1: Can prompt engineering solve this?
→ Yes: don't fine-tune
Question 2: Can RAG solve this?
→ Yes: don't fine-tune
Question 3: Is the model's base capability insufficient?
→ Yes: consider fine-tuning

Worth fine-tuning for:
- Role/tone: specific response style
- Fixed output format: API response format
- Vertical domain terminology: medical, legal, finance professional vocabulary
- Private knowledge: company internal concepts, processes
Not worth fine-tuning for:
- Introducing new knowledge (RAG better)
- Fixing hallucinations (RAG + fact-checking)
- Improving reasoning (stronger base model more worthwhile)
Tool Selection
Ollama + Modelfile
Ollama supports tuning via Modelfile:
```
# Create tuning config (Modelfile)
FROM llama3.2
PARAMETER temperature 0.7
SYSTEM """
You are a senior Go engineer.
Respond in Chinese, code examples in Go.
"""
```
This isn't real fine-tuning, but a system prompt is often enough for adjusting tone and style.
Unsloth (Real Fine-tuning)
For real fine-tuning, use Unsloth:
```shell
# Install
pip install unsloth
```
Supported models include:
- Llama 3.2 (1B, 3B) / Llama 3.1 (8B, 70B)
- Phi-3.5 (3.8B)
- Gemma 2 (2B, 9B, 27B)

Unsloth's advantage is QLoRA fine-tuning: VRAM requirements drop dramatically, so a single 24GB consumer GPU can fine-tune surprisingly large models.
Data Preparation
Data Format
{"text": "<|user|>\nHelp me write a FastAPI endpoint\n<|assistant|>\nfrom fastapi import FastAPI\n\napp = FastAPI()\n\[email protected](\"/users/{user_id}\")\ndef get_user(user_id: int):\n return {\"id\": user_id}"}Key points:
- user/assistant role markers clear
- each sample complete and independent
- 1000-10000 samples enough, don’t blindly pile on quantity
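Samples in this format can be generated programmatically. A minimal sketch; the `<|user|>`/`<|assistant|>` tokens mirror the sample above and should be adjusted to match your base model's actual chat template:

```python
import json

def to_sample(user: str, assistant: str) -> str:
    """Serialize one chat pair as a single JSONL training line."""
    text = f"<|user|>\n{user}\n<|assistant|>\n{assistant}"
    # json.dumps escapes the newlines, so each sample stays on one line
    return json.dumps({"text": text}, ensure_ascii=False)

line = to_sample("Help me write a FastAPI endpoint",
                 "from fastapi import FastAPI\n# ...")
record = json.loads(line)  # round-trips as valid JSON
```

Writing one `to_sample(...)` result per line produces a JSONL file that loads directly with `datasets.load_dataset("json", ...)`.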
Data Quality » Quantity
```
# Bad data (10,000 samples)
{"text": "help me write code\nwrite code"}
{"text": "explain this\nthat"}

# Good data (1,000 samples)
{"text": "<|user|>...<|assistant|>..."}  # each one complete and meaningful
```
Real-world result: 500 high-quality samples beat 10,000 noisy ones.
Training Workflow
Complete Workflow
```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
import torch

# 1. Load model (4-bit quantization)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.2-3B",
    max_seq_length=2048,
    load_in_4bit=True,
)

# 2. Add LoRA adapter
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,
)

# 3. Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        max_steps=100,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        output_dir="outputs",
    ),
)
trainer.train()
```
Inference
```python
# Switch to inference mode (enables Unsloth's fast inference path)
FastLanguageModel.for_inference(model)

inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Common Pitfalls
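A guard that catches several of the pitfalls below is tracking validation loss alongside training loss. A minimal early-stop sketch in plain Python; the loss values are hypothetical, in practice they come from the trainer's eval loop:

```python
def should_stop(val_losses: list[float], patience: int = 3) -> bool:
    """Stop when validation loss hasn't improved for `patience` evals."""
    if len(val_losses) <= patience:
        return False
    best = min(val_losses[:-patience])
    recent = val_losses[-patience:]
    # No recent improvement over the earlier best -> likely overfitting
    return min(recent) >= best

history = [1.9, 1.5, 1.2, 1.1, 1.15, 1.2, 1.25]
print(should_stop(history))  # True: val loss climbing for the last 3 evals
```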
1. Overfitting (Most Common)
```
# Symptom: after training, the model is perfect on training data but generalizes poorly
# Solutions:
# - reduce training steps
# - increase data diversity
# - lower the LoRA r value
# Check: watch validation loss, not just training loss
```
2. Catastrophic Forgetting
```
# Symptom: after fine-tuning, the model loses original capabilities (e.g., Chinese understanding)
# Solutions:
# - use DPO (Direct Preference Optimization) instead of pure SFT
# - mix general data into the training set (20-30%)
```
3. Training Instability
```
# Symptom: loss diverges or goes NaN
# Solutions:
# - lower the learning rate (2e-4 → 5e-5)
# - increase warmup steps
# - check for data format issues
```
What GPU Configs Can Run
| GPU | Fine-tunable Models | Time (1000 steps) |
|---|---|---|
| RTX 4070 12GB | Llama 3.2 1B / Phi-3.5 3.8B | ~3 hours |
| RTX 4080 16GB | Llama 3.2 3B / Gemma 2 2B | ~2 hours |
| RTX 4090 24GB | Llama 3.1 8B / Gemma 2 9B | ~3 hours |
| A100 40GB | Llama 3.1 70B | ~1 hour |
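A rough way to sanity-check the table is a back-of-envelope VRAM estimate for QLoRA: 4-bit weights take about 0.5 bytes per parameter, plus a few GB of overhead for LoRA weights, optimizer state, and activations. This is a heuristic of mine, not Unsloth's official calculator:

```python
def qlora_vram_gb(params_b: float, overhead_gb: float = 4.0) -> float:
    """Rough QLoRA VRAM estimate: 0.5 bytes/param (4-bit) + fixed overhead."""
    weights_gb = params_b * 0.5  # billions of params * 0.5 bytes ~= GB
    return weights_gb + overhead_gb

for size in (3, 8, 70):
    print(f"{size}B -> ~{qlora_vram_gb(size):.0f} GB")
```

By this estimate a 3B model needs ~6 GB and an 8B model ~8 GB, which is consistent with the consumer-GPU rows above; real usage also depends on sequence length and batch size.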
Conclusion
Local fine-tuning is now genuinely accessible: an RTX 4080 or better is enough to experiment.
But before fine-tuning, ask yourself: are prompt engineering and RAG really insufficient? The cost of fine-tuning (time + data preparation + iteration) is often higher than expected.
Scenarios truly worth fine-tuning are few, but when it is the right tool, the effect is significant.