Contents

Fine-tuning LLMs Locally: Ollama + Unsloth in Practice

When Fine-tuning Is Worth It

Fine-tuning is expensive. Ask yourself before starting:

Question 1: Can prompt engineering solve this?
  → Yes: don't fine-tune

Question 2: Can RAG solve this?
  → Yes: don't fine-tune

Question 3: Is the model's base capability insufficient?
  → Yes: consider fine-tuning

Worth fine-tuning for:

  • Role/tone: specific response style
  • Fixed output format: API response format
  • Vertical domain terminology: medical, legal, finance professional vocabulary
  • Private knowledge: company internal concepts, processes

Not worth fine-tuning for:

  • Introducing new knowledge (RAG better)
  • Fixing hallucinations (RAG + fact-checking)
  • Improving reasoning (stronger base model more worthwhile)

Tool Selection

Ollama + Modelfile

Ollama supports lightweight customization via a Modelfile:

# Create tuning config
FROM llama3.2
PARAMETER temperature 0.7
SYSTEM """
You are a senior Go engineer.
Respond in Chinese, code examples in Go.
"""

This isn't real fine-tuning, but a system prompt is often enough for adjusting tone and response style.
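The Modelfile above is applied with the ollama CLI (the model name `go-mentor` here is arbitrary):

```shell
# Build a derived model from the Modelfile in the current directory
ollama create go-mentor -f Modelfile

# Chat with it; the SYSTEM prompt and parameters are baked in
ollama run go-mentor "How should I handle errors in a long-running worker?"
```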

Unsloth (Real Fine-tuning)

For real fine-tuning, use Unsloth:

# Install
pip install unsloth

# Supported models (examples)
# - Llama 3.2 (1B, 3B) / Llama 3.1 (8B, 70B)
# - Phi-3.5 (3.8B)
# - Gemma 2 (2B, 9B, 27B)

Unsloth's main advantage is QLoRA fine-tuning: base weights load in 4-bit, so a single 24GB consumer GPU can fine-tune models that would otherwise require multiple datacenter cards.

Data Preparation

Data Format

{"text": "<|user|>\nHelp me write a FastAPI endpoint\n<|assistant|>\nfrom fastapi import FastAPI\n\napp = FastAPI()\n\n@app.get(\"/users/{user_id}\")\ndef get_user(user_id: int):\n    return {\"id\": user_id}"}

Key points:

  • clear user/assistant role markers
  • each sample complete and self-contained
  • 1,000-10,000 samples is usually enough; don't blindly pile on quantity
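A quick sketch of building such a JSONL file and checking each line for the points above (helper names are made up for illustration; the marker strings must match your model's chat template):

```python
import json

MARKERS = ("<|user|>", "<|assistant|>")

def to_sample(user_msg, assistant_msg):
    # One complete, self-contained training sample
    return {"text": f"<|user|>\n{user_msg}\n<|assistant|>\n{assistant_msg}"}

def validate_line(line):
    """Valid if the line parses as JSON with a 'text' field containing both markers."""
    try:
        text = json.loads(line)["text"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return False
    return all(m in text for m in MARKERS)

samples = [to_sample("Help me write a FastAPI endpoint", "from fastapi import FastAPI ...")]
with open("train.jsonl", "w", encoding="utf-8") as f:
    for s in samples:
        f.write(json.dumps(s, ensure_ascii=False) + "\n")

# Re-read and verify every line before handing the file to the trainer
with open("train.jsonl", encoding="utf-8") as f:
    assert all(validate_line(line) for line in f)
```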

Data Quality > Quantity

# Bad data (10000)
{"text": "help me write code\nwrite code"}
{"text": "explain this\nthat"}

# Good data (1000)
{"text": "<|user|>...<|assistant|>..."}  # each complete, meaningful

Real-world result: 500 high-quality samples beat 10,000 noisy ones.
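The quality bar can be enforced mechanically. A rough filter sketching the idea (the threshold is arbitrary):

```python
def filter_samples(samples, min_chars=40):
    """Drop samples that are too short, lack role markers, or duplicate another sample."""
    seen, kept = set(), []
    for s in samples:
        text = s.get("text", "")
        if len(text) < min_chars:
            continue                      # too short to teach anything
        if "<|user|>" not in text or "<|assistant|>" not in text:
            continue                      # malformed: missing role markers
        if text in seen:
            continue                      # exact duplicate
        seen.add(text)
        kept.append(s)
    return kept

bad = {"text": "help me write code\nwrite code"}
good = {"text": "<|user|>\nWrite a binary search in Python\n<|assistant|>\ndef bsearch(a, x): ..."}
print(len(filter_samples([bad, good, good])))  # 1
```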

Training Workflow

Complete Workflow

from unsloth import FastLanguageModel
import torch

# 1. Load model (4-bit quantization)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.2-3B",
    max_seq_length=2048,
    load_in_4bit=True,
)

# 2. Add LoRA adapter
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,
)

# 3. Train
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size: 2 x 4 = 8
        warmup_steps=10,
        max_steps=100,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
    ),
)

trainer.train()
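As a sanity check, it helps to know how few parameters the r=16 LoRA config above actually trains. A back-of-envelope count, assuming Llama 3.2 3B dimensions (hidden size 3072, 28 layers, grouped-query attention with 8 KV heads of dim 128; illustrative numbers, not read from the checkpoint):

```python
hidden = 3072          # model hidden size
n_layers = 28
kv_dim = 8 * 128       # 8 KV heads x head_dim 128 (grouped-query attention)
r = 16                 # LoRA rank from the config above

def lora_params(d_in, d_out, r):
    # LoRA adds A (d_in x r) and B (r x d_out) alongside each targeted projection
    return r * (d_in + d_out)

per_layer = (
    lora_params(hidden, hidden, r)    # q_proj
    + lora_params(hidden, kv_dim, r)  # k_proj
    + lora_params(hidden, kv_dim, r)  # v_proj
    + lora_params(hidden, hidden, r)  # o_proj
)
total = per_layer * n_layers
print(f"{total / 1e6:.2f}M trainable parameters")  # 9.18M
```

Roughly 9M trainable parameters against ~3B frozen ones, which is why the adapter fits comfortably next to the 4-bit base weights.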

Inference

FastLanguageModel.for_inference(model)  # enable Unsloth's fast inference path
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

Common Pitfalls

1. Overfitting (Most Common)

# Symptom: after training, model perfect on training data, poor generalization
# Solution:
# - reduce training steps
# - increase data diversity
# - lower LoRA r value

# Check: watch validation loss not just train loss
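The "watch validation loss" advice can be wired up as a simple early-stopping check (a hypothetical helper; `transformers` also ships an `EarlyStoppingCallback` for this):

```python
class EarlyStopper:
    """Stop when validation loss hasn't improved for `patience` evaluations."""
    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience

stopper = EarlyStopper(patience=2)
val_losses = [1.2, 1.0, 0.9, 0.95, 1.05, 1.2]  # turns upward: overfitting begins
stop_step = next(i for i, loss in enumerate(val_losses) if stopper.should_stop(loss))
print(stop_step)  # 4 -- stops while train loss would still be falling
```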

2. Catastrophic Forgetting

# Symptom: after fine-tuning, model loses original capabilities (e.g., Chinese understanding)
# Solution:
# - use DPO (Direct Preference Optimization) instead of pure SFT
# - mix general data into training data (20-30%)
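The 20-30% mixing advice, sketched as a dataset-combining helper (names and ratio are illustrative):

```python
import random

def mix_datasets(domain, general, general_ratio=0.25, seed=0):
    """Mix general instruction data into domain data so it makes up general_ratio of the result."""
    rng = random.Random(seed)
    # solve n_general / (len(domain) + n_general) = general_ratio
    n_general = round(len(domain) * general_ratio / (1 - general_ratio))
    mixed = domain + rng.sample(general, min(n_general, len(general)))
    rng.shuffle(mixed)  # interleave so batches aren't all-domain or all-general
    return mixed

domain = [{"text": f"domain-{i}"} for i in range(750)]
general = [{"text": f"general-{i}"} for i in range(5000)]
mixed = mix_datasets(domain, general)
print(len(mixed))  # 1000 samples: 750 domain + 250 general (25%)
```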

3. Training Instability

# Symptom: loss diverges, NaN
# Solution:
# - lower learning rate (2e-4 → 5e-5)
# - increase warmup steps
# - check data format issues
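Increasing warmup means the learning rate ramps up gently instead of hitting its peak on step one. One common shape (linear warmup then linear decay, using the warmup_steps/max_steps values from the training config above):

```python
def lr_at(step, max_steps=100, warmup_steps=10, peak_lr=2e-4):
    """Linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr * (max_steps - step) / (max_steps - warmup_steps)

for step in (0, 5, 9, 50, 99):
    print(step, lr_at(step))
```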

What GPU Configs Can Run

GPU             Fine-tunable Models (QLoRA)     Time (1000 steps)
RTX 4070 12GB   Llama 3.2 1B / Phi-3.5 3.8B     ~3 hours
RTX 4080 16GB   Llama 3.2 3B / Gemma 2 2B       ~2 hours
RTX 4090 24GB   Llama 3.1 8B / Gemma 2 9B       ~3 hours
A100 80GB       Llama 3.1 70B                   ~1 hour
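These pairings follow from a rough VRAM estimate for QLoRA: 4-bit weights cost about 0.5 bytes per parameter, plus a few GB for LoRA optimizer state, activations, and CUDA overhead (rule-of-thumb numbers, not measurements):

```python
def qlora_vram_gb(n_params_billion, overhead_gb=4.0):
    """Very rough QLoRA VRAM floor: 4-bit weights plus fixed overhead."""
    weights_gb = n_params_billion * 0.5  # 4 bits ≈ 0.5 bytes per parameter
    return weights_gb + overhead_gb

for size in (1, 3, 8, 70):
    print(f"{size}B -> ~{qlora_vram_gb(size):.1f} GB")  # 70B -> ~39.0 GB
```

These are floors, not budgets: longer sequences and larger batches push activation memory well above them.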

Conclusion

Local fine-tuning is now genuinely accessible: an RTX 4080-class GPU is enough to get started.

But before fine-tuning, ask yourself: are prompt engineering and RAG really insufficient? Fine-tuning costs (time + data prep + iteration) are often higher than expected.

Scenarios that truly warrant fine-tuning are few, but when you pick the right one, the payoff is significant.