LLM Fine-tuning to Production: QLoRA and RLHF in Practice

QLoRA Config

from unsloth import FastLanguageModel

# Load the base model in 4-bit to cut VRAM usage
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.2-3B",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; only these small low-rank matrices are trained
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                  # LoRA rank
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_alpha=16,
)

RLHF Pipeline

# 1. SFT (supervised fine-tuning)
# teach the base model the target format and behavior on labeled examples
trainer = SFTTrainer(model=model, ...)
trainer.train()

# 2. Reward modeling
# train a reward model on human preference pairs to score good vs. bad answers

# 3. RLHF
# use PPO to optimize the SFT model to maximize the reward,
# with a KL penalty keeping it close to the SFT policy
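Step 2 above is typically trained with a pairwise (Bradley-Terry) loss: the reward of the human-preferred answer is pushed above the rejected one. A minimal sketch on scalar rewards; the reward values below are made up for illustration:

```python
import math

# Pairwise reward-model loss: -log(sigmoid(r_chosen - r_rejected)).
# Low loss when the preferred answer already scores higher,
# high loss when the ranking is inverted.
def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative values, not real model outputs:
print(reward_model_loss(2.0, 0.5))   # small loss: correct ranking
print(reward_model_loss(0.5, 2.0))   # large loss: wrong ranking
```

In step 3, PPO then maximizes this learned reward on the policy's own generations, minus the KL penalty that stops the model from drifting into degenerate reward-hacking outputs.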

Production Notes

✅ use QLoRA to cut VRAM requirements
✅ prepare 1,000+ high-quality, deduplicated samples
✅ prefer SFT + RLHF over pure SFT when you have preference data
❌ don't over-train: monitor validation loss and stop when it starts rising
❌ don't train on noisy data: filter and deduplicate first
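The "watch val loss" point can be enforced mechanically with simple early stopping. A framework-agnostic sketch; the loss curve below is fabricated for illustration:

```python
# Minimal early stopping on validation loss: stop once the loss has not
# improved for `patience` consecutive evaluations.
def early_stop_index(val_losses, patience=2):
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i  # stop here; keep the checkpoint that hit `best`
    return len(val_losses) - 1

# Fabricated curve: improves, then overfits and rises
losses = [1.20, 0.95, 0.80, 0.78, 0.81, 0.85, 0.90]
print(early_stop_index(losses))
```

In practice you would hook this logic (or your trainer's built-in early-stopping callback) to periodic eval runs and restore the best checkpoint at the end.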

Conclusion

QLoRA + RLHF make a practical production-grade fine-tuning stack: QLoRA keeps training affordable, and RLHF aligns outputs with human preferences.

That said, fine-tuning is not always the right tool. For injecting or updating factual knowledge, RAG is usually better value; reserve fine-tuning for behavior, format, and style.