LLM Fine-tuning to Production: QLoRA and RLHF in Practice
QLoRA Config
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.2-3B",
    load_in_4bit=True,  # 4-bit quantization: the "Q" in QLoRA
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "v_proj"],
)

RLHF Pipeline
# 1. SFT (supervised fine-tuning)
from trl import SFTTrainer  # TRL provides the trainer classes
trainer = SFTTrainer(model=model, ...)
trainer.train()
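The r=16 adapter configured above trains two small matrices instead of the full weight; the merged weight is W + (alpha/r)·B·A. A toy NumPy illustration of the idea (dimensions and the alpha value here are made up for the example, not taken from the config above):

```python
import numpy as np

d, r, alpha = 64, 16, 16                 # hidden size, LoRA rank, scaling (toy values)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))          # frozen base weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-initialized

# Effective weight after merging the adapter
W_eff = W + (alpha / r) * (B @ A)

# At initialization B is zero, so the adapter starts as a no-op
print(np.allclose(W_eff, W))             # True

# Parameter savings: 2*d*r trainable values vs d*d frozen ones
print(2 * d * r, "trainable vs", d * d, "frozen")
```

Zero-initializing B is why fine-tuning starts exactly at the base model's behavior; only gradient updates move it away.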
# 2. Reward Modeling
# train a reward model to distinguish good/bad answers
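Step 2 above is typically trained with a pairwise (Bradley–Terry) loss: the reward model should score the chosen answer above the rejected one. A minimal sketch in plain Python (the scores fed in are made-up numbers):

```python
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected): small when chosen outscores rejected."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Equal scores: the model can't tell the answers apart -> loss = ln(2)
print(round(pairwise_loss(1.0, 1.0), 4))   # 0.6931

# Chosen answer clearly preferred -> loss near zero
print(round(pairwise_loss(3.0, -1.0), 4))  # 0.0181
```

Minimizing this loss over many (chosen, rejected) pairs is what turns a plain LM head into a scalar reward model.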
# 3. RLHF
# use PPO to optimize the LLM to maximize the learned reward

Production Notes
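To make step 3 of the pipeline concrete: PPO does not maximize the reward model's score alone; a KL penalty keeps the policy close to the SFT reference so it can't reward-hack into gibberish. A per-token sketch in plain Python (beta and the log-probabilities are illustrative numbers, not from any real run):

```python
def penalized_reward(rm_score: float,
                     logp_policy: float,
                     logp_ref: float,
                     beta: float = 0.1) -> float:
    """Reward-model score minus a KL penalty against the SFT reference."""
    kl = logp_policy - logp_ref  # per-token KL estimate
    return rm_score - beta * kl

# Policy still close to the reference: nearly the raw score
print(round(penalized_reward(2.0, -1.0, -1.1), 2))  # 1.99

# Policy drifting far from the reference: noticeably penalized
print(round(penalized_reward(2.0, -0.2, -4.0), 2))  # 1.62
```

Tuning beta trades off reward maximization against staying faithful to the SFT model's style and fluency.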
✅ use QLoRA to save VRAM
✅ prepare 1,000+ high-quality samples
✅ use RLHF, not pure SFT, for alignment
❌ don't over-train (watch the validation loss)
❌ don't train on noisy data

Conclusion
QLoRA + RLHF is a production-grade fine-tuning recipe. That said, RAG is better value for most scenarios; reach for fine-tuning when retrieval alone can't change the model's behavior.