LLM Fine-tuning to Production: QLoRA and RLHF in Practice
Training is only half the battle. Once your fine-tuned model hits a good validation loss, the harder challenge begins: serving it reliably in production, tracking its behavior, and managing model versions as you continue training.
Scope note: a companion post on this blog, Local LLM Fine-tuning Guide, covers the training pipeline, namely how to run QLoRA training locally with Ollama + Unsloth. This article focuses on what comes after training: production deployment.
Training Done — What’s Next?
Fine-tuning produces a .safetensors weights file: either a LoRA adapter or, after merging it into the base model, full weights in the Llama/Mistral format. But production needs:
- Serving layer: An inference server that handles concurrent requests efficiently
- Version management: Record which model version maps to which training data and config
- A/B testing: How to safely roll out a new version without impacting existing users
- Monitoring: Has the model’s behavior drifted? Is quality degrading?
- Rollback: Ability to switch back to the previous version in minutes when things go wrong
Each of these is an independent engineering challenge. Training can be solved with a script; production deployment needs infrastructure.
Model Serving Layer
Option 1: vLLM (Preferred for GPU Serving)
vLLM is the mainstream solution for GPU deployment of open-source LLMs. Its core advantages are PagedAttention and Continuous Batching:
- PagedAttention: Manages KV cache in pages, significantly reducing memory fragmentation and enabling larger batch sizes
- Continuous Batching: Processes requests as they arrive rather than waiting for a full batch, reducing P95 latency
# Deploy your fine-tuned model with vLLM
pip install vllm
# --tensor-parallel-size 2 shards the model across two GPUs
python -m vllm.entrypoints.openai.api_server \
  --model ./my-finetuned-model \
  --tensor-parallel-size 2 \
  --max-model-len 4096 \
  --port 8000
vLLM exposes an OpenAI-compatible API, so existing OpenAI SDK code keeps working once you point base_url at your deployment.
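To make the OpenAI-compatible interface concrete, here is a minimal client sketch that uses only the Python standard library. It assumes the server started above is reachable at localhost:8000 and that no --served-model-name was set, in which case vLLM serves the model under the --model value. The helper names build_chat_request and chat are illustrative, not part of vLLM.

```python
import json
import urllib.request

# Assumption: the vLLM server started above is reachable at this address.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt: str, model: str = "./my-finetuned-model") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the vLLM server."""
    payload = {
        "model": model,  # matches the --model value unless a served name was configured
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling chat("Summarize our refund policy in one sentence.") against a running server should return the reply text. With the OpenAI SDK, the equivalent change is passing the same base URL when constructing the client.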
Key metrics to expose and track:
- TTFT (Time to First Token): user-perceived responsiveness
- Token Throughput (tokens/second): overall serving capacity
- P95 Latency: tail latency, reflecting worst-case experience
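If you record per-request timestamps yourself, the three metrics above can be derived as in the sketch below. The RequestTiming fields and the nearest-rank percentile definition are assumptions chosen for illustration; vLLM also exports its own Prometheus metrics at /metrics, which may be the simpler source in practice.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Timestamps (in seconds) recorded for one request; field names are illustrative."""
    start: float         # request received
    first_token: float   # first token sent back
    end: float           # last token sent back
    output_tokens: int   # number of generated tokens


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(timings: list[RequestTiming]) -> dict[str, float]:
    """Compute median TTFT, P95 end-to-end latency, and aggregate token throughput."""
    ttfts = [t.first_token - t.start for t in timings]
    latencies = [t.end - t.start for t in timings]
    # Throughput is measured over the wall-clock window covered by the batch of requests.
    window = max(t.end for t in timings) - min(t.start for t in timings)
    total_tokens = sum(t.output_tokens for t in timings)
    return {
        "ttft_p50_s": percentile(ttfts, 50),
        "latency_p95_s": percentile(latencies, 95),
        "tokens_per_s": total_tokens / window if window > 0 else 0.0,
    }
```

Nearest-rank keeps the reported P95 equal to a latency that some request actually experienced, which makes outliers easier to trace back to a specific request.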
Option 2: Ollama (Lightweight / Edge)
For small-scale deployments (< 10 concurrent requests) or edge devices, Ollama is simpler. It converts imported weights to its own format during ollama create and can quantize them in the same step:
# Convert fine-tuned model to Ollama format
ollama create my-model -f Modelfile
# Modelfile example
FROM ./merged-model-directory
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM "You are a professional code review assistant."
Option 3: HuggingFace Inference Endpoints
HuggingFace Inference Endpoints is a managed option suitable when you don’t want to run your own GPU cluster. Upload the model to HuggingFace Hub, get an API endpoint in minutes, with auto-scaling built in. The tradeoff is higher cost than self-hosted, and data privacy requires additional consideration.
Model Version Management
Core principle: never overwrite a model version in production. Every training run creates a new version; old versions stay available as hot standby.
Tracking Versions with MLflow
MLflow records complete metadata for each training run:
import mlflow

# model_wrapper is assumed to be an mlflow.pyfunc.PythonModel instance
# that wraps the fine-tuned model; it is defined elsewhere.
with mlflow.start_run(run_name="llama3-v2-2025-12"):
    # Log training config
    mlflow.log_params({
        "base_model": "unsloth/llama-3.2-3B",
        "dataset_version": "customer-support-v3",
        "lora_r": 16,
        "learning_rate": 2e-4,
        "epochs": 3,
    })
    # Log training metrics
    mlflow.log_metrics({
        "val_loss": 1.23,
        "eval_accuracy": 0.89,
    })
    # Register model version
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=model_wrapper,
        registered_model_name="customer-support-llm",
    )
Each registered version carries a creation timestamp and links back to its run. When the training script runs inside a git repository, MLflow can also tag that run with the commit hash for later traceability.
W&B Model Registry
W&B Model Registry is another mature option, especially good for team collaboration. It supports lifecycle management (candidate → staging → production) and diff comparison between versions.
Recommended version naming convention — include three dimensions:
{base_model}-{dataset_version}-{date}
# e.g.: llama3.2-3b-custsupp-v3-20251215
A/B Testing: Safe Model Rollout
Replacing the production model directly is the riskiest approach. The right method is phased traffic rollout:
Shadow Mode
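The new model runs alongside the old one; its output is logged only and never returned to users:
import asyncio
import logging

# old_model, new_model, and log_comparison are application-defined.
# Keep references so pending shadow tasks are not garbage-collected.
background_tasks: set[asyncio.Task] = set()

async def process_request(prompt: str) -> str:
    # Primary path: respond to user with old model
    production_output = await old_model.generate(prompt)
    # Shadow path: run new model async, log only, don't return
    task = asyncio.create_task(shadow_evaluate(prompt, production_output))
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return production_output

async def shadow_evaluate(prompt: str, production_output: str) -> None:
    try:
        new_output = await new_model.generate(prompt)
        # Log both outputs for comparison analysis
        await log_comparison(prompt, production_output, new_output)
    except Exception:
        # A shadow failure must never affect the user-facing path
        logging.exception("shadow evaluation failed")
Run shadow mode for 24-48 hours, then analyze the collected data to evaluate quality differences before deciding whether to proceed with a real rollout.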
Traffic Split Rollout
Week 1: 5% → monitor metrics → pass gate → continue
Week 2: 20% → monitor metrics → pass gate → continue
Week 3: 50% → monitor metrics → pass gate → continue
Week 4: 100% (rollout complete)
Gate criteria at each stage:
| Metric | Requirement |
|---|---|
| Task completion rate | New model ≥ old model - 2% |
| P95 latency | New model ≤ old model + 20% |
| Error rate | New model ≤ old model + 0.5% |
| User satisfaction (if tracked) | New model ≥ old model - 5% |
If any metric misses its gate, pause the rollout and analyze before continuing.
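The rollout stages and the gate table above can be sketched as two small functions: one that assigns each user to a model deterministically, and one that checks the gates. The metric key names are illustrative, and the code reads the -2%, +0.5%, and -5% margins as absolute percentage points and the latency margin as relative. That reading is an assumption, so adjust it if your gates are defined differently.

```python
import hashlib

# Rollout schedule from the plan above: fraction of traffic sent to the new model.
STAGES = [0.05, 0.20, 0.50, 1.00]


def assign_variant(user_id: str, new_fraction: float) -> str:
    """Deterministically bucket a user so they see the same model on every request."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "new" if bucket < new_fraction else "old"


def check_gates(old: dict[str, float], new: dict[str, float]) -> list[str]:
    """Return the names of failed gates; an empty list means the stage may advance.

    Rates are fractions in [0, 1]; latency is in seconds.
    """
    failures = []
    if new["task_completion_rate"] < old["task_completion_rate"] - 0.02:
        failures.append("task_completion_rate")
    if new["p95_latency_s"] > old["p95_latency_s"] * 1.20:
        failures.append("p95_latency_s")
    if new["error_rate"] > old["error_rate"] + 0.005:
        failures.append("error_rate")
    # User satisfaction is optional ("if tracked"), so only check it when both sides report it.
    if "user_satisfaction" in old and "user_satisfaction" in new:
        if new["user_satisfaction"] < old["user_satisfaction"] - 0.05:
            failures.append("user_satisfaction")
    return failures
```

Hashing the user ID rather than sampling per request keeps each user on one model for the whole stage, which makes per-user metrics such as satisfaction easier to compare.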
RLHF in Production Practice
Production is the best source of RLHF data. Real user feedback is more valuable than hand-labeled data because it reflects actual use cases and preferences.
Collecting Production Preference Data
The simplest implementation: add 👍/👎 buttons under responses and log the feedback:
# Collect preference data
preference_data = {
    "prompt": user_prompt,
    "chosen": response_user_liked,       # response user thumbed up
    "rejected": response_user_disliked,  # response user thumbed down or ignored
}
Once you’ve accumulated a few thousand preference pairs, use DPO (Direct Preference Optimization) to fine-tune. It is more stable than traditional PPO and doesn’t require training a separate reward model:
from trl import DPOTrainer, DPOConfig

# model, ref_model, tokenizer, and preference_dataset are assumed to be loaded already
training_args = DPOConfig(
    beta=0.1,  # KL divergence penalty coefficient
    output_dir="./dpo-output",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=5e-6,
    logging_steps=10,
)
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,  # original SFT model as reference
    args=training_args,
    train_dataset=preference_dataset,
    processing_class=tokenizer,  # older TRL releases name this argument `tokenizer`
)
trainer.train()
The TRL library provides implementations of DPO, PPO, GRPO, and other RLHF algorithms, and it is one of the most mature RLHF training toolkits available.
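A single thumbs-up or thumbs-down is not yet a preference pair: DPO needs a chosen and a rejected response for the same prompt. The sketch below shows one way to pair them up. It assumes a hypothetical feedback log format of {"prompt", "response", "rating"} events; your logging schema will likely differ.

```python
from collections import defaultdict
from itertools import product


def build_preference_pairs(feedback_log: list[dict]) -> list[dict]:
    """Turn per-response feedback events into prompt/chosen/rejected pairs.

    Each event is assumed to look like
    {"prompt": str, "response": str, "rating": "up" or "down"}.
    """
    liked = defaultdict(list)
    disliked = defaultdict(list)
    for event in feedback_log:
        target = liked if event["rating"] == "up" else disliked
        # Skip duplicate responses so the same pair is not emitted twice.
        if event["response"] not in target[event["prompt"]]:
            target[event["prompt"]].append(event["response"])

    pairs = []
    for prompt, chosen_list in liked.items():
        # Prompts with only likes or only dislikes yield no pairs.
        for chosen, rejected in product(chosen_list, disliked.get(prompt, [])):
            if chosen != rejected:
                pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

The resulting list uses the same prompt, chosen, and rejected keys as the preference_data record, so it can be loaded into a dataset for the trainer. Prompts that never received both kinds of feedback are dropped, which is why regenerated or A/B-served responses are the most productive source of pairs.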
Monitoring and Rollback
Quality Monitoring
Traditional model monitoring tracks latency and error rates. LLMs also need output quality monitoring. The common approach in production is LLM-as-Judge:
import re

async def evaluate_output_quality(prompt: str, response: str) -> float:
    """Use another LLM to evaluate output quality; return 0-1 score"""
    eval_prompt = f"""
Evaluate the quality of the following response on a scale of 0-10.
Return only the number.
Question: {prompt}
Response: {response}
Criteria: accuracy, relevance, completeness, fluency
"""
    # judge_model is assumed to be a separately configured LLM client
    score_text = await judge_model.generate(eval_prompt)
    # Judges sometimes add extra words, so extract the first number rather than parsing blindly
    match = re.search(r"\d+(?:\.\d+)?", score_text)
    if match is None:
        raise ValueError(f"judge returned no score: {score_text!r}")
    return min(float(match.group()), 10.0) / 10
Sample 5-10% of production traffic, auto-evaluate each sampled output, and build a quality trend chart. If quality scores drop below a threshold (e.g., more than 5% below baseline for 3 consecutive hours), fire an alert.
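The sampling and alerting rule described above could look like the following sketch. It assumes judge scores are already aggregated into hourly means and reads "5% below baseline" as a relative drop; the function names and default values are illustrative.

```python
import hashlib


def should_sample(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of requests for judge evaluation."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < rate


def should_alert(
    hourly_scores: list[float],
    baseline: float,
    max_drop: float = 0.05,
    consecutive_hours: int = 3,
) -> bool:
    """True if each of the last `consecutive_hours` hourly means is more than
    `max_drop` (relative) below the baseline score."""
    if len(hourly_scores) < consecutive_hours:
        return False
    threshold = baseline * (1 - max_drop)
    recent = hourly_scores[-consecutive_hours:]
    return all(score < threshold for score in recent)
```

Requiring several consecutive low hours, as the rule above does, filters out a single noisy hour at the cost of a slower alert.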
Drift Detection
Beyond quality scores, monitor output distribution changes:
- Is the output length distribution shifting?
- Are certain response types (refusals, uncertainty expressions) increasing?
- Are keyword or topic distributions drifting?
Sudden changes in these metrics are often early signals of a change in model behavior, and they surface faster than user complaints.
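As a rough sketch of the first two checks in the list above, the code below compares a baseline window of responses with a current one. The refusal markers, the word-count length measure, and both thresholds are placeholder assumptions that would need tuning on real traffic.

```python
import statistics

# Illustrative markers only; a real deployment would tune this list.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i am unable")


def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses containing a refusal marker (case-insensitive)."""
    if not responses:
        return 0.0
    hits = sum(any(marker in r.lower() for marker in REFUSAL_MARKERS) for r in responses)
    return hits / len(responses)


def length_shift(baseline: list[str], current: list[str]) -> float:
    """Shift in mean output length (in words), in units of the baseline standard deviation."""
    base_lengths = [len(r.split()) for r in baseline]
    cur_lengths = [len(r.split()) for r in current]
    spread = statistics.pstdev(base_lengths) or 1.0  # avoid dividing by zero
    return (statistics.mean(cur_lengths) - statistics.mean(base_lengths)) / spread


def drift_report(
    baseline: list[str],
    current: list[str],
    max_length_shift: float = 2.0,
    max_refusal_increase: float = 0.05,
) -> dict:
    """Summarize length and refusal drift and flag whether either exceeds its threshold."""
    report = {
        "length_shift": length_shift(baseline, current),
        "refusal_increase": refusal_rate(current) - refusal_rate(baseline),
    }
    report["drifted"] = (
        abs(report["length_shift"]) > max_length_shift
        or report["refusal_increase"] > max_refusal_increase
    )
    return report
```

Keyword or topic drift would need a richer comparison (for example, over token or embedding distributions), which this sketch does not attempt.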
Rollback Playbook
Rollback must complete in under 5 minutes:
# Quick model version switch with vLLM + Nginx
# 1. Confirm old version instance is still running (hot standby)
#    (-f makes curl exit non-zero on an HTTP error)
curl -f http://old-model-service:8000/health
# 2. Switch Nginx upstream (config pre-updated to point at old version)
#    (nginx -t validates the config before the reload)
nginx -t && nginx -s reload
# 3. Verify traffic has switched
curl http://api.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "customer-support", "messages": [{"role": "user", "content": "test"}]}'
# 4. After confirming new version stopped receiving traffic,
#    document the issue and schedule a postmortem
The key is keeping the old version in hot standby at all times. Don’t release its resources until the switch is fully complete and verified.
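To enforce the five-minute budget automatically, the health check in step 1 can be wrapped in a polling loop with a deadline. This is a minimal sketch: http_ok and wait_until_healthy are hypothetical helper names, and the clock and sleep parameters exist only so the loop can be tested without waiting.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """True if a GET to `url` returns HTTP 200; connection errors count as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:  # URLError, HTTPError, and socket timeouts are all OSError subclasses
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    deadline_s: float = 300.0,  # the 5-minute rollback budget
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it succeeds, or give up once the next attempt would miss the deadline."""
    start = clock()
    while True:
        if probe():
            return True
        if clock() - start + interval_s > deadline_s:
            return False
        sleep(interval_s)
```

A rollback script might call wait_until_healthy(lambda: http_ok("http://old-model-service:8000/health")) before reloading Nginx, and page a human if it returns False.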
Training a model is the easy part; productionizing it is the real engineering challenge. The workflow above isn’t built all at once. Start with a basic vLLM deployment, then layer in version management, A/B testing, and monitoring incrementally — stabilize each stage before moving to the next.