LLM Fine-tuning to Production: QLoRA and RLHF in Practice

Simi included in AI

2025-12-25 1252 words 6 minutes


Training is only half the battle. Once your fine-tuned model hits a good validation loss, the harder challenge begins: serving it reliably in production, tracking its behavior, and managing model versions as you continue training.

How this post relates to another one on this blog: the Local LLM Fine-tuning Guide covers the training pipeline (how to run QLoRA training locally with Ollama + Unsloth). This article focuses on what comes after training: production deployment.

Training Done — What’s Next?

QLoRA fine-tuning produces LoRA adapter weights as a .safetensors file (or, once the adapter is merged into the base model, a full Llama/Mistral checkpoint). Production needs more than weights:

Serving layer: An inference server that handles concurrent requests efficiently
Version management: Record which model version maps to which training data and config
A/B testing: How to safely roll out a new version without impacting existing users
Monitoring: Has the model’s behavior drifted? Is quality degrading?
Rollback: Ability to switch back to the previous version in minutes when things go wrong

Each of these is an independent engineering challenge. Training can be solved with a script; production deployment needs infrastructure.

Model Serving Layer

Option 1: vLLM (Preferred for GPU Serving)

vLLM is the mainstream solution for GPU deployment of open-source LLMs. Its core advantages are PagedAttention and Continuous Batching:

PagedAttention: Manages KV cache in pages, significantly reducing memory fragmentation and enabling larger batch sizes
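Continuous Batching: Schedules at the token-generation step, so new requests join the running batch as soon as slots free up instead of waiting for a full batch to finish; this raises throughput and cuts queueing delay, which lowers P95 latency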

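# Deploy your fine-tuned model with vLLM
pip install vllm

# --model expects a full checkpoint, so merge the LoRA adapter into the base model first
# --tensor-parallel-size 2 shards the model across 2 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model ./my-finetuned-model \
    --tensor-parallel-size 2 \
    --max-model-len 4096 \
    --port 8000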

vLLM exposes an OpenAI-compatible API, so existing OpenAI SDK code works with minimal changes: point base_url at your deployment and set the model name to the one vLLM serves.
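
As a minimal sketch of that swap, the snippet below builds a chat completion request against the server started above using only the standard library. The URL assumes the server is reachable on localhost at the port from the --port 8000 flag, and the model name assumes vLLM is serving the model under the path passed to --model (its default unless a different served name is configured). The helper names build_chat_request and chat are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # assumption: vLLM server from the command above
MODEL_NAME = "./my-finetuned-model"     # assumption: served name equals the --model value


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request (nothing is sent yet)."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the official openai package, the equivalent change is passing the same base URL (plus any placeholder API key) to the client constructor; the rest of the calling code stays as it was.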

Key metrics to expose and track:

TTFT (Time to First Token): user-perceived responsiveness
Token Throughput (tokens/second): overall serving capacity
P95 Latency: tail latency, reflecting worst-case experience
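
To make these three definitions concrete, here is a rough sketch that computes them from per-request timestamps you record yourself, for example in a load-test client. The RequestTiming fields and the summarize helper are illustrative names, not part of vLLM, and the percentile uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Timestamps (in seconds) recorded for one request; field names are illustrative."""
    sent_at: float
    first_token_at: float
    finished_at: float
    output_tokens: int


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(timings: list[RequestTiming]) -> dict[str, float]:
    """Aggregate TTFT, tail latency, and throughput over a non-empty batch of requests."""
    ttfts = [t.first_token_at - t.sent_at for t in timings]
    latencies = [t.finished_at - t.sent_at for t in timings]
    # Throughput is measured over the wall-clock window covering all requests
    window = max(t.finished_at for t in timings) - min(t.sent_at for t in timings)
    return {
        "mean_ttft_s": sum(ttfts) / len(ttfts),
        "p95_latency_s": percentile(latencies, 95),
        "tokens_per_s": sum(t.output_tokens for t in timings) / window,
    }
```

Client-side numbers like these include network time, so they complement rather than replace whatever metrics the serving layer itself reports.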

Option 2: Ollama (Lightweight / Edge)

For small-scale deployments (< 10 concurrent requests) or edge devices, Ollama is simpler. It converts Safetensors weights on import and can quantize them in the same step (the --quantize option of ollama create):

# Modelfile example
FROM ./merged-model-directory
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM "You are a professional code review assistant."

# Convert fine-tuned model to Ollama format (run in the directory containing the Modelfile)
ollama create my-model -f Modelfile

Option 3: HuggingFace Inference Endpoints

HuggingFace Inference Endpoints is a managed option suitable when you don’t want to run your own GPU cluster. Upload the model to HuggingFace Hub, get an API endpoint in minutes, with auto-scaling built in. The tradeoff is higher cost than self-hosted, and data privacy requires additional consideration.

Model Version Management

Core principle: never overwrite a model version in production. Every training run creates a new version; old versions stay available as hot standby.

Tracking Versions with MLflow

MLflow records complete metadata for each training run:

import mlflow

# model_wrapper is assumed to be an instance of an mlflow.pyfunc.PythonModel
# subclass that loads your fine-tuned model and implements predict()

with mlflow.start_run(run_name="llama3-v2-2025-12"):
    # Log training config
    mlflow.log_params({
        "base_model": "unsloth/llama-3.2-3B",
        "dataset_version": "customer-support-v3",
        "lora_r": 16,
        "learning_rate": 2e-4,
        "epochs": 3,
    })

    # Log training metrics
    mlflow.log_metrics({
        "val_loss": 1.23,
        "eval_accuracy": 0.89,
    })

    # Register model version (each call creates a new version under this name)
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=model_wrapper,
        registered_model_name="customer-support-llm",
    )

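Each registered version gets a creation timestamp and a link back to its source run. When the training script runs inside a git repository and MLflow can detect it, the run also records the commit hash (the mlflow.source.git.commit tag), which gives you traceability later.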

W&B Model Registry

W&B Model Registry is another mature option, especially good for team collaboration. It supports lifecycle management (candidate → staging → production) and diff comparison between versions.

Recommended version naming convention — include three dimensions:

{base_model}-{dataset_version}-{date}
# e.g.: llama3.2-3b-custsupp-v3-20251215
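
Generating the name in code keeps it consistent across training runs. The helper below is a small sketch of the convention above; version_name is a hypothetical function, and the choice to lowercase everything and collapse unusual characters into hyphens is an assumption rather than part of the convention.

```python
import re
from datetime import date


def version_name(base_model: str, dataset_version: str, trained_on: date) -> str:
    """Join the three dimensions into one lowercase, hyphen-separated name."""
    def slug(text: str) -> str:
        # Keep letters, digits, and dots; collapse everything else into single hyphens
        return re.sub(r"[^a-z0-9.]+", "-", text.lower()).strip("-")

    return f"{slug(base_model)}-{slug(dataset_version)}-{trained_on:%Y%m%d}"
```

Calling version_name("Llama3.2-3B", "custsupp-v3", date(2025, 12, 15)) reproduces the example name shown above.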

A/B Testing: Safe Model Rollout

Replacing the production model directly is the riskiest approach. The right method is phased traffic rollout:

Shadow Mode

New model runs alongside old model; new model output is logged only, never returned to users:

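import asyncio
import logging

# Keep strong references so pending shadow tasks are not garbage-collected
_shadow_tasks: set[asyncio.Task] = set()

async def process_request(prompt: str) -> str:
    # Primary path: respond to user with old model
    production_output = await old_model.generate(prompt)

    # Shadow path: run new model in the background, log only, don't return
    task = asyncio.create_task(shadow_evaluate(prompt, production_output))
    _shadow_tasks.add(task)
    task.add_done_callback(_shadow_tasks.discard)

    return production_output

async def shadow_evaluate(prompt: str, production_output: str) -> None:
    try:
        new_output = await new_model.generate(prompt)
        # Log both outputs for comparison analysis
        await log_comparison(prompt, production_output, new_output)
    except Exception:
        # A failing shadow model must never affect the user-facing path
        logging.exception("shadow evaluation failed")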

Run shadow mode for 24-48 hours, then analyze the collected data to evaluate quality differences before deciding whether to proceed with a real rollout.

Traffic Split Rollout

Week 1: 5%  → monitor metrics → pass gate → continue
Week 2: 20% → monitor metrics → pass gate → continue
Week 3: 50% → monitor metrics → pass gate → continue
Week 4: 100% (rollout complete)
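
One common way to implement the split is hash-based bucketing, sketched below under the assumption that each request carries a stable user ID. The function names and the salt string are hypothetical. Because the bucket depends only on the ID, a user who lands on the new model at 5% stays on it at 20% and 50%, which keeps their experience consistent across stages.

```python
import hashlib


def rollout_bucket(user_id: str, salt: str = "model-rollout") -> int:
    """Map a user to a stable bucket in [0, 100) using a hash of their ID."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def pick_model(user_id: str, new_model_percent: int) -> str:
    """Route to the new model when the user's bucket falls below the rollout percentage."""
    return "new" if rollout_bucket(user_id) < new_model_percent else "old"
```

Changing the salt reshuffles the buckets, which is useful if you want a different 5% of users for the next rollout.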

Gate criteria at each stage:

Metric	Requirement
Task completion rate	New model ≥ old model - 2%
P95 latency	New model ≤ old model + 20%
Error rate	New model ≤ old model + 0.5%
User satisfaction (if tracked)	New model ≥ old model - 5%

If any metric misses its gate, pause the rollout and analyze before continuing.
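
The gate check itself is easy to automate. The sketch below encodes the first three rows of the table, under two assumptions the table does not state: the rate thresholds (2% and 0.5%) are read as absolute percentage points on rates stored as fractions, and the latency threshold (20%) is read as relative to the old model. StageMetrics and failed_gates are hypothetical names, and user satisfaction is left out because it is only tracked in some setups.

```python
from dataclasses import dataclass


@dataclass
class StageMetrics:
    task_completion_rate: float  # fraction in [0, 1]
    p95_latency_ms: float
    error_rate: float            # fraction in [0, 1]


def failed_gates(old: StageMetrics, new: StageMetrics) -> list[str]:
    """Return the names of gates the new model misses; an empty list means pass."""
    failures = []
    # Assumption: 2% means 2 percentage points of completion rate
    if new.task_completion_rate < old.task_completion_rate - 0.02:
        failures.append("task_completion_rate")
    # Assumption: 20% is relative to the old model's P95 latency
    if new.p95_latency_ms > old.p95_latency_ms * 1.20:
        failures.append("p95_latency")
    # Assumption: 0.5% means 0.5 percentage points of error rate
    if new.error_rate > old.error_rate + 0.005:
        failures.append("error_rate")
    return failures
```

Returning the list of failed gates, rather than a single boolean, tells whoever pauses the rollout where to start looking.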

RLHF in Production Practice

Production is the best source of RLHF data. Real user feedback is more valuable than hand-labeled data because it reflects actual use cases and preferences.

Collecting Production Preference Data

The simplest implementation: add 👍/👎 buttons under responses and log the feedback:

# Collect preference data
preference_data = {
    "prompt": user_prompt,
    "chosen": response_user_liked,      # response user thumbed up
    "rejected": response_user_disliked, # response user thumbed down or ignored
}
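
A single thumbs-up does not make a pair on its own: a preference pair needs a liked and a disliked response to the same prompt. The sketch below shows one way to assemble pairs from a feedback log, assuming each log entry is a dict with prompt, response, and rating ("up" or "down") keys. That log schema and the build_preference_pairs name are hypothetical, and ignored responses are skipped here rather than counted as rejections.

```python
from collections import defaultdict
from itertools import product


def build_preference_pairs(feedback_log: list[dict]) -> list[dict]:
    """Pair every thumbs-up response with every thumbs-down response to the same prompt."""
    liked, disliked = defaultdict(list), defaultdict(list)
    for event in feedback_log:
        if event["rating"] == "up":
            liked[event["prompt"]].append(event["response"])
        elif event["rating"] == "down":
            disliked[event["prompt"]].append(event["response"])
        # Any other rating value (e.g. no feedback) is skipped in this sketch

    pairs = []
    for prompt, chosen_responses in liked.items():
        for chosen, rejected in product(chosen_responses, disliked.get(prompt, [])):
            if chosen != rejected:  # identical text carries no preference signal
                pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

Prompts that only ever received one kind of rating produce no pairs, which is why regenerate buttons or side-by-side comparisons tend to yield far more usable data than thumbs alone.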

Once you’ve accumulated a few thousand preference pairs, use DPO (Direct Preference Optimization) to fine-tune — it’s more stable than traditional PPO and doesn’t require training a separate reward model:

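from trl import DPOTrainer, DPOConfig

# model, ref_model, tokenizer, and preference_dataset are assumed to be loaded
# elsewhere; the dataset needs "prompt", "chosen", and "rejected" columns

training_args = DPOConfig(
    beta=0.1,           # KL divergence penalty coefficient
    output_dir="./dpo-output",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=5e-6,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,   # original SFT model as reference
    args=training_args,
    train_dataset=preference_dataset,
    processing_class=tokenizer,  # this argument was named tokenizer in older TRL releases
)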

trainer.train()

The TRL library provides implementations of DPO, PPO, GRPO, and other RLHF algorithms, and is one of the most widely used toolkits for this kind of training.

Monitoring and Rollback

Quality Monitoring

Traditional model monitoring tracks latency and error rates. LLMs also need output quality monitoring. The common approach in production is LLM-as-Judge:

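import re

async def evaluate_output_quality(prompt: str, response: str) -> float:
    """Use another LLM to evaluate output quality; return 0-1 score"""
    eval_prompt = f"""
    Evaluate the quality of the following response on a scale of 0-10.
    Return only the number.

    Question: {prompt}
    Response: {response}

    Criteria: accuracy, relevance, completeness, fluency
    """
    score_text = await judge_model.generate(eval_prompt)
    # Judges sometimes add stray text, so extract the first number
    # instead of trusting the raw string
    match = re.search(r"\d+(?:\.\d+)?", score_text)
    if match is None:
        raise ValueError(f"judge returned no score: {score_text!r}")
    # Clamp to [0, 1] in case the judge answers outside the 0-10 scale
    return min(max(float(match.group()) / 10, 0.0), 1.0)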

Sample 5-10% of production traffic, auto-evaluate each sampled output, and build a quality trend chart. If quality scores drop below a threshold (e.g., more than 5% below baseline for 3 consecutive hours), fire an alert.
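
The example alert rule can be written as a small function. This is a sketch under stated assumptions: scores are already aggregated into one mean per hour, "5% below baseline" is read as a relative drop, and should_alert with its parameter names is hypothetical.

```python
def should_alert(
    hourly_scores: list[float],
    baseline: float,
    max_relative_drop: float = 0.05,
    consecutive_hours: int = 3,
) -> bool:
    """True when the most recent `consecutive_hours` hourly means all sit below the threshold."""
    if len(hourly_scores) < consecutive_hours:
        return False  # not enough history to judge yet
    threshold = baseline * (1 - max_relative_drop)
    return all(score < threshold for score in hourly_scores[-consecutive_hours:])
```

Requiring several consecutive bad hours trades detection speed for fewer false alarms, since a single noisy hour of judge scores will not page anyone.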

Drift Detection

Beyond quality scores, monitor output distribution changes:

Is the output length distribution shifting?
Are certain response types (refusals, uncertainty expressions) increasing?
Are keyword or topic distributions drifting?

These metrics changing suddenly are often early signals of model behavior change — faster to catch than waiting for user complaints.
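
Two of these signals, output length and refusal frequency, are cheap to compute from sampled responses. The sketch below compares a baseline sample against a current one; both are assumed to be non-empty lists of response strings. The phrase list is a crude stand-in (a production system would more likely use a classifier), and drift_report is a hypothetical name.

```python
from statistics import mean

# Assumption: a crude phrase list used only for illustration
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i am unable")


def drift_report(baseline: list[str], current: list[str]) -> dict[str, float]:
    """Compare mean output length and refusal rate between two samples of responses."""
    def mean_words(responses: list[str]) -> float:
        return mean(len(r.split()) for r in responses)

    def refusal_rate(responses: list[str]) -> float:
        hits = sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)
        return hits / len(responses)

    return {
        # Relative change in mean word count (e.g. -0.2 means 20% shorter)
        "length_change": mean_words(current) / mean_words(baseline) - 1,
        # Absolute change in the fraction of responses that look like refusals
        "refusal_rate_change": refusal_rate(current) - refusal_rate(baseline),
    }
```

Comparing means is the simplest possible check; if it proves too coarse, the same samples can feed a proper distribution test later.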

Rollback Playbook

Rollback must complete in under 5 minutes:

# Quick model version switch with vLLM + Nginx

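# 1. Confirm old version instance is still running (hot standby);
#    -f makes curl exit non-zero on an HTTP error status
curl -f http://old-model-service:8000/health

# 2. Switch Nginx upstream (config pre-updated to point at old version);
#    test the config first so a broken file never gets reloaded
nginx -t && nginx -s reload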

# 3. Verify traffic has switched
curl http://api.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "customer-support", "messages": [{"role": "user", "content": "test"}]}'

# 4. After confirming new version stopped receiving traffic,
#    document the issue and schedule a postmortem

The key is keeping the old version in hot standby at all times — don’t release its resources until the switch is fully complete and verified.

Training a model is the easy part; productionizing it is the real engineering challenge. The workflow above isn’t built all at once. Start with a basic vLLM deployment, then layer in version management, A/B testing, and monitoring incrementally — stabilize each stage before moving to the next.