vLLM in Action: How to Run Open-Source LLMs Efficiently on GPU Servers
Why vLLM Matters
Everyone who has run open-source LLMs has hit this problem: the model loads into GPU memory, VRAM usage is absurdly high, yet GPU utilization stays low.
The problem with traditional LLM serving:

```
Input: "Hello, how are"
Tokens: [0, 1, 2, 3, ...] → [0, 1, 2, 3, 4, 5, ...]
    ↓
Attention calculation
    ↓
Every token needs access to the full KV cache
    ↓
KV cache pre-allocated as fixed-size contiguous blocks per request
    ↓
Massive memory fragmentation + low GPU utilization
```

Root cause: the traditional approach pre-allocates a fixed-length contiguous chunk of VRAM for each request's KV cache, but actual generation lengths vary wildly across requests, causing massive waste.
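To make the waste concrete, here is a back-of-the-envelope sketch. The request lengths and the ~0.5 MB/token KV-cache figure are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope: VRAM wasted by fixed-size pre-allocation.
# All numbers below are illustrative assumptions, not measurements.
MAX_SLOTS = 512                      # slots pre-allocated per request
KV_BYTES_PER_TOKEN = 0.5 * 1024**2   # ~0.5 MiB/token (7B-class model, fp16)

actual_lengths = [40, 200, 75, 512, 12, 130]  # hypothetical generation lengths

reserved = len(actual_lengths) * MAX_SLOTS
used = sum(actual_lengths)
waste = 1 - used / reserved
print(f"slots used: {used}/{reserved} ({waste:.0%} wasted)")
print(f"VRAM wasted: {(reserved - used) * KV_BYTES_PER_TOKEN / 1024**3:.2f} GiB")
```

With these hypothetical lengths, roughly two thirds of the reserved KV-cache memory is never used.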
vLLM’s Core: PagedAttention
vLLM comes from a UC Berkeley research team and was released in June 2023. Its core innovation is PagedAttention, inspired by OS virtual memory paging.
How It Works
Traditional approach:

```
Request 1: [Token 1][Token 2][Token 3][Empty...Empty]  (pre-allocated 512 slots)
Request 2: [Token 1][Token 2][Empty...Empty...Empty]   (pre-allocated 512 slots)
```

PagedAttention:

```
KV Cache:  [Block 0][Block 1][Block 2][Block 3][Block 4][Block 5][Block 6][Block 7]
───────────────────────────────────────────────────────────────
Request 1:  P1     | P1     | P1     | P1     |
Request 2:  P2     | P2     |        |        |
```

Each request's KV cache is no longer a contiguous chunk but is scattered across multiple blocks, allocated on demand, just like pages in OS virtual memory.
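The bookkeeping can be sketched as a minimal block table. This is an illustrative toy, not vLLM's actual implementation; class and method names are made up:

```python
# Toy sketch of paged KV-cache allocation (not vLLM's real code).
# Each request gets a block table mapping logical blocks to physical blocks,
# drawn on demand from a shared free pool, like an OS page table.
class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # shared physical pool
        self.block_tables = {}  # request_id -> list of physical block ids
        self.lengths = {}       # request_id -> tokens stored so far

    def append_token(self, rid: str) -> None:
        """Reserve one token slot; allocate a new block only on a block boundary."""
        n = self.lengths.get(rid, 0)
        if n % self.block_size == 0:  # current block full (or first token)
            self.block_tables.setdefault(rid, []).append(self.free_blocks.pop())
        self.lengths[rid] = n + 1

    def free(self, rid: str) -> None:
        """Return all of a finished request's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(rid, []))
        self.lengths.pop(rid, None)

cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(40):
    cache.append_token("req1")  # 40 tokens -> ceil(40/16) = 3 blocks
print(cache.block_tables["req1"], len(cache.free_blocks))
```

The point of the sketch: a request holding 40 tokens occupies exactly 3 blocks, and a finished request returns its blocks to the pool immediately, so no memory sits idle in half-empty pre-allocations.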
Real-World Impact
From the vLLM paper and our own testing:
| Metric | Traditional | vLLM | Improvement |
|---|---|---|---|
| Throughput (tokens/s) | baseline | up to 24x baseline | up to 24x |
| Memory usage | 100% | ~60% | ~40% saved |
| Concurrent requests | 1-2 | 10-20 | ~10x |
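These gains hinge on how large the KV cache is. A rough per-token estimate for a Llama-2-7B-shaped model, using the published config values (32 layers, 32 KV heads, head dim 128, fp16), treating the result as an estimate:

```python
# Rough KV-cache size per token for a Llama-2-7B-shaped model (fp16).
# Shape values match the published Llama-2-7B config; treat as an estimate.
layers, kv_heads, head_dim, bytes_per_el = 32, 32, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_el  # K and V
print(f"{kv_per_token / 1024**2:.2f} MiB per token")

ctx = 4096
print(f"{kv_per_token * ctx / 1024**3:.1f} GiB for one full {ctx}-token context")
```

At roughly half a MiB per token, a single full 4096-token context already costs about 2 GiB of VRAM, which is why wasted KV-cache slots directly limit how many requests fit on a card.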
Deployment Guide
Environment Setup
```shell
# CUDA 12.1+ required
nvidia-smi  # confirm the GPU is visible
# Docker deployment recommended
docker pull nvidia/cuda:12.1.0-runtime-ubuntu22.04
```

Install vLLM
```shell
# Option 1: pip install
pip install vllm

# Option 2: build from source (slower, but gets the latest optimizations)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
```

Start API Server
```shell
# Minimal startup
vllm serve meta-llama/Llama-2-7b-chat-hf \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2  # only if multi-GPU

# Production config
vllm serve meta-llama/Llama-2-70b-chat-hf \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.92 \
    --max-model-len 8192 \
    --enforce-eager \
    --trust-remote-code
```

Usage Example
```shell
# OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    "max_tokens": 100
  }'
```

Python client:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # no API key required by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is vLLM?"},
    ],
    max_tokens=200,
    temperature=0.7,
)
print(response.choices[0].message.content)
```

Hardware Recommendations
Single Card (7B models)
```shell
# 7B runs fine on a single card
vllm serve meta-llama/Llama-2-7b-chat-hf \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096
```

Hardware: RTX 4090 (24GB) or A10G (24GB)
Multi-GPU (13B+ models)
```shell
# 13B: 2 GPUs recommended
vllm serve meta-llama/Llama-2-13b-chat-hf \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.85

# 70B: 4 GPUs required
vllm serve meta-llama/Llama-2-70b-chat-hf \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.8
```

70B Minimum Config
Minimum config: 4 x A100 40GB or equivalent. Note that 70B in fp16 needs ~140GB of VRAM, so it requires 4-GPU tensor parallelism or quantization.

Quantization: Run Big Models on Consumer GPUs
No 4x A100? Use quantization.
```shell
# GPTQ quantization (4-bit). The model id below is a placeholder; substitute
# a published GPTQ checkpoint (e.g. TheBloke/Llama-2-70B-Chat-GPTQ).
vllm serve meta-llama/Llama-2-70b-chat-hf-gptq \
    --quantization gptq \
    --dtype half
```

AWQ quantization has better quality but needs preprocessing:
```python
# Quantize first with AutoAWQ (one-off preprocessing step),
# then save the result, e.g. to "my-quantized-70b-model"
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
```

```shell
# Serve the quantized model
vllm serve my-quantized-70b-model \
    --quantization awq
```

In testing: 70B quantized to 4-bit runs on 2 x A100 80GB.
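The 2 x A100 80GB claim is plausible from weight size alone. Rough arithmetic (weights only; quantization scale/zero-point overhead, KV cache, and runtime buffers come on top):

```python
# Rough weight-memory arithmetic for a 70B-parameter model (illustrative).
params = 70e9
fp16_gb = params * 2 / 1e9    # 2 bytes per parameter
int4_gb = params * 0.5 / 1e9  # 4 bits per parameter, ignoring scales/zeros

print(f"fp16:  ~{fp16_gb:.0f} GB -> needs 4x A100 40GB just for weights")
print(f"4-bit: ~{int4_gb:.0f} GB -> fits 2x A100 80GB with room for KV cache")
```

The same arithmetic explains the ~140GB fp16 figure quoted above: 70 billion parameters times 2 bytes each.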
Common Issues
1. CUDA out of memory
```shell
# Lower gpu-memory-utilization
vllm serve model --gpu-memory-utilization 0.8
# or reduce max-model-len
vllm serve model --max-model-len 2048
```

2. Slow First Load
The first startup is slow because vLLM downloads and loads the model weights and warms up the engine; once the server is up, requests are fast. Separately, if many of your requests share a common prompt prefix (e.g. the same system prompt), --enable-prefix-caching lets vLLM reuse the cached KV blocks for that prefix:

```shell
python -m vllm.entrypoints.openai.api_server --enable-prefix-caching
```

3. Performance Drops Under Concurrent Load
```shell
# Adjust block size
vllm serve model --block-size 32
# or adjust max-num-batched-tokens
vllm serve model --max-num-batched-tokens 8192
```

vLLM vs Alternatives
| | vLLM | TGI | llama.cpp |
|---|---|---|---|
| Developer | UC Berkeley | HuggingFace | Georgi Gerganov |
| Optimization | PagedAttention | Multiple | CPU+GPU hybrid |
| Multi-GPU | Good (TP) | Good | Poor |
| Quantization | GPTQ/AWQ | GPTQ/AWQ/GGUF | GGUF (native) |
| Consumer GPU | Needs quant | Needs quant | Strong (CPU works) |
| Ease of use | Medium | Low | Low |
| Activity | High | High | High |
When to Use vLLM
Good for:
- GPU servers that need high-concurrency inference
- Self-hosting to keep data inside your own infrastructure
- Strict latency and throughput requirements
Not for:
- CPU only, no GPU → use llama.cpp
- Quickly testing a model → use a transformers pipeline
- Needing the fastest first response → vLLM has a cold start
Conclusion
vLLM is one of the best current solutions for serving open-source LLMs. PagedAttention delivers fundamental improvements in memory utilization and throughput.
Deployment recommendations:
- 7B model: single RTX 4090 + vLLM
- 13B model: 2x A10G or quantized single card
- 70B model: 4x A100 or quantized to 4-bit on 2 cards