vLLM in Action: How to Run Open-Source LLMs Efficiently on GPU Servers

Why vLLM Matters

Everyone who’s run open-source LLMs has hit this problem: the model loads into GPU memory, VRAM usage is absurdly high, yet GPU utilization stays low.

The problem with traditional LLM serving:

Input: "Hello, how are"
Token: [0, 1, 2, 3, ...] → [0, 1, 2, 3, 4, 5, ...]
     ↓
Attention Calculation
     ↓
Every token needs access to full KV cache
     ↓
KV cache pre-allocated as fixed-size contiguous blocks per request
     ↓
Massive memory fragmentation + low GPU utilization

Root cause: the traditional approach pre-allocates a fixed-length contiguous region of VRAM for each request’s KV cache. But actual generation lengths vary wildly across requests, so most of that reservation goes to waste.
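A back-of-envelope illustration of this waste (the request lengths here are made up for illustration; the 512-slot pre-allocation matches the diagram below):

```python
# Back-of-envelope illustration of pre-allocation waste.
# Numbers are hypothetical, chosen only to illustrate the mechanism.

PREALLOC_SLOTS = 512            # fixed KV slots reserved per request

# Hypothetical generation lengths for 4 concurrent requests
gen_lengths = [40, 120, 300, 25]

used = sum(gen_lengths)
reserved = PREALLOC_SLOTS * len(gen_lengths)
waste = 1 - used / reserved
print(f"slots reserved: {reserved}, actually used: {used}")
print(f"wasted: {waste:.0%}")   # most of the reserved VRAM sits idle
```

Even with one request generating 300 tokens, roughly three quarters of the reserved KV memory is never touched.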

vLLM’s Core: PagedAttention

vLLM was released in June 2023 by a research team at UC Berkeley. Its core innovation is PagedAttention, inspired by OS virtual memory paging.

How It Works

Traditional approach:

Request 1: [Token 1][Token 2][Token 3][Empty...Empty]  (pre-allocated 512 slots)
Request 2: [Token 1][Token 2][Empty...Empty...Empty]    (pre-allocated 512 slots)

PagedAttention:

KV Cache: [Block 0][Block 1][Block 2][Block 3][Block 4][Block 5][Block 6][Block 7]
          ───────────────────────────────────────────────────────────────
Request 1: P1    | P1    | P1    | P1    |
Request 2: P2    | P2    |       |       |

Each request’s KV cache is no longer one contiguous chunk but is scattered across multiple blocks, allocated on demand, much like pages in OS virtual memory.
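The mechanism can be sketched as a toy allocator: a shared pool of fixed-size blocks plus a per-request block table, analogous to an OS page table. This is a simplification, not vLLM’s actual implementation (though 16 tokens per block is vLLM’s default block size):

```python
# Toy sketch of PagedAttention-style allocation (not vLLM's real code):
# a shared block pool + a per-request "block table".

BLOCK_SIZE = 16  # tokens per block (vLLM's default)

class BlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}   # request id -> list of block ids (block table)
        self.lengths = {}  # request id -> tokens generated so far

    def append_token(self, req_id):
        """Reserve space for one more token; grab a new block only when
        the current one is full -- allocation is on-demand, not up-front."""
        n = self.lengths.get(req_id, 0)
        if n % BLOCK_SIZE == 0:  # current block full (or first token)
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1

pool = BlockPool(num_blocks=8)
for _ in range(20):              # request "A": 20 tokens -> 2 blocks
    pool.append_token("A")
for _ in range(5):               # request "B": 5 tokens -> 1 block
    pool.append_token("B")

print(pool.tables)               # blocks need not be contiguous per request
print(len(pool.free), "blocks still free for other requests")
```

The point: memory is committed one block at a time as generation proceeds, so short requests never hoard slots that long requests could use.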

Real-World Impact

From vLLM paper and our testing:

| Metric                | Traditional | vLLM      | Improvement |
| --------------------- | ----------- | --------- | ----------- |
| Throughput (tokens/s) | baseline    | up to 24x | ~24x        |
| Memory usage          | 100%        | ~60%      | ~40% saved  |
| Concurrent requests   | 1-2         | 10-20     | ~10x        |

Deployment Guide

Environment Setup

# CUDA 12.1+ required
nvidia-smi  # Confirm GPU available

# Docker deployment recommended
docker pull nvidia/cuda:12.1.0-runtime-ubuntu22.04

Install vLLM

# Option 1: pip install
pip install vllm

# Option 2: build from source (slower but latest optimizations)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .

Start API Server

# Minimal startup
vllm serve meta-llama/Llama-2-7b-chat-hf \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2  # only if multi-GPU

# Production config
# (--enforce-eager disables CUDA graphs: lower VRAM overhead, slightly slower)
vllm serve meta-llama/Llama-2-70b-chat-hf \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.92 \
    --max-model-len 8192 \
    --enforce-eager \
    --trust-remote-code

Usage Example

# OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    "max_tokens": 100
  }'

Python client:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"  # No API key by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is vLLM?"}
    ],
    max_tokens=200,
    temperature=0.7
)

print(response.choices[0].message.content)

Hardware Recommendations

Single Card (7B models)

# 7B runs fine on single card
vllm serve llama-2-7b \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096

Hardware: RTX 4090 (24GB) or A10G (24GB)
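A rough fit check behind this recommendation (weight memory only; activations and framework overhead are ignored for simplicity):

```python
# Rough fit check: 7B model on a 24 GB card (illustrative arithmetic).

params = 7e9
bytes_per_param = 2                            # fp16/bf16
weights_gb = params * bytes_per_param / 1e9    # ~14 GB of weights

vram_gb = 24
util = 0.9                                     # --gpu-memory-utilization 0.9
usable_gb = vram_gb * util                     # VRAM vLLM will manage
kv_budget_gb = usable_gb - weights_gb          # what's left for the KV cache
print(f"weights: {weights_gb:.0f} GB, KV cache budget: {kv_budget_gb:.1f} GB")
```

With ~14 GB of weights, a 24 GB card at 0.9 utilization leaves several GB for the KV cache, which is why a 7B model serves comfortably on a single RTX 4090 or A10G.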

Multi-GPU (13B+ models)

# 13B: 2 GPUs recommended
vllm serve llama-2-13b-chat-hf \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.85

# 70B: 4 GPUs required
vllm serve llama-2-70b-chat-hf \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.8

70B Minimum Config

# 4 x A100 40GB or equivalent
# Note: 70B fp16 needs ~140GB VRAM
# Requires 4-GPU tensor parallel or quantization

Quantization: Run Big Models on Consumer GPUs

No 4x A100? Use quantization.

# GPTQ quantization (Q4)
vllm serve meta-llama/Llama-2-70b-chat-hf-gptq \
    --quantization gptq \
    --dtype half

AWQ quantization has better quality but needs preprocessing:

# Step 1 (Python): quantize the model offline, e.g. with AutoAWQ:
#   from awq import AutoAWQForCausalLM
#   from transformers import AutoTokenizer
#   ... quantize and save the checkpoint ...

# Step 2 (shell): serve the saved quantized checkpoint
vllm serve my-quantized-70b-model \
    --quantization awq

In testing: 70B quantized to 4-bit runs on 2 x A100 80GB.
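The weight-memory arithmetic behind that result (rough figures; runtime overhead and the KV cache come on top of the weights):

```python
# Why 4-bit quantization makes 70B feasible on 2 GPUs (weight memory only).

params = 70e9
for name, bits in [("fp16", 16), ("int8", 8), ("int4 (GPTQ/AWQ)", 4)]:
    gb = params * bits / 8 / 1e9
    print(f"{name:>16}: ~{gb:.0f} GB of weights")
# fp16 ~140 GB -> needs 4x A100 40GB (tensor parallel)
# int4  ~35 GB -> fits 2x A100 80GB with room left for the KV cache
```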

Common Issues

1. CUDA out of memory

# Lower gpu-memory-utilization
vllm serve model --gpu-memory-utilization 0.8

# Or reduce max-model-len
vllm serve model --max-model-len 2048

2. Slow First Load

vLLM loads all model weights at server startup, so the first launch is slow (especially if weights are downloaded from the Hugging Face Hub); once the server is up, requests are fast. A related optimization: if many requests share the same prompt prefix (e.g. a common system prompt), enable prefix caching so the shared prefix’s KV cache is computed once and reused:

vllm serve model --enable-prefix-caching

3. Performance Drops Under Concurrent Load

# Adjust block size
vllm serve model --block-size 32

# Or adjust max-num-batched-tokens
vllm serve model --max-num-batched-tokens 8192
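To see how block size interacts with KV cache capacity, here is the rough arithmetic for a Llama-2-7B-scale model (32 layers, 32 KV heads, head dim 128; the 7B variant uses full multi-head attention, so KV heads equal attention heads). The KV budget figure is hypothetical:

```python
# How block size relates to KV cache capacity (illustrative numbers
# for a Llama-2-7B-like config; exact sizes depend on the model).

layers, kv_heads, head_dim = 32, 32, 128
bytes_per_elem = 2                       # fp16
# K and V per token, summed over all layers:
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # bytes

block_size = 32                          # matches --block-size 32
kv_budget_gb = 8                         # hypothetical free VRAM for KV
num_blocks = int(kv_budget_gb * 1e9 // (kv_per_token * block_size))
print(f"KV per token: {kv_per_token / 1024:.0f} KiB")
print(f"blocks available: {num_blocks} "
      f"({num_blocks * block_size} cached tokens total)")
```

Larger blocks mean fewer, coarser allocation units (slightly more internal fragmentation per request); `max-num-batched-tokens` then caps how many of those cached tokens can be processed in a single batch.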

vLLM vs Alternatives

|                   | vLLM           | TGI           | llama.cpp          |
| ----------------- | -------------- | ------------- | ------------------ |
| Developer         | UC Berkeley    | Hugging Face  | Georgi Gerganov    |
| Core optimization | PagedAttention | Multiple      | CPU+GPU hybrid     |
| Multi-GPU         | Good (TP)      | Good          | Poor               |
| Quantization      | GPTQ/AWQ       | GPTQ/AWQ/GGUF | GGUF (native)      |
| Consumer GPU      | Needs quant    | Needs quant   | Strong (CPU works) |
| Ease of use       | Medium         | Low           | Low                |
| Activity          | High           | High          | High               |

When to Use vLLM

Good for:

  • A GPU server and a need for high-concurrency inference
  • Self-hosting so data never leaves your infrastructure
  • Strict latency and throughput requirements

Not for:

  • CPU only, no GPU → use llama.cpp
  • Just quickly testing a model → use transformers pipeline
  • The fastest possible first response → vLLM has a cold-start cost

Conclusion

vLLM is currently one of the best solutions for serving open-source LLMs. PagedAttention delivers fundamental improvements in both memory utilization and throughput.

Deployment recommendations:

  • 7B model: single RTX 4090 + vLLM
  • 13B model: 2x A10G or quantized single card
  • 70B model: 4x A100 or quantized to 4-bit on 2 cards

Repo: github.com/vllm-project/vllm