vLLM in Action: How to Run Open-Source LLMs Efficiently on GPU Servers
Why vLLM Matters
Everyone who has run open-source LLMs has hit this problem: the model loads into GPU memory, VRAM usage is absurdly high, yet GPU utilization stays low.
The problem with traditional LLM serving:

```
Input: "Hello, how are"
Tokens: [0, 1, 2, 3, ...] → [0, 1, 2, 3, 4, 5, ...]
    ↓
Attention calculation
    ↓
Every token needs access to the full KV cache
    ↓
KV cache pre-allocated as fixed-size contiguous blocks per request
    ↓
Massive memory fragmentation + low GPU utilization
```

Root cause: the traditional approach pre-allocates a fixed-length contiguous chunk of VRAM for each request's KV cache, but actual generation lengths vary wildly across requests, causing massive waste.
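To make the waste concrete, here is a back-of-the-envelope sketch. The request lengths and the ~0.5 MB/token KV-cache figure are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope: VRAM wasted by fixed-size pre-allocation.
# All numbers below are illustrative assumptions, not measurements.
MAX_SLOTS = 512                      # slots pre-allocated per request
KV_BYTES_PER_TOKEN = 0.5 * 1024**2   # ~0.5 MiB/token (7B-class model, fp16)

actual_lengths = [40, 200, 75, 512, 12, 130]  # hypothetical generation lengths

reserved = len(actual_lengths) * MAX_SLOTS
used = sum(actual_lengths)
waste = 1 - used / reserved
print(f"slots used: {used}/{reserved} ({waste:.0%} wasted)")
print(f"VRAM wasted: {(reserved - used) * KV_BYTES_PER_TOKEN / 1024**3:.2f} GiB")
```

With these hypothetical lengths, roughly two thirds of the reserved KV-cache memory is never used.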
vLLM’s Core: PagedAttention
vLLM comes from a UC Berkeley research team and was released in June 2023. Its core innovation is PagedAttention, inspired by OS virtual memory paging.
How It Works
Traditional approach:

```
Request 1: [Token 1][Token 2][Token 3][Empty...Empty]  (pre-allocated 512 slots)
Request 2: [Token 1][Token 2][Empty...Empty...Empty]   (pre-allocated 512 slots)
```

PagedAttention:

```
KV Cache:  [Block 0][Block 1][Block 2][Block 3][Block 4][Block 5][Block 6][Block 7]
───────────────────────────────────────────────────────────────
Request 1:  P1     | P1     | P1     | P1     |
Request 2:  P2     | P2     |        |        |
```

Each request's KV cache is no longer a contiguous chunk but is scattered across multiple blocks, allocated on demand, just like pages in OS virtual memory.
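The bookkeeping can be sketched as a minimal block table. This is an illustrative toy, not vLLM's actual implementation; class and method names are made up:

```python
# Toy sketch of paged KV-cache allocation (not vLLM's real code).
# Each request gets a block table mapping logical blocks to physical blocks,
# drawn on demand from a shared free pool, like an OS page table.
class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # shared physical pool
        self.block_tables = {}  # request_id -> list of physical block ids
        self.lengths = {}       # request_id -> tokens stored so far

    def append_token(self, rid: str) -> None:
        """Reserve one token slot; allocate a new block only on a block boundary."""
        n = self.lengths.get(rid, 0)
        if n % self.block_size == 0:  # current block full (or first token)
            self.block_tables.setdefault(rid, []).append(self.free_blocks.pop())
        self.lengths[rid] = n + 1

    def free(self, rid: str) -> None:
        """Return all of a finished request's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(rid, []))
        self.lengths.pop(rid, None)

cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(40):
    cache.append_token("req1")  # 40 tokens -> ceil(40/16) = 3 blocks
print(cache.block_tables["req1"], len(cache.free_blocks))
```

The point of the sketch: a request holding 40 tokens occupies exactly 3 blocks, and a finished request returns its blocks to the pool immediately, so no memory sits idle in half-empty pre-allocations.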
Real-World Impact
From the vLLM paper and our own testing:
| Metric | Traditional | vLLM | Improvement |
|---|---|---|---|
| Throughput (tokens/s) | baseline | up to 24x baseline | up to 24x |
| Memory usage | 100% | ~60% | ~40% saved |
| Concurrent requests | 1-2 | 10-20 | ~10x |
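These gains hinge on how large the KV cache is. A rough per-token estimate for a Llama-2-7B-shaped model, using the published config values (32 layers, 32 KV heads, head dim 128, fp16), treating the result as an estimate:

```python
# Rough KV-cache size per token for a Llama-2-7B-shaped model (fp16).
# Shape values match the published Llama-2-7B config; treat as an estimate.
layers, kv_heads, head_dim, bytes_per_el = 32, 32, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_el  # K and V
print(f"{kv_per_token / 1024**2:.2f} MiB per token")

ctx = 4096
print(f"{kv_per_token * ctx / 1024**3:.1f} GiB for one full {ctx}-token context")
```

At roughly half a MiB per token, a single full 4096-token context already costs about 2 GiB of VRAM, which is why wasted KV-cache slots directly limit how many requests fit on a card.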
Deployment Guide
Environment Setup
```shell
# CUDA 12.1+ required
nvidia-smi  # confirm the GPU is visible
# Docker deployment recommended
docker pull nvidia/cuda:12.1.0-runtime-ubuntu22.04
```

Install vLLM
```shell
# Option 1: pip install
pip install vllm

# Option 2: build from source (slower, but gets the latest optimizations)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
```

Start API Server
```shell
# Minimal startup
vllm serve meta-llama/Llama-2-7b-chat-hf \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2  # only if multi-GPU

# Production config
vllm serve meta-llama/Llama-2-70b-chat-hf \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.92 \
    --max-model-len 8192 \
    --enforce-eager \
    --trust-remote-code
```

Usage Example
```shell
# OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    "max_tokens": 100
  }'
```

Python client:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # no API key required by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is vLLM?"},
    ],
    max_tokens=200,
    temperature=0.7,
)
print(response.choices[0].message.content)
```

Hardware Recommendations
Single Card (7B models)
```shell
# 7B runs fine on a single card
vllm serve meta-llama/Llama-2-7b-chat-hf \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096
```

Hardware: RTX 4090 (24GB) or A10G (24GB)
Multi-GPU (13B+ models)
```shell
# 13B: 2 GPUs recommended
vllm serve meta-llama/Llama-2-13b-chat-hf \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.85

# 70B: 4 GPUs required
vllm serve meta-llama/Llama-2-70b-chat-hf \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.8
```

70B Minimum Config
Minimum config: 4 x A100 40GB or equivalent. Note that 70B in fp16 needs ~140GB of VRAM, so it requires 4-GPU tensor parallelism or quantization.

Quantization: Run Big Models on Consumer GPUs
No 4x A100? Use quantization.
```shell
# GPTQ quantization (4-bit). The model id below is a placeholder; substitute
# a published GPTQ checkpoint (e.g. TheBloke/Llama-2-70B-Chat-GPTQ).
vllm serve meta-llama/Llama-2-70b-chat-hf-gptq \
    --quantization gptq \
    --dtype half
```

AWQ quantization has better quality but needs preprocessing:
```python
# Quantize first with AutoAWQ (one-off preprocessing step),
# then save the result, e.g. to "my-quantized-70b-model"
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
```

```shell
# Serve the quantized model
vllm serve my-quantized-70b-model \
    --quantization awq
```

In testing: 70B quantized to 4-bit runs on 2 x A100 80GB.
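The 2 x A100 80GB claim is plausible from weight size alone. Rough arithmetic (weights only; quantization scale/zero-point overhead, KV cache, and runtime buffers come on top):

```python
# Rough weight-memory arithmetic for a 70B-parameter model (illustrative).
params = 70e9
fp16_gb = params * 2 / 1e9    # 2 bytes per parameter
int4_gb = params * 0.5 / 1e9  # 4 bits per parameter, ignoring scales/zeros

print(f"fp16:  ~{fp16_gb:.0f} GB -> needs 4x A100 40GB just for weights")
print(f"4-bit: ~{int4_gb:.0f} GB -> fits 2x A100 80GB with room for KV cache")
```

The same arithmetic explains the ~140GB fp16 figure quoted above: 70 billion parameters times 2 bytes each.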
Common Issues
1. CUDA out of memory
```shell
# Lower gpu-memory-utilization
vllm serve model --gpu-memory-utilization 0.8
# or reduce max-model-len
vllm serve model --max-model-len 2048
```

2. Slow First Load
The first startup is slow because vLLM downloads and loads the model weights and warms up the engine; once the server is up, requests are fast. Separately, if many of your requests share a common prompt prefix (e.g. the same system prompt), --enable-prefix-caching lets vLLM reuse the cached KV blocks for that prefix:

```shell
python -m vllm.entrypoints.openai.api_server --enable-prefix-caching
```

3. Performance Drops Under Concurrent Load
```shell
# Adjust block size
vllm serve model --block-size 32
# or adjust max-num-batched-tokens
vllm serve model --max-num-batched-tokens 8192
```

vLLM vs Alternatives
| | vLLM | TGI | llama.cpp |
|---|---|---|---|
| Developer | UC Berkeley | HuggingFace | Georgi Gerganov |
| Optimization | PagedAttention | Multiple | CPU+GPU hybrid |
| Multi-GPU | Good (TP) | Good | Poor |
| Quantization | GPTQ/AWQ | GPTQ/AWQ/GGUF | GGUF (native) |
| Consumer GPU | Needs quant | Needs quant | Strong (CPU works) |
| Ease of use | Medium | Low | Low |
| Activity | High | High | High |
When to Use vLLM
Good for:
- GPU servers that need high-concurrency inference
- Self-hosting to keep data inside your own infrastructure
- Strict latency and throughput requirements
Not for:
- CPU only, no GPU → use llama.cpp
- Quickly testing a model → use a transformers pipeline
- Needing the fastest first response → vLLM has a cold start
Conclusion
vLLM is one of the best current solutions for serving open-source LLMs. PagedAttention delivers fundamental improvements in memory utilization and throughput.
Deployment recommendations:
- 7B model: single RTX 4090 + vLLM
- 13B model: 2x A10G or quantized single card
- 70B model: 4x A100 or quantized to 4-bit on 2 cards