Running LLMs Locally in 2023: Hardware Configs for Every Budget
One Thing Up Front
This article doesn’t hype any product—just real benchmark numbers. Different budgets run different models, and more expensive isn’t always better.
Real Benchmarks by Tier
Tier 1: Mac (Entry Level)
Config: MacBook Pro M1 16GB or Mac Studio M1 Max 64GB
# Models that actually run
ollama run llama2 # 7B model, smooth
ollama run codellama # coding-specialized 7B, barely
# After quantization (Q4)
ollama run llama2:7b-q4 # ~4GB memory usage, acceptable

| Metric | M1 16GB | M1 Max 64GB |
|---|---|---|
| llama2 7B inference speed | ~15 tokens/s | ~25 tokens/s |
| RAM usage | 14GB (full) | 20GB |
| Can run 13B? | ❌ Not enough RAM | ✅ Barely (quantized) |
Conclusion: sufficient for daily coding assistance, don’t expect too much.
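Whether a model fits in memory follows a simple rule of thumb: size ≈ parameter count × bytes per weight, plus runtime overhead for the KV cache and buffers. A minimal sketch; the 20% overhead factor is my assumption, not a measured figure:

```python
def model_memory_gb(params_billions: float, bits_per_weight: int,
                    overhead: float = 1.2) -> float:
    """Rough memory estimate for loading a model.

    params_billions: parameter count in billions (7, 13, 70, ...)
    bits_per_weight: 16 for FP16, 8 for Q8, ~4 for Q4 quantization
    overhead: fudge factor for KV cache / runtime buffers (assumption)
    """
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

print(f"7B  Q4:   {model_memory_gb(7, 4):.1f} GB")    # matches the ~4GB above
print(f"13B Q4:   {model_memory_gb(13, 4):.1f} GB")   # why 16GB Macs struggle
print(f"70B FP16: {model_memory_gb(70, 16):.0f} GB")  # datacenter territory
```

This is why quantization matters so much at the entry tier: Q4 cuts the footprint to a quarter of FP16.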
Tier 2: Gaming Laptop (Mainstream)
Config: RTX 4070 8GB or RTX 4080 12GB
# ollama detects NVIDIA GPUs automatically via CUDA, no extra flags needed
ollama run llama2 # runs at 30+ tokens/s on a 4070

| Metric | RTX 4070 8GB | RTX 4080 12GB |
|---|---|---|
| llama2 7B speed | ~35 tokens/s | ~50 tokens/s |
| llama2 13B speed | ~15 tokens/s | ~35 tokens/s |
| Can run 70B? | ❌ VRAM not enough | ❌ Still not enough |
| Power draw | ~200W | ~300W |
Conclusion: the RTX 4080 is the better value if the budget allows; the RTX 4070 is workable for 7B models.
Tier 3: Desktop (Enthusiast)
Config: RTX 3090 24GB or RTX 4090 24GB
# RTX 3090/4090 can run 70B models (quantized)
ollama run llama2:70b-q4
# Real speed
# RTX 4090 + 70B Q4: ~15 tokens/s

| Metric | RTX 3090 24GB | RTX 4090 24GB |
|---|---|---|
| llama2 70B Q4 speed | ~10 tokens/s | ~15 tokens/s |
| 70B Q4 VRAM usage | 20GB | 20GB |
| Power draw | ~350W | ~450W |
| Value | Medium | High (relatively) |
Conclusion: if you want to run 70B models, you need at least 24GB VRAM. The RTX 4090 is roughly 50% faster than the 3090; its peak draw is higher (~450W vs ~350W), but it does more work per watt.
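The efficiency claim is easy to check from the table's own numbers. Peak power is only a rough proxy for actual draw during inference, so treat this as a back-of-envelope comparison:

```python
# tokens/s and approximate peak power draw, taken from the table above
cards = {
    "RTX 3090": {"tokens_per_s": 10, "watts": 350},
    "RTX 4090": {"tokens_per_s": 15, "watts": 450},
}

for name, c in cards.items():
    # energy per generated token = watts / (tokens per second)
    joules_per_token = c["watts"] / c["tokens_per_s"]
    print(f"{name}: {joules_per_token:.0f} J/token")
# → RTX 3090: 35 J/token
# → RTX 4090: 30 J/token
```

So despite the higher peak draw, the 4090 spends less energy per token.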
Tier 4: Professional (Server)
Config: NVIDIA A100 40GB or A6000 48GB
# Datacenter cards; price not discussed (you all know)
# Note: full-precision (FP16) 70B needs ~140GB of VRAM, so even a
# 40-48GB card runs it quantized, just with far more headroom
ollama run llama2:70b
# Speed: ~60 tokens/s

This tier is for companies or teams; individual developers rarely need it.
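To make the tokens/s figures concrete across tiers, here is how long a typical detailed answer takes at each throughput. The 500-token answer length is my illustrative choice; the throughput numbers come from the benchmarks above:

```python
# tokens/s per tier, collected from the benchmark tables above
tiers = {
    "M1 16GB, 7B": 15,
    "RTX 4070, 7B": 35,
    "RTX 4090, 70B Q4": 15,
    "A100, 70B": 60,
}

ANSWER_TOKENS = 500  # length of a typical detailed answer (assumption)

for name, tps in tiers.items():
    print(f"{name}: ~{ANSWER_TOKENS / tps:.0f}s per answer")
# → M1 16GB, 7B: ~33s per answer
# → RTX 4070, 7B: ~14s per answer
# → RTX 4090, 70B Q4: ~33s per answer
# → A100, 70B: ~8s per answer
```

Note the symmetry: a 4090 running 70B feels about as fast as an entry Mac running 7B. You pay for model quality, not speed.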
Recommendations by Use Case
Scenario 1: Daily Coding Assistance (Budget $500-1000)
Recommended config: RTX 4070 8GB + 32GB RAM + i5/i7 CPU
Total: ~$800-1000
Can do:
- llama2 7B runs smoothly
- codellama 7B runs smoothly
- Run 13B with quantization
Cannot do:
- Run 70B models
- Efficiently process long documents

Real experience: a Mac Studio M2 Max 64GB is also an option at ~$4000. Pricier, but its low power draw and silence make it good for running continuously.
Scenario 2: Heavy Usage (Budget $2000-3000)
Recommended config: RTX 4080 12GB + 64GB RAM
Total: ~$2500
Can do:
- llama2 13B runs smoothly
- Run 70B quantized only with partial CPU offload (slow, single-digit tokens/s)
- Use as team shared inference server
Good for:
- Small team daily use
- Individual developers needing slightly larger models

Scenario 3: Professional Use (Budget $5000+)
Recommended config: RTX 4090 24GB + 128GB RAM
Total: ~$5500-6000
Can do:
- llama2 70B Q4 runs smoothly (~15 tokens/s)
- Higher-precision quantizations (Q5/Q6) also fit, though slower; full-precision 70B (~140GB) does not
- Use as small team main inference server

Practical Advice
Don’t buy more hardware than you need.
# Decision tree
if need to run 70B model:
→ At least RTX 3090 24GB (budget $5000+)
elif need to run 13B model:
→ RTX 4080 12GB (budget $2500+)
elif just daily coding assistance:
→ Mac Studio M2 Max (budget $4000)
→ Or RTX 4070 8GB (budget $1500)

Conclusion
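The decision tree above as a tiny function, in case you want to script it. The thresholds and budget figures are copied from the tree; this is a sketch, not a recommendation engine:

```python
def recommend(largest_model_b: int) -> str:
    """Map the largest model (in billions of parameters) you want
    to run locally to a hardware tier, per the decision tree above."""
    if largest_model_b >= 70:
        return "at least RTX 3090 24GB (budget $5000+)"
    if largest_model_b >= 13:
        return "RTX 4080 12GB (budget $2500+)"
    # daily coding assistance: 7B models are enough
    return "RTX 4070 8GB (budget $1500) or Mac Studio M2 Max ($4000)"

print(recommend(70))
print(recommend(7))
```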
Mid-2023 local LLM hardware choices:
- $0-500: Ollama on a Mac (M1/M2) you already own; 7B models are sufficient
- $1000-2000: RTX 4070 8GB, 13B model usable
- $2500-3500: RTX 4080 12GB, 13B runs smoothly; 70B only via slow CPU offload
- $5000+: RTX 4090 24GB or professional cards
Most important point: first clarify what model you want to run, then decide what hardware to buy. Overbuying for small models is wasteful; buying inadequate hardware that can’t run your models is worse.