
Running LLMs Locally in 2023: Hardware Configs for Every Budget

First Things First

This article doesn't hype any product; these are real benchmark numbers. Different budgets suit different models, and more expensive isn't always better.

Real Benchmarks by Tier

Tier 1: Mac (Entry Level)

Config: MacBook Pro M1 16GB or Mac Studio M1 Max 64GB

# Models that actually run
ollama run llama2          # 7B model, smooth
ollama run codellama       # coding-specialized 7B, barely

# After quantization (Q4)
ollama run llama2:7b-q4    # ~4GB of unified memory used, acceptable
Metric                    | M1 16GB            | M1 Max 64GB
llama2 7B inference speed | ~15 tokens/s       | ~25 tokens/s
RAM usage                 | 14GB (nearly full) | 20GB
Can run 13B?              | ❌ Not enough RAM   | ✅ Yes (quantized)

Conclusion: sufficient for daily coding assistance, don’t expect too much.
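As a rule of thumb, a model's memory footprint is roughly parameter count × bytes per weight, plus some runtime overhead. A minimal sketch (the ~20% overhead factor is an assumption for KV cache and buffers, not a measured constant):

```python
def est_model_mem_gb(params_billion: float, bits_per_weight: int,
                     overhead: float = 1.2) -> float:
    """Rough memory needed to run a model, in GB.

    overhead=1.2 is an assumed ~20% margin for KV cache and buffers.
    """
    bytes_per_weight = bits_per_weight / 8
    return params_billion * bytes_per_weight * overhead

# 7B at 4-bit quantization: ~4.2 GB, matching the ~4GB figure above
print(round(est_model_mem_gb(7, 4), 1))
# 7B at FP16: ~16.8 GB, which is why a 16GB Mac is "nearly full"
print(round(est_model_mem_gb(7, 16), 1))
```

The same arithmetic explains every tier below: quantizing from 16 bits to 4 bits cuts the footprint to a quarter.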

Tier 2: Gaming Laptop (Mainstream)

Config: RTX 4070 8GB or RTX 4080 12GB

# ollama detects NVIDIA GPUs automatically via CUDA; no extra flags needed
ollama run llama2          # runs at 30+ tokens/s on a 4070
Metric            | RTX 4070 8GB       | RTX 4080 12GB
llama2 7B speed   | ~35 tokens/s       | ~50 tokens/s
llama2 13B speed  | ~15 tokens/s       | ~35 tokens/s
Can run 70B?      | ❌ VRAM not enough  | ❌ Still not enough
Power draw        | ~200W              | ~300W

Conclusion: the RTX 4070 is workable, but the RTX 4080's extra 4GB of VRAM makes it the better value if your budget allows.
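The tokens/s figures in these tables can be reproduced from Ollama's own timing data: the final /api/generate response includes an `eval_count` (tokens generated) and an `eval_duration` (in nanoseconds). A small helper, assuming you have those two fields from a response:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed computed from Ollama's /api/generate response fields."""
    return eval_count / (eval_duration_ns / 1e9)

# Example: 512 tokens generated in 14.6 seconds -> ~35 tokens/s (RTX 4070 ballpark)
print(round(tokens_per_second(512, 14_600_000_000), 1))
```

Measuring this way (rather than with a stopwatch) excludes model-load and prompt-processing time, so it reflects steady-state generation speed.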

Tier 3: Desktop (Enthusiast)

Config: RTX 3090 24GB or RTX 4090 24GB

# RTX 3090/4090 can run 70B models (quantized)
ollama run llama2:70b-q4

# Real speed
# RTX 4090 + 70B Q4: ~15 tokens/s
Metric              | RTX 3090 24GB | RTX 4090 24GB
llama2 70B Q4 speed | ~10 tokens/s  | ~15 tokens/s
70B Q4 VRAM usage   | 20GB          | 20GB
Power draw          | ~350W         | ~450W
Value               | Medium        | High (relatively)

Conclusion: if you want to run 70B models, you need at least 24GB of VRAM (and since 70B Q4 weights still exceed 24GB, part of the model is offloaded to system RAM). The RTX 4090 is roughly 50% faster than the 3090; it draws more peak power, but is more energy-efficient per generated token.
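Peak wattage alone is misleading; dividing throughput by draw gives a rough energy-per-token comparison. Using the table's numbers, and assuming both cards run near their peak draw during inference:

```python
def tokens_per_kilojoule(tokens_per_s: float, watts: float) -> float:
    """Throughput normalized by power: tokens generated per kilojoule of energy."""
    return tokens_per_s / watts * 1000

rtx3090 = tokens_per_kilojoule(10, 350)
rtx4090 = tokens_per_kilojoule(15, 450)
print(round(rtx3090, 1), round(rtx4090, 1))  # 3090 vs 4090, tokens per kJ
```

So even though the 4090 draws about 100W more, it finishes the same work on less total energy.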

Tier 4: Professional (Server)

Config: NVIDIA A100 40GB or A6000 48GB

# Datacenter cards, not discussing price (you all know)
ollama run llama2:70b      # default quantization; full-precision 70B (~140GB at FP16) needs multiple cards
# Speed: ~60 tokens/s

This tier is for companies or teams—individual developers rarely need this.

Recommendations by Use Case

Scenario 1: Daily Coding Assistance (Budget $500-1000)

Recommended config: RTX 4070 8GB + 32GB RAM + i5/i7 CPU

Total: ~$800-1000

Can do:
- llama2 7B runs smoothly
- codellama 7B runs smoothly
- Run 13B with quantization

Cannot do:
- Run 70B models
- Efficiently process long documents
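The "long documents" limit is mostly KV-cache memory: attention keys and values are stored for every layer and every context token, so the cache grows linearly with context length on top of the model weights. A sketch using llama2 7B's published shape (32 layers, 4096 hidden dim; FP16 cache and full-attention KV, i.e. no grouped-query sharing, are assumptions):

```python
def kv_cache_gb(n_layers: int, hidden_dim: int, ctx_tokens: int,
                bytes_per_val: int = 2) -> float:
    """KV-cache size: keys + values for every layer and context position (FP16 assumed)."""
    return 2 * n_layers * hidden_dim * ctx_tokens * bytes_per_val / 1e9

# llama2 7B at its full 4096-token context: ~2.1 GB on top of the weights
print(round(kv_cache_gb(32, 4096, 4096), 1))
```

On an 8GB card that already holds ~4GB of quantized weights, a couple of GB of KV cache is the difference between fitting and spilling.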

Real experience: the Mac Studio M2 Max 64GB is also an option at ~$4000. It blows this budget, but it draws less power and runs silently, which suits running continuously.

Scenario 2: Heavy Usage (Budget $2000-3000)

Recommended config: RTX 4080 12GB + 64GB RAM

Total: ~$2500

Can do:
- llama2 13B runs smoothly
- Run 70B quantized only with heavy CPU offload (slow; a few tokens/s at best)
- Use as team shared inference server

Good for:
- Small team daily use
- Individual developers needing slightly larger models

Scenario 3: Professional Use (Budget $5000+)

Recommended config: RTX 4090 24GB + 128GB RAM

Total: ~$5500-6000

Can do:
- llama2 70B Q4 runs smoothly (~15 tokens/s)
- Can also run higher-precision quantizations of 70B (e.g. Q8) with CPU offload, though much slower
- Use as small team main inference server
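Since 70B Q4 weights (70B parameters × ~0.5 bytes ≈ 35-40GB) exceed 24GB of VRAM, runtimes like llama.cpp split the model's layers between GPU and CPU. A rough estimate of how many of a 70B model's 80 layers fit on the card (uniform layer size is an assumption; embedding and output weights are ignored):

```python
def gpu_layers(model_gb: float, n_layers: int, vram_budget_gb: float) -> int:
    """How many transformer layers fit in the VRAM budget, assuming uniform layer size."""
    per_layer_gb = model_gb / n_layers
    return min(n_layers, int(vram_budget_gb / per_layer_gb))

# 70B Q4 (~38GB, 80 layers) with ~20GB of a 24GB card left for weights:
print(gpu_layers(38, 80, 20))  # roughly half the layers on GPU, the rest in system RAM
```

The more layers land on the GPU, the closer you get to the ~15 tokens/s figure; a split that leaves most layers on the CPU drops speed sharply.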

Practical Advice

Don’t buy more hardware than you need.

# Decision tree (runnable sketch)
def recommend(target_model: str) -> str:
    if target_model == "70B":
        return "At least RTX 3090 24GB (budget $5000+)"
    elif target_model == "13B":
        return "RTX 4080 12GB (budget $2500+)"
    else:  # just daily coding assistance (7B)
        return "Mac Studio M2 Max (budget $4000) or RTX 4070 8GB (budget $1500)"

Conclusion

Mid-2023 local LLM hardware choices:

  • $0-500: Ollama on a Mac (M1/M2) you already own; a 7B model is sufficient
  • $1000-2000: RTX 4070 8GB; 13B usable with quantization
  • $2500-3500: RTX 4080 12GB; 13B runs smoothly, 70B only with heavy CPU offload
  • $5000+: RTX 4090 24GB or professional cards for 70B quantized

Most important point: first clarify what model you want to run, then decide what hardware to buy. Overbuying for small models is wasteful; buying inadequate hardware that can’t run your models is worse.