Running LLMs Locally in 2023: Hardware Configs for Every Budget
One Thing Up Front
This article doesn’t hype any product—just real benchmark numbers. Different budgets run different models, and more expensive isn’t always better.
Real Benchmarks by Tier
Tier 1: Mac (Entry Level)
Config: MacBook Pro M1 16GB or Mac Studio M1 Max 64GB
# Models that actually run
ollama run llama2 # 7B model, smooth
ollama run codellama # coding-specialized 7B, barely
# After quantization (Q4)
ollama run llama2:7b-q4 # ~4GB memory usage, acceptable

| Metric | M1 16GB | M1 Max 64GB |
|---|---|---|
| llama2 7B inference speed | ~15 tokens/s | ~25 tokens/s |
| RAM usage | 14GB (full) | 20GB |
| Can run 13B? | ❌ Not enough RAM | ✅ Barely (quantized) |
Conclusion: sufficient for daily coding assistance, don’t expect too much.
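Whether a model fits in memory follows a simple rule of thumb: size ≈ parameter count × bytes per weight, plus runtime overhead for the KV cache and buffers. A minimal sketch; the 20% overhead factor is my assumption, not a measured figure:

```python
def model_memory_gb(params_billions: float, bits_per_weight: int,
                    overhead: float = 1.2) -> float:
    """Rough memory estimate for loading a model.

    params_billions: parameter count in billions (7, 13, 70, ...)
    bits_per_weight: 16 for FP16, 8 for Q8, ~4 for Q4 quantization
    overhead: fudge factor for KV cache / runtime buffers (assumption)
    """
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

print(f"7B  Q4:   {model_memory_gb(7, 4):.1f} GB")    # matches the ~4GB above
print(f"13B Q4:   {model_memory_gb(13, 4):.1f} GB")   # why 16GB Macs struggle
print(f"70B FP16: {model_memory_gb(70, 16):.0f} GB")  # datacenter territory
```

This is why quantization matters so much at the entry tier: Q4 cuts the footprint to a quarter of FP16.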
Tier 2: Gaming Laptop (Mainstream)
Config: RTX 4070 8GB or RTX 4080 12GB
# ollama detects NVIDIA GPUs automatically via CUDA, no extra flags needed
ollama run llama2 # runs at 30+ tokens/s on a 4070

| Metric | RTX 4070 8GB | RTX 4080 12GB |
|---|---|---|
| llama2 7B speed | ~35 tokens/s | ~50 tokens/s |
| llama2 13B speed | ~15 tokens/s | ~35 tokens/s |
| Can run 70B? | ❌ VRAM not enough | ❌ Still not enough |
| Power draw | ~200W | ~300W |
Conclusion: the RTX 4080 is the better value if the budget allows; the RTX 4070 is workable for 7B models.
Tier 3: Desktop (Enthusiast)
Config: RTX 3090 24GB or RTX 4090 24GB
# RTX 3090/4090 can run 70B models (quantized)
ollama run llama2:70b-q4
# Real speed
# RTX 4090 + 70B Q4: ~15 tokens/s

| Metric | RTX 3090 24GB | RTX 4090 24GB |
|---|---|---|
| llama2 70B Q4 speed | ~10 tokens/s | ~15 tokens/s |
| 70B Q4 VRAM usage | 20GB | 20GB |
| Power draw | ~350W | ~450W |
| Value | Medium | High (relatively) |
Conclusion: if you want to run 70B models, you need at least 24GB VRAM. The RTX 4090 is roughly 50% faster than the 3090; its peak draw is higher (~450W vs ~350W), but it does more work per watt.
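The efficiency claim is easy to check from the table's own numbers. Peak power is only a rough proxy for actual draw during inference, so treat this as a back-of-envelope comparison:

```python
# tokens/s and approximate peak power draw, taken from the table above
cards = {
    "RTX 3090": {"tokens_per_s": 10, "watts": 350},
    "RTX 4090": {"tokens_per_s": 15, "watts": 450},
}

for name, c in cards.items():
    # energy per generated token = watts / (tokens per second)
    joules_per_token = c["watts"] / c["tokens_per_s"]
    print(f"{name}: {joules_per_token:.0f} J/token")
# → RTX 3090: 35 J/token
# → RTX 4090: 30 J/token
```

So despite the higher peak draw, the 4090 spends less energy per token.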
Tier 4: Professional (Server)
Config: NVIDIA A100 40GB or A6000 48GB
# Datacenter cards; price not discussed (you all know)
# Note: full-precision (FP16) 70B needs ~140GB of VRAM, so even a
# 40-48GB card runs it quantized, just with far more headroom
ollama run llama2:70b
# Speed: ~60 tokens/s

This tier is for companies or teams; individual developers rarely need it.
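To make the tokens/s figures concrete across tiers, here is how long a typical detailed answer takes at each throughput. The 500-token answer length is my illustrative choice; the throughput numbers come from the benchmarks above:

```python
# tokens/s per tier, collected from the benchmark tables above
tiers = {
    "M1 16GB, 7B": 15,
    "RTX 4070, 7B": 35,
    "RTX 4090, 70B Q4": 15,
    "A100, 70B": 60,
}

ANSWER_TOKENS = 500  # length of a typical detailed answer (assumption)

for name, tps in tiers.items():
    print(f"{name}: ~{ANSWER_TOKENS / tps:.0f}s per answer")
# → M1 16GB, 7B: ~33s per answer
# → RTX 4070, 7B: ~14s per answer
# → RTX 4090, 70B Q4: ~33s per answer
# → A100, 70B: ~8s per answer
```

Note the symmetry: a 4090 running 70B feels about as fast as an entry Mac running 7B. You pay for model quality, not speed.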
Recommendations by Use Case
Scenario 1: Daily Coding Assistance (Budget $500-1000)
Recommended config: RTX 4070 8GB + 32GB RAM + i5/i7 CPU
Total: ~$800-1000
Can do:
- llama2 7B runs smoothly
- codellama 7B runs smoothly
- Run 13B with quantization
Cannot do:
- Run 70B models
- Efficiently process long documents

Real experience: a Mac Studio M2 Max 64GB is also an option at ~$4000. Pricier, but its low power draw and silence make it good for running continuously.
Scenario 2: Heavy Usage (Budget $2000-3000)
Recommended config: RTX 4080 12GB + 64GB RAM
Total: ~$2500
Can do:
- llama2 13B runs smoothly
- Run 70B quantized only with partial CPU offload (slow, single-digit tokens/s)
- Use as team shared inference server
Good for:
- Small team daily use
- Individual developers needing slightly larger models

Scenario 3: Professional Use (Budget $5000+)
Recommended config: RTX 4090 24GB + 128GB RAM
Total: ~$5500-6000
Can do:
- llama2 70B Q4 runs smoothly (~15 tokens/s)
- Higher-precision quantizations (Q5/Q6) also fit, though slower; full-precision 70B (~140GB) does not
- Use as small team main inference server

Practical Advice
Don’t buy more hardware than you need.
# Decision tree
if need to run 70B model:
→ At least RTX 3090 24GB (budget $5000+)
elif need to run 13B model:
→ RTX 4080 12GB (budget $2500+)
elif just daily coding assistance:
→ Mac Studio M2 Max (budget $4000)
→ Or RTX 4070 8GB (budget $1500)

Conclusion
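The decision tree above as a tiny function, in case you want to script it. The thresholds and budget figures are copied from the tree; this is a sketch, not a recommendation engine:

```python
def recommend(largest_model_b: int) -> str:
    """Map the largest model (in billions of parameters) you want
    to run locally to a hardware tier, per the decision tree above."""
    if largest_model_b >= 70:
        return "at least RTX 3090 24GB (budget $5000+)"
    if largest_model_b >= 13:
        return "RTX 4080 12GB (budget $2500+)"
    # daily coding assistance: 7B models are enough
    return "RTX 4070 8GB (budget $1500) or Mac Studio M2 Max ($4000)"

print(recommend(70))
print(recommend(7))
```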
Mid-2023 local LLM hardware choices:
- $0-500: Ollama on a Mac (M1/M2) you already own; 7B models are sufficient
- $1000-2000: RTX 4070 8GB, 13B model usable
- $2500-3500: RTX 4080 12GB, 13B runs smoothly; 70B only via slow CPU offload
- $5000+: RTX 4090 24GB or professional cards
Most important point: first clarify what model you want to run, then decide what hardware to buy. Overbuying for small models is wasteful; buying inadequate hardware that can’t run your models is worse.