# Ollama in Practice: Running GPT-Level Models on Your Mac
## Why Run Locally
API bills from OpenAI get ridiculous: GPT-4o usage adds up fast, and your data passes through a third party.
Running LLMs locally:
- Cost: $0 inference, just electricity
- Privacy: data never leaves your machine
- Offline: works without internet
- Control: model and parameters completely yours
But historically, running LLMs locally meant NVIDIA drivers, CUDA toolkits, quantization tools, and fiddly configuration at every step.
Ollama solves exactly this.
## Installation
Mac:

```shell
brew install ollama
```

Linux:

```shell
curl -fsSL https://ollama.com/install.sh | sh
```

Windows: download from ollama.com/download.

After install, just run:

```shell
ollama run llama3.2

# or other models
ollama run mistral
ollama run phi3
```

Starts in 30 seconds. Zero configuration required.
## Choosing a Model
Ollama library has hundreds of models. Start with these:
| Model | Params | Min RAM | Best For |
|---|---|---|---|
| llama3.2 | 3B | 8GB | Daily conversation, quick tasks |
| mistral | 7B | 16GB | Coding workhorse, general tasks |
| llama3.1 | 8B | 16GB | Slightly stronger than Mistral |
| codellama | 7B | 16GB | Coding-specific |
| phi3 | 3.8B | 8GB | Ultra-lightweight, CPU-capable |
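As a quick way to apply the table, here is a small sketch; the model names and minimum-RAM figures come from the table above, while the helper itself is hypothetical, not part of Ollama:

```python
# Minimum RAM (GB) per model, copied from the table above.
STARTER_MODELS = {
    "llama3.2": 8,    # 3B
    "phi3": 8,        # 3.8B
    "mistral": 16,    # 7B
    "llama3.1": 16,   # 8B
    "codellama": 16,  # 7B
}

def models_for(ram_gb: int) -> list[str]:
    """Return the starter models whose minimum RAM fits the given budget."""
    return sorted(m for m, need in STARTER_MODELS.items() if need <= ram_gb)

print(models_for(8))   # → ['llama3.2', 'phi3']
```

An 8GB machine gets the two small models; 16GB unlocks the whole table.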
My Mac Studio (M2 Max) setup:

```shell
# Can run these
ollama run mistral    # Smooth
ollama run codellama  # Smooth, recommended for coding
ollama run llama3.2   # Extremely smooth
```

## Coding Benchmark: Codellama vs GPT-4
I tested Codellama 7B on a real coding task: adding complete CRUD endpoints to a FastAPI project.
### Codellama Performance
```shell
ollama run codellama
# Input:
# Write complete user CRUD for this FastAPI project:
# - GET /users/{id}
# - POST /users
# - PUT /users/{id}
# - DELETE /users/{id}
# Use SQLAlchemy, async style, include Pydantic models
```

Codellama 7B output complete code:
```python
# Correctly generated:
# 1. Pydantic models (UserCreate, UserUpdate, UserResponse)
# 2. SQLAlchemy model
# 3. CRUD functions (get_user, create_user, etc.)
# 4. Router with all endpoints

@router.get("/{user_id}", response_model=UserResponse)
async def read_user(user_id: int, db: Session = Depends(get_db)):
    user = get_user(db, user_id=user_id)
    if user is None:
        raise HTTPException(status_code=404, detail="User not found")
    return user
```

The quality is below GPT-4, but it is usable. For simple-to-medium complexity tasks, Codellama gets the job done.
## When to Use Local
Local isn’t always the answer. Decision framework:
Good for local:
- Simple to medium complexity tasks
- Fast iteration (saving API costs)
- Public codebase, no sensitive data
- Offline environments
Not good for local:
- Complex architectural decisions (limited reasoning)
- Requires latest knowledge (fixed knowledge cutoff)
- Long text processing (context window limits)
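The decision framework above can be encoded as a tiny dispatcher. This is a sketch only; the `Task` fields, the context-limit threshold, and the privacy-first tie-break are my assumptions, not part of Ollama:

```python
from dataclasses import dataclass

@dataclass
class Task:
    complexity: str                      # "simple" | "medium" | "complex"
    sensitive: bool = False              # data that must not leave the machine
    needs_recent_knowledge: bool = False # post-cutoff facts required
    context_tokens: int = 0              # rough input size

def choose_backend(task: Task, local_context_limit: int = 8_000) -> str:
    """Apply the framework above: local where it shines, API otherwise."""
    if task.sensitive:
        return "local"  # assumption: privacy outranks capability
    if (task.complexity == "complex"
            or task.needs_recent_knowledge
            or task.context_tokens > local_context_limit):
        return "api"
    return "local"

print(choose_backend(Task(complexity="simple")))                   # → local
print(choose_backend(Task(complexity="complex")))                  # → api
print(choose_backend(Task(complexity="complex", sensitive=True)))  # → local
```

Note the last case: when a task is both complex and sensitive, this sketch keeps it local, matching the "sensitive data stays local" rule.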
## API Access: Integrating with Existing Code
Once Ollama is running, use it as an API server:

```shell
ollama serve
```

Then call it with any OpenAI-compatible client:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # dummy key; Ollama ignores it
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "user", "content": "Explain FastAPI dependency injection"}
    ]
)
print(response.choices[0].message.content)
```

This means LangChain, Dify, or any other OpenAI-compatible app can switch to local models seamlessly.
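If you'd rather not pull in the `openai` package, Ollama also exposes a native REST API on the same port (`POST /api/chat`). A stdlib-only sketch of the request it expects; the helper name is mine:

```python
import json
from urllib import request

def ollama_chat(prompt: str, model: str = "llama3.2",
                url: str = "http://localhost:11434/api/chat") -> request.Request:
    """Build a request for Ollama's native chat endpoint (streaming disabled)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode()
    return request.Request(url, data=body,
                           headers={"Content-Type": "application/json"})

req = ollama_chat("Explain FastAPI dependency injection")
# resp = request.urlopen(req)  # requires `ollama serve` running locally
# print(json.loads(resp.read())["message"]["content"])
```

The actual call is commented out because it needs a running Ollama server.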
## Advanced: Custom Model Files
Ollama supports custom model configuration:
```
# Modelfile
FROM mistral
PARAMETER temperature 0.7
PARAMETER top_p 0.9
SYSTEM """
You are a professional Python programmer.
Only output code, no explanations.
Code must follow PEP8.
"""
```

```shell
# Build custom model
ollama create my-coder -f Modelfile

# Use it
ollama run my-coder
```

## Quantization: Reducing Memory Footprint
7B models typically need 14GB+ of RAM at fp16. Quantized to 4-bit: ~4GB.

Ollama's models are quantized by default, and the library offers multiple levels (q4_0, q4_1, q5_0, q5_1, q8_0, etc.).

To pick a quantization level manually, choose it via the model tag:

```shell
ollama run mistral:7b-instruct-q4_0
```

Real numbers: q4 quantization cut the model from 14GB to 4.1GB. The quality loss is roughly 5-10%, negligible for most tasks.
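Those numbers follow from simple arithmetic: weights-only memory is parameter count times bytes per weight; runtime overhead (KV cache, quantization metadata) comes on top. A quick sanity check:

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB: params * bits / 8 (ignores KV cache/overhead)."""
    return params_billion * bits_per_weight / 8

print(weights_gb(7, 16))  # fp16 7B → 14.0 GB
print(weights_gb(7, 4))   # q4 7B   → 3.5 GB (plus metadata ≈ the 4.1GB observed)
```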
## Conclusion
Ollama is the easiest way to run LLMs locally:
- Install: one command
- Run: one command
- API: OpenAI-compatible
- Cost: $0
My actual usage:
- Codellama 7B for daily coding (good enough for most tasks)
- GPT-4 for complex architecture decisions and code review
- Local for sensitive data projects
Mac Studio + Ollama is enough for a private coding assistant.
Website: ollama.com