# Ollama in Practice: Running GPT-Level Models on Your Mac
## Why Run Locally
API bills from OpenAI get ridiculous: GPT-4o usage adds up fast, and your data passes through a third party.
Running LLMs locally:
- Cost: $0 inference, just electricity
- Privacy: data never leaves your machine
- Offline: works without internet
- Control: model and parameters completely yours
But historically, running LLMs locally meant NVIDIA drivers, CUDA toolkits, quantization tools, and fiddly configuration at every step.
Ollama solves exactly this.
## Installation
Mac:

```shell
brew install ollama
```

Linux:

```shell
curl -fsSL https://ollama.com/install.sh | sh
```

Windows: download from ollama.com/download.

After install, just run:

```shell
ollama run llama3.2

# or other models
ollama run mistral
ollama run phi3
```

Starts in 30 seconds. Zero configuration required.
## Choosing a Model
Ollama library has hundreds of models. Start with these:
| Model | Params | Min RAM | Best For |
|---|---|---|---|
| llama3.2 | 3B | 8GB | Daily conversation, quick tasks |
| mistral | 7B | 16GB | Coding workhorse, general tasks |
| llama3.1 | 8B | 16GB | Slightly stronger than Mistral |
| codellama | 7B | 16GB | Coding-specific |
| phi3 | 3.8B | 8GB | Ultra-lightweight, CPU-capable |
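As a quick way to apply the table, here is a small sketch; the model names and minimum-RAM figures come from the table above, while the helper itself is hypothetical, not part of Ollama:

```python
# Minimum RAM (GB) per model, copied from the table above.
STARTER_MODELS = {
    "llama3.2": 8,    # 3B
    "phi3": 8,        # 3.8B
    "mistral": 16,    # 7B
    "llama3.1": 16,   # 8B
    "codellama": 16,  # 7B
}

def models_for(ram_gb: int) -> list[str]:
    """Return the starter models whose minimum RAM fits the given budget."""
    return sorted(m for m, need in STARTER_MODELS.items() if need <= ram_gb)

print(models_for(8))   # → ['llama3.2', 'phi3']
```

An 8GB machine gets the two small models; 16GB unlocks the whole table.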
My Mac Studio (M2 Max) setup:

```shell
# Can run these
ollama run mistral    # Smooth
ollama run codellama  # Smooth, recommended for coding
ollama run llama3.2   # Extremely smooth
```

## Coding Benchmark: Codellama vs GPT-4
I tested Codellama 7B on a real coding task: adding complete CRUD endpoints to a FastAPI project.
### Codellama Performance
```shell
ollama run codellama
# Input:
# Write complete user CRUD for this FastAPI project:
# - GET /users/{id}
# - POST /users
# - PUT /users/{id}
# - DELETE /users/{id}
# Use SQLAlchemy, async style, include Pydantic models
```

Codellama 7B output complete code:
```python
# Correctly generated:
# 1. Pydantic models (UserCreate, UserUpdate, UserResponse)
# 2. SQLAlchemy model
# 3. CRUD functions (get_user, create_user, etc.)
# 4. Router with all endpoints

@router.get("/{user_id}", response_model=UserResponse)
async def read_user(user_id: int, db: Session = Depends(get_db)):
    user = get_user(db, user_id=user_id)
    if user is None:
        raise HTTPException(status_code=404, detail="User not found")
    return user
```

The quality is below GPT-4, but it is usable. For simple-to-medium complexity tasks, Codellama gets the job done.
## When to Use Local
Local isn’t always the answer. Decision framework:
Good for local:
- Simple to medium complexity tasks
- Fast iteration (saving API costs)
- Public codebase, no sensitive data
- Offline environments
Not good for local:
- Complex architectural decisions (limited reasoning)
- Requires latest knowledge (fixed knowledge cutoff)
- Long text processing (context window limits)
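The decision framework above can be encoded as a tiny dispatcher. This is a sketch only; the `Task` fields, the context-limit threshold, and the privacy-first tie-break are my assumptions, not part of Ollama:

```python
from dataclasses import dataclass

@dataclass
class Task:
    complexity: str                      # "simple" | "medium" | "complex"
    sensitive: bool = False              # data that must not leave the machine
    needs_recent_knowledge: bool = False # post-cutoff facts required
    context_tokens: int = 0              # rough input size

def choose_backend(task: Task, local_context_limit: int = 8_000) -> str:
    """Apply the framework above: local where it shines, API otherwise."""
    if task.sensitive:
        return "local"  # assumption: privacy outranks capability
    if (task.complexity == "complex"
            or task.needs_recent_knowledge
            or task.context_tokens > local_context_limit):
        return "api"
    return "local"

print(choose_backend(Task(complexity="simple")))                   # → local
print(choose_backend(Task(complexity="complex")))                  # → api
print(choose_backend(Task(complexity="complex", sensitive=True)))  # → local
```

Note the last case: when a task is both complex and sensitive, this sketch keeps it local, matching the "sensitive data stays local" rule.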
## API Access: Integrating with Existing Code
Once Ollama is running, use it as an API server:

```shell
ollama serve
```

Then call it with any OpenAI-compatible client:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # dummy key; Ollama ignores it
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "user", "content": "Explain FastAPI dependency injection"}
    ]
)
print(response.choices[0].message.content)
```

This means LangChain, Dify, or any other OpenAI-compatible app can switch to local models seamlessly.
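If you'd rather not pull in the `openai` package, Ollama also exposes a native REST API on the same port (`POST /api/chat`). A stdlib-only sketch of the request it expects; the helper name is mine:

```python
import json
from urllib import request

def ollama_chat(prompt: str, model: str = "llama3.2",
                url: str = "http://localhost:11434/api/chat") -> request.Request:
    """Build a request for Ollama's native chat endpoint (streaming disabled)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode()
    return request.Request(url, data=body,
                           headers={"Content-Type": "application/json"})

req = ollama_chat("Explain FastAPI dependency injection")
# resp = request.urlopen(req)  # requires `ollama serve` running locally
# print(json.loads(resp.read())["message"]["content"])
```

The actual call is commented out because it needs a running Ollama server.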
## Advanced: Custom Model Files
Ollama supports custom model configuration:
```
# Modelfile
FROM mistral
PARAMETER temperature 0.7
PARAMETER top_p 0.9
SYSTEM """
You are a professional Python programmer.
Only output code, no explanations.
Code must follow PEP8.
"""
```

```shell
# Build custom model
ollama create my-coder -f Modelfile

# Use it
ollama run my-coder
```

## Quantization: Reducing Memory Footprint
7B models typically need 14GB+ of RAM at fp16. Quantized to 4-bit: ~4GB.

Ollama's models are quantized by default, and the library offers multiple levels (q4_0, q4_1, q5_0, q5_1, q8_0, etc.).

To pick a quantization level manually, choose it via the model tag:

```shell
ollama run mistral:7b-instruct-q4_0
```

Real numbers: q4 quantization cut the model from 14GB to 4.1GB. The quality loss is roughly 5-10%, negligible for most tasks.
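Those numbers follow from simple arithmetic: weights-only memory is parameter count times bytes per weight; runtime overhead (KV cache, quantization metadata) comes on top. A quick sanity check:

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB: params * bits / 8 (ignores KV cache/overhead)."""
    return params_billion * bits_per_weight / 8

print(weights_gb(7, 16))  # fp16 7B → 14.0 GB
print(weights_gb(7, 4))   # q4 7B   → 3.5 GB (plus metadata ≈ the 4.1GB observed)
```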
## Conclusion
Ollama is the easiest way to run LLMs locally:
- Install: one command
- Run: one command
- API: OpenAI-compatible
- Cost: $0
My actual usage:
- Codellama 7B for daily coding (good enough for most tasks)
- GPT-4 for complex architecture decisions and code review
- Local for sensitive data projects
Mac Studio + Ollama is enough for a private coding assistant.
Website: ollama.com