Run LLMs Locally

Running an LLM locally is useful in fewer situations than the internet makes it sound. But in the right setup, it is genuinely valuable.
I would consider local inference first in these cases:
- internal tools that touch source code, tickets, docs, or notes you do not want to ship to a hosted API by default
- workflows that need predictable cost instead of per-token billing surprises
- offline or poor-network environments
- low-latency assistant features where network round trips are more annoying than model quality gaps
- evaluation and prototyping, where you want to swap models fast and inspect behavior directly
I would not default to local models for every use case. If the task needs the best possible reasoning, very long context, strong multilingual quality, or production-grade tool use with minimal babysitting, hosted models are still often the simpler answer.
Local inference is a systems decision, not a vibe
Most disappointment with local LLMs comes from treating them like a model catalog problem. It is usually a systems problem instead:
- What latency is acceptable?
- Is the workload interactive chat, batch summarization, code completion, or extraction?
- Do you need one user, ten users, or a whole team?
- Can you tolerate occasional hallucinations if the data never leaves your machine?
- Is the bottleneck memory, not compute?
Those questions matter more than which model family is currently fashionable.
The hardware trade-off that actually matters: memory
For local inference, memory is usually the constraint you feel first.
A practical mental model:
- Small models fit easily and respond quickly, but hit a ceiling sooner on reasoning and instruction fidelity.
- Mid-sized models are the sweet spot for many laptops and desktops.
- Large models may look attractive on paper, but once they barely fit, latency gets ugly and the system becomes annoying to use.
On Apple Silicon and consumer GPUs, the difference between “fits comfortably” and “technically fits” is the difference between a tool you use daily and one you abandon after a week.
Things that matter more than raw FLOPS in day-to-day use:
- available RAM / unified memory / VRAM
- memory bandwidth
- whether the runtime can keep most weights on the fast device
- context size, which also consumes memory
- how many concurrent requests you expect
If you only remember one rule, remember this: buy headroom, not just capacity.
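To make "headroom" concrete, here is a back-of-envelope sketch of where the memory actually goes: the weights themselves, plus a KV cache that grows with context length. All the model-shape numbers below are assumptions for a hypothetical 7B-class model, and real runtimes add overhead on top for activations and buffers.

```python
# Rough memory estimate for a local LLM. These are approximations;
# real runtimes add overhead for activations, buffers, and the
# runtime itself, so treat the result as a floor, not a budget.

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size in GB: two tensors (K and V) per layer."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value / 1e9

# Hypothetical 7B-class model with assumed dimensions:
weights = weight_gb(7, 4.5)   # ~4-bit quantization with scale overhead
cache = kv_cache_gb(layers=32, kv_heads=8, head_dim=128, context_tokens=8192)
print(f"weights ~{weights:.1f} GB, KV cache ~{cache:.1f} GB at 8k context")
```

Note that the cache term scales linearly with context: the same model at a 4x longer context needs 4x the cache memory, which is why "fits comfortably" at short context can become "technically fits" at long context.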
Quantization is the reason local models are practical
Without quantization, many open models would be impractical on normal hardware.
Quantization reduces memory footprint and usually improves throughput enough to make local inference usable. The trade-off is quality loss, but in practice the loss is often acceptable if you choose the right level.
Operationally, think about quantization like this:
- lower-bit quantization helps a model fit and run faster
- aggressive quantization can noticeably hurt instruction following, coding accuracy, or multilingual output
- the “best” quant depends on your task, not on a generic benchmark
For internal tooling, I would rather run a slightly smaller model at a sane quantization level than force a larger model into a heavily compromised setup.
A common mistake is comparing:
- one model at a comfortable quantization
- another model squeezed into a too-aggressive quantization
That is not a fair model comparison. That is a hardware comparison in disguise.
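The arithmetic behind that disguise is simple. Weight size scales with parameters times bits per weight, so a larger model squeezed into an aggressive quant can occupy less memory than a smaller model at a comfortable one. The bit widths below are idealized; real quantization formats carry extra scale metadata.

```python
# Approximate weight sizes (GB) for two hypothetical model sizes at
# common quantization levels. Real formats are slightly larger because
# they store per-block scales alongside the weights.
def size_gb(params_billion: float, bits: float) -> float:
    return params_billion * bits / 8  # 1e9 params * bits / 8 / 1e9 bytes

for params in (7, 13):
    for bits in (16, 8, 4):
        print(f"{params}B @ {bits}-bit ~= {size_gb(params, bits):.1f} GB")
```

A 13B model forced into 4-bit (~6.5 GB) versus a 7B model at a comfortable 8-bit (~7.0 GB) is a memory trade, not a clean "bigger model vs smaller model" comparison.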
Privacy is better locally, but not magic
“Local means private” is directionally true, but still incomplete.
What local deployment does give you:
- prompts and documents do not need to leave your machine or your own network
- you can control logs, storage, and retention
- you can keep sensitive context out of third-party APIs by default
What it does not guarantee automatically:
- the runtime itself may still log requests if you enable verbose telemetry or debugging
- your application may cache prompts or outputs in files, browser storage, or databases
- copied text still ends up in clipboards, shells, notebooks, and editor histories
- fine-tuning or RAG pipelines can leak data through sloppy indexing and access control
So the privacy benefit is real, but only if the surrounding system is also built sanely.
Pick the model by task, not by leaderboard
In 2026, open model generations move fast enough that hardcoded “top 10 local models” articles rot quickly. A more durable approach is to map model choice to job shape.
For coding assistance
Priorities:
- correct syntax under pressure
- decent repository summarization
- stable edits over long files
- fewer fake APIs and invented config keys
Look for models with a good reputation in code generation and code explanation, then test them on your actual repo tasks: refactors, test writing, config debugging, migration diffs.
For writing and summarization
Priorities:
- tone control
- ability to stay grounded in provided text
- low tendency to embellish
- consistent formatting
Here, smaller or mid-sized instruct models can be enough if your prompts and retrieval are disciplined.
For multilingual work
Priorities:
- natural output in your target language
- not just translation, but idiomatic phrasing
- lower hallucination rate on domain terms
Do not assume a model that is good in English will behave the same way in Chinese, Japanese, German, or mixed-language internal docs. Test that explicitly.
For extraction and classification
Priorities:
- schema compliance
- stable JSON output
- low variance across repeated runs
- ability to say “missing” instead of inventing values
For these tasks, brute “intelligence” matters less than consistency. Smaller models often do surprisingly well if the prompt and examples are tight.
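For extraction work, those priorities can be enforced mechanically rather than eyeballed. A minimal sketch: parse the response as JSON, check required fields, and accept an explicit "missing" marker instead of an invented value. The field names here are hypothetical.

```python
import json

# Required fields for a hypothetical invoice-extraction task.
REQUIRED = {"invoice_id", "amount", "currency"}

def validate_extraction(raw: str) -> tuple[bool, list[str]]:
    """Return (ok, problems) for one model response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, ["invalid JSON"]
    problems = [f"dropped field: {f}" for f in sorted(REQUIRED) if f not in data]
    # An explicit "missing" value passes; detecting *fabricated* values
    # needs a task-level check against the source text.
    return (not problems), problems

ok, problems = validate_extraction(
    '{"invoice_id": "A-17", "amount": "missing", "currency": "EUR"}'
)
```

Running this over twenty repeated generations also gives you the variance signal directly: a model that passes 20/20 is doing a different job than one that passes 14/20 with fluent-looking failures.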
For chat UX
Priorities:
- low latency
- acceptable instruction following
- enough context for the conversation shape
- predictable behavior under repetition
For interactive tools, response speed often matters more than squeezing out another few benchmark points.
Ollama and similar runtimes are useful because they reduce friction
You do not need to marry one runtime. But tools like Ollama are popular for a reason: they make local serving boring enough to be usable.
That matters.
A sensible local runtime should give you:
- easy model pulls and versioning
- a local HTTP API
- streaming responses
- control over context window and generation parameters
- sane model management on laptop or workstation hardware
For quick experiments, a local runtime plus a simple API client is usually enough.
Install Ollama
Check the project’s current installation guide, because package and platform support changes over time.
Typical usage is still simple:
```shell
ollama serve
```

Then in another shell:

```shell
ollama run <model-name>
```

Or call it over HTTP:

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "<model-name>",
  "prompt": "Summarize the following deployment incident in 5 bullet points..."
}'
```

The exact model names will keep changing. The workflow does not.
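By default that endpoint streams newline-delimited JSON, where each line carries a `response` chunk and a `done` flag (check the current API docs, since the wire format can evolve). This sketch reassembles the full text from a captured stream, so it runs without a server.

```python
import json

def join_stream(ndjson: str) -> str:
    """Reassemble a captured Ollama-style NDJSON stream into one string."""
    text = []
    for line in ndjson.splitlines():
        if not line.strip():
            continue
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):  # final chunk; later lines, if any, are ignored
            break
    return "".join(text)

# Hypothetical captured stream:
sample = "\n".join([
    '{"response": "Two pods ", "done": false}',
    '{"response": "crash-looped.", "done": true}',
])
print(join_stream(sample))  # Two pods crash-looped.
```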
A sane local setup looks like this
If I were wiring local LLMs into an actual engineering workflow, I would start with:
- one general instruct model
- one coding-oriented model
- one small fast model for cheap classification or formatting tasks
Then I would keep a tiny evaluation set for each workload.
For example:
- 20 internal documentation summarization samples
- 20 codebase questions with expected answers
- 20 extraction tasks with expected JSON
- 10 multilingual tone checks
- 10 “don’t hallucinate, say unknown” prompts
That is enough to stop making decisions from vibes.
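One low-ceremony way to freeze such a set is JSONL: one record per task with the input and what "good" looks like. The field names below are illustrative, not a standard.

```python
import json

# A tiny frozen task set. Each record names the task shape and the
# expectation you will score against; field names are one convention.
cases = [
    {"task": "summarize", "input": "incident text here",
     "expect_contains": ["rollback"]},
    {"task": "extract", "input": "invoice text here",
     "expect_json_fields": ["amount"]},
    {"task": "refuse", "input": "What was churn last quarter?",
     "expect_contains": ["unknown"]},
]

jsonl = "\n".join(json.dumps(c) for c in cases)
# Round-trips cleanly, diffs well in version control, stays frozen
# between model and runtime upgrades.
parsed = [json.loads(line) for line in jsonl.splitlines()]
```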
Evaluate behavior, not just output quality
When people say “model A is better,” they often mean one of several different things:
- it sounds more fluent
- it follows instructions more reliably
- it is faster
- it produces fewer broken JSON outputs
- it hallucinates less on missing facts
- it survives longer context better
Those are different axes.
A practical evaluation loop:
1. Freeze a small task set
Use real prompts from your workflow, not toy riddles.
2. Compare under the same conditions
Same quantization class if possible, same runtime, same hardware, same context settings.
3. Score failure modes explicitly
Examples:
- fabricated citations
- wrong config keys
- invalid JSON
- ignores system instruction
- loses language consistency
- drops required fields
4. Re-run after runtime or model changes
Local stacks evolve fast. A runtime update can change throughput or memory behavior enough to matter.
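Step 3 above can be sketched as code: score named failure modes per response instead of a single quality number, then tally them across a run. The two checks shown (invalid JSON, dropped fields) are the easy mechanical ones; modes like "fabricated citations" need task-specific checks.

```python
import json
from collections import Counter

def failure_modes(response: str, required_fields: set[str]) -> list[str]:
    """Return the named failure modes triggered by one response."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return ["invalid JSON"]
    if required_fields - data.keys():
        return ["drops required fields"]
    return []

def score_run(responses: list[str], required_fields: set[str]) -> Counter:
    """Tally failure modes across a whole run of the frozen task set."""
    tally = Counter()
    for r in responses:
        tally.update(failure_modes(r, required_fields))
    return tally

# Hypothetical responses from one run:
responses = [
    '{"status": "open", "owner": "infra"}',
    '{"status": "open"}',
    "Sure! Here is the JSON:",
]
tally = score_run(responses, {"status", "owner"})
```

Comparing two models then means comparing two tallies under identical conditions, which is a much harder thing to argue with than "model A felt better."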
Prompting still matters, but grounding matters more
Local models are not a loophole around prompt quality.
If you want stable behavior:
- keep prompts concrete
- specify output shape
- provide examples when formatting matters
- tell the model what to do when information is missing
- separate instructions from source material clearly
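Those five habits fit into a small template: concrete instructions, an explicit output shape, a rule for missing information, and a hard separator between instructions and source material. The delimiters and wording here are one workable convention, not a requirement of any model.

```python
def build_prompt(instructions: str, output_shape: str, source: str) -> str:
    """Assemble a prompt that keeps instructions and source clearly apart."""
    return "\n".join([
        instructions,
        f"Output format: {output_shape}",
        'If the information is not in the source, answer "unknown". Do not guess.',
        "--- SOURCE ---",
        source,
        "--- END SOURCE ---",
    ])

prompt = build_prompt(
    "Summarize the incident in 3 bullet points.",
    "plain text bullets, no preamble",
    "At 09:12 UTC the deploy failed and traffic shifted to the old replica set.",
)
```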
But even a good prompt does not fix missing knowledge. For anything factual or repo-specific, you still need grounding:
- retrieval over your docs or code
- constrained context windows
- explicit source snippets
- post-generation validation when format matters
This is especially important locally because smaller open models can sound confident long before they are correct.
Limits you should expect
Local LLMs are useful, but there are hard edges:
Long context is expensive
Even if a model advertises a large context window, usable quality over that full range is another question. Memory and latency also rise with context.
Tool use is uneven
Some local models can handle structured tool calling reasonably well. Many still need tighter guardrails than frontier hosted models.
Hallucinations remain normal
Local does not mean grounded. If the answer is not in the prompt or the model weights, it can still invent things.
Concurrency is easy to underestimate
A laptop that feels fine for one interactive user can collapse under team usage or background jobs.
Common mistakes
Treating one lucky answer as proof
A model that aces one prompt can still be unstable across twenty.
Over-indexing on benchmark screenshots
Benchmarks are useful as hints, not purchase orders.
Running the biggest model that fits
If latency makes the tool annoying, nobody will use it.
Forgetting output validation
If the model produces JSON for downstream systems, validate it. Always.
Mixing evaluation goals
Do not compare a “fast enough for autocomplete” model against a “slow but careful batch summarizer” model as if they serve the same role.
When local LLMs make sense
I would recommend local inference when:
- data sensitivity is real
- the workload is repetitive and bounded
- you can evaluate against real tasks
- hardware is already available or justified
- “good enough and controllable” beats “best possible but external”
I would avoid it when:
- the team expects frontier-model reasoning with no trade-offs
- you need high concurrency on cheap hardware
- you cannot afford to build evaluation and validation around it
- the operational burden outweighs privacy or cost gains
Final take
Running LLMs locally is no longer a novelty, but it is still easy to use badly.
The practical path is simple:
- pick tasks first
- size hardware by memory headroom
- use quantization deliberately
- evaluate with real prompts
- ground the model when facts matter
- treat the runtime like infrastructure, not a toy
If you do that, local models become a useful engineering tool. If you do not, they become an expensive benchmark hobby.