Run an LLM Locally

Running an LLM (Large Language Model) locally is a great way to take advantage of its capabilities without needing an internet connection.

You can achieve this through Ollama, an open-source project that allows you to run AI models on your own hardware.

Why Run an LLM Locally?

There are several reasons why you might want to run an LLM locally:

  1. Privacy: By running the model locally, you can ensure that your data is not sent to any third-party servers, which can be a concern for sensitive or confidential information.
  2. Cost: Running the model locally can be more cost-effective than relying on cloud-based services, especially for large-scale projects or applications that require continuous usage.
  3. Customization: Having the model locally allows you to customize it to your specific needs and integrate it with other tools and systems that you may be using.
  4. Faster Response Time: Local deployment can result in faster response times compared to relying on cloud-based services, which can be beneficial for applications that require real-time responses.
  5. Offline Capabilities: With the model running locally, you can still use it even when you don’t have internet access, which can be useful for applications in areas with unreliable connectivity or for tasks that require offline processing.

Install Ollama

macOS

Download the macOS app from the Ollama website.

Windows

Coming soon! For now, you can install Ollama on Windows via WSL2.

Linux & WSL2

curl https://ollama.ai/install.sh | sh

Manual install instructions
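
After installing, a quick sanity check confirms the CLI works and the local server is reachable. This sketch assumes the installer put the binary on your PATH and started the server, which listens on port 11434 by default:

# Print the installed CLI version
ollama --version

# The local server answers on port 11434 when it is running
curl http://localhost:11434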

Docker

The official Ollama Docker image ollama/ollama is available on Docker Hub.
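
A minimal sketch of running it (CPU-only; the flags below mirror the image's usual instructions, so verify them against the Docker Hub page):

# Start the Ollama server in a container, persisting models in a named volume
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# Run a model inside the running container
docker exec -it ollama ollama run llama2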

Quickstart

To run and chat with Llama 2:

ollama run llama2
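
ollama run downloads the model on first use. You can also fetch models ahead of time and check what is installed locally:

# Download the llama2 weights without starting a chat session
ollama pull llama2

# List the models available on this machine
ollama list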

Model library

Ollama supports a library of open-source models, available at ollama.ai/library.

llama2

The most popular model for general use.

Llama 2 was released by Meta Platforms, Inc. The model is trained on 2 trillion tokens and supports a context length of 4,096 tokens by default. Llama 2 Chat models are fine-tuned on over 1 million human annotations and are intended for dialogue.

CLI

ollama run llama2

API

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'

API documentation
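
By default, /api/generate streams the answer back as newline-delimited JSON objects. If you would rather receive a single JSON payload, the request accepts a stream flag; a small sketch (jq is assumed only for pretty-printing and is optional):

# Disable streaming to get one JSON object, then extract the generated text
curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}' | jq '.response'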

Memory requirements

  • 7b models generally require at least 8GB of RAM
  • 13b models generally require at least 16GB of RAM
  • 70b models generally require at least 64GB of RAM

If you run into issues with higher quantization levels, try using the q4 model or shut down any other programs that are using a lot of memory.
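
For example, a 4-bit quantized build can be selected explicitly with a tag. The tag below is illustrative only; check ollama.ai/library/llama2 for the tags that are actually published:

# Illustrative tag for a 4-bit quantized 13b chat build; confirm the exact name in the library
ollama run llama2:13b-chat-q4_0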

mistral

The 7B model released by Mistral AI, updated to version 0.2.

Mistral is a 7.3B parameter model distributed under the Apache license. It is available in both instruct (instruction-following) and text completion variants.

Note: this model has been updated to Mistral v0.2. The original model is available as mistral:7b-instruct-q4_0.
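
If you need the earlier v0.1 build, you can run the pinned tag mentioned above directly:

# Run the original (pre-0.2) instruct build by its explicit tag
ollama run mistral:7b-instruct-q4_0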

The Mistral AI team has noted that Mistral 7B:

  • Outperforms Llama 2 13B on all benchmarks
  • Outperforms Llama 1 34B on many benchmarks
  • Approaches CodeLlama 7B performance on code, while remaining good at English tasks

CLI

Instruct:

ollama run mistral

Text completion:

ollama run mistral:text

API

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Here is a story about llamas eating grass"
}'

Memory requirements

  • 7b models generally require at least 8GB of RAM

codellama

A large language model that can use text prompts to generate and discuss code.

Code Llama is a model for generating and discussing code, built on top of Llama 2. It is designed to make developer workflows faster and more efficient and to make it easier for people to learn how to code. It can generate both code and natural language about code. Code Llama supports many of the most popular programming languages used today, including Python, C++, Java, PHP, TypeScript (JavaScript), C#, Bash, and more.

CLI

ollama run codellama "Write me a function that outputs the fibonacci sequence"

API

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "codellama",
  "prompt": "Write me a function that outputs the fibonacci sequence"
}'

Memory requirements

  • 7 billion: 8GB
  • 13 billion: 16GB
  • 34 billion: 32GB

dolphin-mixtral

An uncensored, fine-tuned model based on the Mixtral mixture of experts model that excels at coding tasks. Created by Eric Hartford.

The Dolphin model, created by Eric Hartford and based on Mixtral, is trained with additional datasets:

  • Synthia, OpenHermes and PureDove
  • New Dolphin-Coder
  • MagiCoder

CLI

ollama run dolphin-mixtral

API

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "dolphin-mixtral",
  "prompt": "Why is the sky blue?"
}'

Memory requirements

  • 48GB+ memory required

llama2-uncensored

Uncensored Llama 2 model by George Sung and Jarrad Hope.

Llama 2 Uncensored is based on Meta’s Llama 2 model, and was created by George Sung and Jarrad Hope using the process defined by Eric Hartford in his blog post.

CLI

ollama run llama2-uncensored

API

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2-uncensored",
  "prompt": "Write a recipe for dangerously spicy mayo."
}'

Memory requirements

  • 7b models generally require at least 8GB of RAM
  • 70b models generally require at least 64GB of RAM

orca-mini

A general-purpose model ranging from 3 billion to 70 billion parameters, suitable for entry-level hardware.

Orca Mini is a Llama and Llama 2 model trained on Orca-style datasets created using the approaches defined in the paper Orca: Progressive Learning from Complex Explanation Traces of GPT-4. Two variations are available: the original Orca Mini, based on Llama, in 3, 7, and 13 billion parameter sizes, and v3, based on Llama 2, in 7, 13, and 70 billion parameter sizes.

CLI

ollama run orca-mini
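
Specific parameter sizes can be selected with a tag. The tags below are illustrative; check ollama.ai/library/orca-mini for the sizes that are actually published:

# Illustrative size tags; confirm the exact names in the library
ollama run orca-mini:3b
ollama run orca-mini:13b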

API

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "orca-mini",
  "prompt": "Why is the sky blue?"
}'

Memory requirements

  • 7b models generally require at least 8GB of RAM
  • 13b models generally require at least 16GB of RAM
  • 70b models generally require at least 64GB of RAM

wizard-vicuna-uncensored

Wizard Vicuna Uncensored is a 7B, 13B, and 30B parameter model based on Llama 2 uncensored by Eric Hartford.

Wizard Vicuna Uncensored is a 7B, 13B, and 30B parameter model based on Llama 2, uncensored by Eric Hartford. The models were trained against LLaMA-7B with a subset of the dataset; responses that contained alignment or moralizing were removed.

CLI

ollama run wizard-vicuna-uncensored

API

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "wizard-vicuna-uncensored",
  "prompt": "Who made Rose promise that she would never let go?"
}'

Memory requirements

  • 7b models generally require at least 8GB of RAM
  • 13b models generally require at least 16GB of RAM
  • 30b models generally require at least 32GB of RAM

If you run into issues with higher quantization levels, try using the q4 model or shut down any other programs that are using a lot of memory.
