Run LLM Locally
Running an LLM (large language model) locally is a great way to take advantage of its capabilities without needing an internet connection.
You can do this with Ollama, an open-source project that lets you run AI models on your own hardware.
Why Run an LLM Locally?
There are several reasons why you might want to run an LLM locally:
- Privacy: By running the model locally, you can ensure that your data is not sent to any third-party servers, which can be a concern for sensitive or confidential information.
- Cost: Running the model locally can be more cost-effective than relying on cloud-based services, especially for large-scale projects or applications that require continuous usage.
- Customization: Having the model locally allows you to customize it to your specific needs and integrate it with other tools and systems that you may be using.
- Faster Response Time: Local deployment can result in faster response times compared to relying on cloud-based services, which can be beneficial for applications that require real-time responses.
- Offline Capabilities: With the model running locally, you can still use it even when you don’t have internet access, which can be useful for applications in areas with unreliable connectivity or for tasks that require offline processing.
Install Ollama
macOS
Download the Ollama app for macOS from the Ollama website and open it to install.
Windows
Coming soon! For now, you can install Ollama on Windows via WSL2.
Linux & WSL2
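Install with the one-line script (this uses the ollama.ai domain referenced elsewhere in this guide; the same script is also served from ollama.com):
```
curl https://ollama.ai/install.sh | sh
```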
Docker
The official Ollama Docker image ollama/ollama is available on Docker Hub.
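A typical invocation, per the image's documentation, mounts a volume for downloaded models and exposes the default API port; you can then run a model inside the container:
```
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec -it ollama ollama run llama2
```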
Quickstart
To run and chat with Llama 2:
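The model weights are downloaded automatically on first run, after which an interactive chat prompt opens:
```
ollama run llama2
```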
Model library
Ollama supports a library of open-source models, available at ollama.ai/library.
Popular Models
llama2
The most popular model for general use.
Llama 2 was released by Meta Platforms, Inc. The model is trained on 2 trillion tokens and by default supports a context length of 4096. Llama 2 Chat models are fine-tuned on over 1 million human annotations and are made for chat.
CLI
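Start an interactive chat session with:
```
ollama run llama2
```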
API
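Ollama also serves a local REST API, on port 11434 by default; the prompt below is only an illustration:
```
curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'
```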
Memory requirements
- 7b models generally require at least 8GB of RAM
- 13b models generally require at least 16GB of RAM
- 70b models generally require at least 64GB of RAM
If you run into issues with higher quantization levels, try using the q4 model or shut down any other programs that are using a lot of memory.
mistral
The 7B model released by Mistral AI, updated to version 0.2.
Mistral is a 7.3B parameter model, distributed under the Apache license. It is available in both instruct (instruction-following) and text-completion variants.
Note: this model has been updated to Mistral v0.2. The original model is available as mistral:7b-instruct-q4_0.
The Mistral AI team has noted that Mistral 7B:
- Outperforms Llama 2 13B on all benchmarks
- Outperforms Llama 1 34B on many benchmarks
- Approaches CodeLlama 7B performance on code, while remaining good at English tasks
CLI
Instruct:
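The default tag runs the instruct variant:
```
ollama run mistral
```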
Text completion:
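The text-completion variant is published under its own tag (assuming the :text tag listed in the library):
```
ollama run mistral:text
```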
API
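The same /api/generate endpoint works here; just change the model name (the prompt is illustrative):
```
curl -X POST http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Here is a story about llamas eating grass"
}'
```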
Memory requirements
- 7b models generally require at least 8GB of RAM
codellama
A large language model that can use text prompts to generate and discuss code.
Code Llama is a model for generating and discussing code, built on top of Llama 2. It's designed to make workflows faster and more efficient for developers and to make it easier for people to learn how to code. It can generate both code and natural language about code. Code Llama supports many of the most popular programming languages used today, including Python, C++, Java, PHP, TypeScript (JavaScript), C#, Bash and more.
CLI
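You can chat interactively or pass a prompt directly on the command line:
```
ollama run codellama "Write me a function that outputs the fibonacci sequence"
```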
API
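An example request against the local API (illustrative prompt):
```
curl -X POST http://localhost:11434/api/generate -d '{
  "model": "codellama",
  "prompt": "Write me a function that outputs the fibonacci sequence"
}'
```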
Memory requirements
- 7 billion: 8GB
- 13 billion: 16GB
- 34 billion: 32GB
dolphin-mixtral
An uncensored, fine-tuned model based on the Mixtral mixture of experts model that excels at coding tasks. Created by Eric Hartford.
The Dolphin model by Eric Hartford is based on Mixtral and trained with additional datasets:
- Synthia, OpenHermes and PureDove
- New Dolphin-Coder
- MagiCoder
CLI
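Start an interactive session with:
```
ollama run dolphin-mixtral
```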
API
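Example request (illustrative prompt):
```
curl -X POST http://localhost:11434/api/generate -d '{
  "model": "dolphin-mixtral",
  "prompt": "Write a quicksort implementation in Python"
}'
```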
Memory requirements
- 48GB+ memory required
llama2-uncensored
Uncensored Llama 2 model by George Sung and Jarrad Hope.
Llama 2 Uncensored is based on Meta’s Llama 2 model, and was created by George Sung and Jarrad Hope using the process defined by Eric Hartford in his blog post.
CLI
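Chat with the model with:
```
ollama run llama2-uncensored
```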
API
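Example request (illustrative prompt):
```
curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2-uncensored",
  "prompt": "Why is the sky blue?"
}'
```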
Memory requirements
- 7b models generally require at least 8GB of RAM
- 70b models generally require at least 64GB of RAM
orca-mini
A general-purpose model ranging from 3 billion parameters to 70 billion, suitable for entry-level hardware.
Orca Mini is a Llama and Llama 2 model trained on Orca-style datasets created using the approaches defined in the paper Orca: Progressive Learning from Complex Explanation Traces of GPT-4. There are two variations available: the original Orca Mini, based on Llama, in 3, 7, and 13 billion parameter sizes, and v3, based on Llama 2, in 7, 13, and 70 billion parameter sizes.
CLI
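Run the default variant, or append a size tag to select a specific parameter count (assuming that tag is published in the library):
```
ollama run orca-mini
# a specific size, if the tag exists:
# ollama run orca-mini:13b
```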
API
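Example request (illustrative prompt):
```
curl -X POST http://localhost:11434/api/generate -d '{
  "model": "orca-mini",
  "prompt": "Explain what a large language model is in one paragraph"
}'
```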
Memory requirements
- 7b models generally require at least 8GB of RAM
- 13b models generally require at least 16GB of RAM
- 70b models generally require at least 64GB of RAM
wizard-vicuna-uncensored
Wizard Vicuna Uncensored is a 7B, 13B, and 30B parameter model based on Llama 2, uncensored by Eric Hartford.
The models were trained against LLaMA-7B with a subset of the dataset; responses that contained alignment / moralizing were removed.
CLI
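Run the model with:
```
ollama run wizard-vicuna-uncensored
```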
API
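Example request (illustrative prompt):
```
curl -X POST http://localhost:11434/api/generate -d '{
  "model": "wizard-vicuna-uncensored",
  "prompt": "Why is the sky blue?"
}'
```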
Memory requirements
- 7b models generally require at least 8GB of RAM
- 13b models generally require at least 16GB of RAM
- 30b models generally require at least 32GB of RAM
If you run into issues with higher quantization levels, try using the q4 model or shut down any other programs that are using a lot of memory.