Downloading and executing large language models (LLMs) locally can be beneficial for both privacy and cost. The data stays on the device and isn’t subjected to third-party terms of service, and there are no per-token inference costs, which makes it practical to run tasks that require a large number of tokens. Ollama is one of the most widely used tools for running LLMs locally.
This article explores the Ollama platform for downloading and running LLMs locally.
Table of Contents
- Understanding Ollama
- Local Execution of LLMs using Ollama Shell
- Ollama API Calling through Python
Understanding Ollama
Ollama is an open-source tool designed to help users set up and run large language models such as Phi-2, Llama 3, etc. locally. It is built on top of llama.cpp, a C/C++ library designed for efficient local inference of LLMs across different platforms and hardware configurations. llama.cpp started as a pure C/C++ implementation of inference for Meta’s Llama models and supports a huge range of LLMs for finetuning and other advanced tasks.
Ollama supports a wide range of LLMs, which can be viewed in the official Ollama model library – https://ollama.com/library – and in the Git repo (https://github.com/ollama/ollama) as well.
Ollama Supported Model Examples
Local Execution of LLMs using Ollama Shell
Step 1: Visit https://ollama.com/download, download the Ollama installer and install it.
Step 2: Once the installation is finished, run Ollama and verify that it is working by entering the following terminal command –
sachintripathi@Sachins-MacBook-Air ~ % ollama
Step 3: Download and run the open-source LLM of your choice. I’m using the Phi-2 model for the demonstration here –
sachintripathi@Sachins-MacBook-Air ~ % ollama run phi
Step 4: Once the model is pulled and running, we can provide prompts and generate responses.
>>> can you tell me about India in brief?
Step 5: Let’s exit the Ollama shell. The /? command lists the available shell commands, including /bye, which ends the session –
>>> /?
Step 6: Let’s execute a prompt and save its response to a file for easy access later –
sachintripathi@Sachins-MacBook-Air ~ % ollama run phi "can you tell me about India in brief?" >> response.md
A response.md file is generated containing the response to the entered prompt.
Ollama API Calling through Python
Ollama can also be called locally through API requests, provided the Ollama server is running (it listens on http://localhost:11434 by default).
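Before writing the generation script, it can be useful to confirm that the local server is reachable and see which models have already been pulled. The short sketch below queries the /api/tags endpoint, which lists the locally available models; treat it as an optional check, assuming the default server address, rather than as part of the main walkthrough.

import requests

# Check that the local Ollama server is reachable and list the models
# that have already been pulled. Assumes the default address
# http://localhost:11434 and the /api/tags endpoint.
try:
    resp = requests.get("http://localhost:11434/api/tags", timeout=5)
    resp.raise_for_status()
    models = resp.json().get("models", [])
    print("Locally available models:", [m["name"] for m in models])
except requests.exceptions.ConnectionError:
    print("Ollama server is not running - start the Ollama app or run 'ollama serve'.")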
Step 1: Create a Python script and import the requests and json packages –
import json
import requests
Step 2: Create the url, headers and data variables holding the Ollama localhost URL, the content type and the model details –
url = "http://localhost:11434/api/generate"
headers = {
"Content-Type": "application/json"
}
data = {
"model": "phi",
"prompt": "can you tell me about India in brief?",
"stream": False
}
Step 3: Call the post method of the requests package, passing the variables declared in Step 2. Store the response in a variable and extract the generated text from the returned JSON object using the code shown below –
response = requests.post(url, headers=headers, data=json.dumps(data))

# a 200 status code means the generation succeeded
if response.status_code == 200:
    response_text = response.text
    data = json.loads(response_text)
    # the generated text is stored under the 'response' key
    actual_response = data['response']
    print(actual_response)
else:
    print("Error:", response.status_code, response.text)
Output
Running the script prints the response to the given prompt, generated locally by the open-source Phi-2 model.
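The example above sets "stream" to False so that the whole answer arrives as a single JSON object. If "stream" is set to True instead, the /api/generate endpoint returns one JSON object per line, each carrying a chunk of the response, with a final object whose "done" field is true. The following is a minimal sketch of consuming such a stream, reusing the same url, headers and data variables; it is an optional extension of the script above, not part of the original walkthrough.

import json
import requests

url = "http://localhost:11434/api/generate"
headers = {"Content-Type": "application/json"}
data = {
    "model": "phi",
    "prompt": "can you tell me about India in brief?",
    "stream": True  # ask the server to stream the answer chunk by chunk
}

# stream=True on the requests side avoids buffering the whole body at once
with requests.post(url, headers=headers, data=json.dumps(data), stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        # each line is a JSON object; "response" holds the next piece of text
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            print()  # the final object signals completion
            break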
Final Words
By enabling users to run and prompt LLMs locally on their own machines, Ollama empowers a wider audience to use and learn about generative AI safely and easily. This open-source project makes LLMs more accessible, lowering the cost and privacy barriers and allowing users to innovate and do research more easily.