Ollama complete guide — run LLMs locally (2026)

Ollama lets you run large language models locally with a single command: ollama run llama3.2. No API key, no subscription, no data leaving your machine.

This guide takes you from installation on Mac, Linux, and Windows to a complete setup: how to pick the right model (with a size table), a CLI cheat sheet, the REST API with curl and Python, customisation with a Modelfile and system prompts, and performance settings that roughly halve the VRAM used by the context cache.

Installation

Mac

brew install ollama

Start the server:

ollama serve

Or as a background service that starts at login:

brew services start ollama

Linux

curl -fsSL https://ollama.com/install.sh | sh

The installer sets up Ollama as a systemd service that starts automatically. Check status with:

systemctl status ollama

Windows

Download the installer from ollama.com — it sets up as a tray app and starts automatically at login.


Your first model

Pull and run a model with one command:

ollama run llama3.2:3b

Ollama downloads the model on first run (~2 GB), caches it in ~/.ollama/models/ (a Linux systemd install stores models under /usr/share/ollama/.ollama/models instead), and opens an interactive chat. Type /bye to exit.

To download without opening a chat:

ollama pull qwen3.5

Which model should you choose?

| Model | Size | RAM/VRAM | Best for |
|---|---|---|---|
| llama3.2:1b | 1.3 GB | 2 GB | Fastest option; good for classification, short answers |
| llama3.2:3b | 2.0 GB | 3 GB | Best starting point; capable, runs anywhere |
| llama3.1:8b | 4.9 GB | 6 GB | Solid general-purpose model, good for code and text |
| qwen3.5 | 6.6 GB | 8 GB | Recommended; 9.7B, 262k context window |
| gemma4:12b | 7.6 GB | 9 GB | Google’s model; excellent at following instructions |
| phi4:14b | 9.1 GB | 11 GB | Microsoft’s compact model, strong at reasoning |
| llama3.1:70b | 43 GB | 48 GB | Powerful; requires a server with lots of VRAM/RAM |
| minicpm-v4.6 | 1.6 GB | 2 GB | Multimodal; can analyse images |

Rule of thumb: pick the largest model that fits in your GPU’s VRAM. The model should ideally sit entirely in VRAM — if it overflows to RAM, speed drops dramatically.
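
The rule of thumb can be written as a small lookup. This is only a sketch: the MODELS list copies the Size and RAM/VRAM columns from the table above (leaving out the multimodal minicpm-v entry), pick_model is a hypothetical helper, and real memory use also depends on context length and quantisation.

```python
# Hypothetical helper: pick the largest model from the table that fits in VRAM.
# Each entry is (name, download size in GB, approximate RAM/VRAM needed in GB).
MODELS = [
    ("llama3.2:1b", 1.3, 2),
    ("llama3.2:3b", 2.0, 3),
    ("llama3.1:8b", 4.9, 6),
    ("qwen3.5", 6.6, 8),
    ("gemma4:12b", 7.6, 9),
    ("phi4:14b", 9.1, 11),
    ("llama3.1:70b", 43.0, 48),
]


def pick_model(vram_gb):
    """Return the name of the largest listed model that fits, or None."""
    fitting = [m for m in MODELS if m[2] <= vram_gb]
    if not fitting:
        return None
    # "Largest" here means the biggest download size among the models that fit.
    return max(fitting, key=lambda m: m[1])[0]


print(pick_model(8))   # qwen3.5
print(pick_model(12))  # phi4:14b
```

Treat the result as a starting point and confirm with ollama ps that the model really runs at 100% GPU.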

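Search available models (if your version of the CLI has no search subcommand, browse the library at ollama.com/library instead):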

ollama search llama

CLI cheat sheet

# Interactive chat
ollama run qwen3.5

# One-shot prompt
ollama run llama3.2:3b "Explain TCP/IP in three sentences"

# Pipe input
cat article.txt | ollama run qwen3.5 "Summarise this"

# List downloaded models
ollama list

# Show model details (parameters, quantisation, context length)
ollama show qwen3.5

# Delete a model
ollama rm llama3.2:1b

# Copy and rename a model
ollama cp qwen3.5 my-project

# Show running models and VRAM usage
ollama ps

REST API

Ollama exposes an HTTP API on localhost:11434. Use it directly from the terminal or from any programming language.

Generate text

curl http://localhost:11434/api/generate -d '{
  "model": "qwen3.5",
  "prompt": "What is the difference between TCP and UDP?",
  "stream": false
}'

Key fields in the response:

{
  "response": "TCP is connection-oriented...",
  "eval_count": 142,
  "eval_duration": 1923000000,
  "prompt_eval_duration": 215000000
}

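The durations are in nanoseconds, so eval_count / (eval_duration / 1e9) gives the generation speed in tokens per second. For the response above that is 142 / 1.923 ≈ 74 tokens per second.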
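
The same calculation in Python, as a minimal sketch. The function name tokens_per_second is an illustrative choice, and the sample dict simply repeats the fields from the response shown above.

```python
def tokens_per_second(response):
    """Generation speed computed from an Ollama response dict."""
    # Durations in the response are reported in nanoseconds.
    seconds = response["eval_duration"] / 1e9
    return response["eval_count"] / seconds


# Values copied from the sample response above.
sample = {
    "eval_count": 142,
    "eval_duration": 1923000000,
    "prompt_eval_duration": 215000000,
}

print(f"{tokens_per_second(sample):.1f} tokens/s")                     # 73.8 tokens/s
print(f"{sample['prompt_eval_duration'] / 1e6:.0f} ms on the prompt")  # 215 ms on the prompt
```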

Chat (with history)

curl http://localhost:11434/api/chat -d '{
  "model": "qwen3.5",
  "messages": [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "What is a hash table?"}
  ],
  "stream": false
}'
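
The chat endpoint only sees the messages sent in the current request, so the client has to keep the history and resend it on every turn. The sketch below shows one way to do that; Conversation and send_to_ollama are hypothetical names, and the send argument exists only so the class can be tried without a running server.

```python
import json
import urllib.request


def send_to_ollama(messages, model="qwen3.5", url="http://localhost:11434/api/chat"):
    """POST the full message list to /api/chat and return the reply text."""
    payload = json.dumps({"model": model, "messages": messages, "stream": False}).encode()
    req = urllib.request.Request(url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]


class Conversation:
    """Keeps the running message list and resends it on every turn."""

    def __init__(self, system=None, send=send_to_ollama):
        self.send = send
        self.messages = []
        if system:
            self.messages.append({"role": "system", "content": system})

    def ask(self, prompt):
        self.messages.append({"role": "user", "content": prompt})
        reply = self.send(self.messages)
        # Store the reply so the next turn includes it as context.
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

With a server running, chat = Conversation(system="You are a concise technical assistant.") followed by chat.ask("What is a hash table?") and chat.ask("How does it handle collisions?") lets the second question build on the first answer. The history grows with every turn, so long sessions eventually need trimming to stay within num_ctx.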

Streaming

Set "stream": true (the default) to receive tokens as they are generated:

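# -s hides curl's progress meter, -N disables output buffering
curl -sN http://localhost:11434/api/generate -d '{
  "model": "qwen3.5",
  "prompt": "Write a poem about terminal programming",
  "stream": true
}' | while IFS= read -r line; do
  # printf, unlike echo in some shells, leaves the JSON backslash escapes untouched
  printf '%s\n' "$line" | python3 -c "import json,sys; d=json.load(sys.stdin); print(d.get('response',''), end='', flush=True)"
done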
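
The shell loop starts a new Python process for every line. A single Python script can read the stream directly. This is a sketch using only the standard library: iter_tokens and stream_generate are hypothetical names, and it assumes each line of the streamed body is one JSON object with a "response" fragment and a "done" flag on the last one.

```python
import json
import urllib.request


def iter_tokens(lines):
    """Yield text fragments from an iterable of JSON lines (bytes or str)."""
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break


def stream_generate(prompt, model="qwen3.5", url="http://localhost:11434/api/generate"):
    """Print tokens from /api/generate as they arrive."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:  # the response object iterates line by line
        for token in iter_tokens(resp):
            print(token, end="", flush=True)
    print()
```

Calling stream_generate("Write a poem about terminal programming") prints the text incrementally. Keeping the parsing in iter_tokens separate from the HTTP call makes it easy to test with canned lines.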

Useful options

curl http://localhost:11434/api/generate -d '{
  "model": "qwen3.5",
  "prompt": "...",
  "stream": false,
  "options": {
    "temperature": 0.7,
    "num_predict": 500,
    "num_ctx": 8192,
    "top_p": 0.9,
    "seed": 42
  }
}'

| Option | Default | What it does |
|---|---|---|
| temperature | 0.8 | Creativity: 0 = deterministic, 1+ = more random |
| num_predict | -1 (unlimited) | Max tokens to generate |
| num_ctx | model’s default | Context window (tokens) |
| seed | random | Fixed seed → reproducible responses |
| top_p | 0.9 | Nucleus sampling |

Python

Ollama’s API is simple enough to use with urllib — no extra packages needed:

import urllib.request, json

def ollama(prompt, model="qwen3.5", system=None):
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})

    payload = json.dumps({
        "model": model,
        "messages": messages,
        "stream": False,
        "options": {"temperature": 0}
    }).encode()

    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=payload,
        headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]

# Usage
answer = ollama("Explain the difference between list and tuple in Python")
print(answer)

Or with the official package:

pip install ollama

import ollama

response = ollama.chat(
    model="qwen3.5",
    messages=[{"role": "user", "content": "What is Rust?"}]
)
print(response["message"]["content"])

Modelfile — customise a model

A Modelfile lets you create a new model based on an existing one, with a fixed system prompt and fixed parameters such as temperature and context size.

Create the file Modelfile:

FROM qwen3.5

SYSTEM """
You are a precise technical assistant. Be concise.
Show code examples when relevant. Prefer simple over clever.
"""

PARAMETER temperature 0.3
PARAMETER num_ctx 16384

Build and use it:

ollama create my-assistant -f Modelfile
ollama run my-assistant "What is a hash table?"

List your custom models:

ollama list

You can share Modelfiles with colleagues — they pull the base model and run ollama create.
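
If you maintain several variants, for example one per project, the Modelfile text can be generated from a script. The sketch below uses only the FROM, SYSTEM, and PARAMETER directives shown above; render_modelfile is a hypothetical helper, and it assumes the system prompt itself contains no triple quotes.

```python
def render_modelfile(base, system, parameters):
    """Return Modelfile text with FROM, SYSTEM and PARAMETER lines."""
    lines = [f"FROM {base}", "", 'SYSTEM """', system.strip(), '"""', ""]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


text = render_modelfile(
    base="qwen3.5",
    system="You are a precise technical assistant. Be concise.",
    parameters={"temperature": 0.3, "num_ctx": 16384},
)
print(text)
```

Write the result to a file named Modelfile and build it with ollama create as shown above.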


Performance tips

Flash Attention

Flash Attention reduces VRAM usage and increases speed on long contexts:

OLLAMA_FLASH_ATTENTION=1 ollama serve

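On a Mac, add the variable to ~/.zshrc so that every ollama serve started from a terminal picks it up (a background service started with brew services does not read ~/.zshrc):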

export OLLAMA_FLASH_ATTENTION=1

KV cache quantisation

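Stores the context cache in 8-bit or 4-bit form instead of the default 16-bit. q8_0 cuts context-cache VRAM roughly in half, with minimal quality loss. The setting requires Flash Attention to be enabled:

OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve   # about half the cache VRAM, no visible difference
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q4_0 ollama serve   # about a quarter of the cache VRAM, slight quality loss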
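
To see where the saving comes from, the cache size can be estimated from the model architecture. This is a rough sketch, not how Ollama itself reports memory: kv_cache_gib is a hypothetical helper, the architecture numbers are an assumption (the published Llama 3.1 8B configuration of 32 layers, 8 KV heads, and head size 128), and the bytes per element follow the q8_0 and q4_0 block layouts (32 values plus a 2-byte scale).

```python
# Approximate bytes per cached element for each cache type.
# q8_0 uses 34 bytes per block of 32 values, q4_0 uses 18 bytes per block.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, num_ctx, cache_type="f16"):
    """Estimate KV cache size in GiB for one sequence of num_ctx tokens."""
    # Keys and values are both cached, hence the factor 2.
    elements = 2 * n_layers * n_kv_heads * head_dim * num_ctx
    return elements * BYTES_PER_ELEMENT[cache_type] / 2**30


# Assumed architecture: 32 layers, 8 KV heads, head size 128.
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_gib(32, 8, 128, num_ctx=8192, cache_type=cache_type)
    print(f"{cache_type}: {size:.2f} GiB")
# f16: 1.00 GiB
# q8_0: 0.53 GiB
# q4_0: 0.28 GiB
```

The cache grows linearly with num_ctx, so the saving matters most with long context windows. The model weights themselves are not affected by this setting.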

Parallel requests

Ollama handles one request at a time by default. For server use:

OLLAMA_NUM_PARALLEL=4 ollama serve         # 4 concurrent requests
OLLAMA_MAX_LOADED_MODELS=2 ollama serve    # keep 2 models in VRAM
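
On the client side, concurrent requests can be sent from a thread pool. A minimal sketch: ask_all is a hypothetical helper, and ask can be any function that takes a prompt and returns a string, for example the ollama() helper from the Python section.

```python
from concurrent.futures import ThreadPoolExecutor


def ask_all(prompts, ask, max_workers=4):
    """Run ask(prompt) for every prompt concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ask, prompts))
```

Matching max_workers to OLLAMA_NUM_PARALLEL is a reasonable default, since requests beyond that limit have to wait their turn on the server. Each parallel slot needs its own context cache, so VRAM use rises with the setting.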

Check what’s running

ollama ps
# NAME            ID      SIZE    PROCESSOR    UNTIL
# qwen3.5:latest  ...   7.9 GB  100% GPU     4 minutes from now

Models unload automatically after 5 minutes of inactivity. Change the timeout:

OLLAMA_KEEP_ALIVE=30m ollama serve   # keep loaded for 30 min
OLLAMA_KEEP_ALIVE=-1 ollama serve    # keep loaded forever
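
The timeout can also be set per request with the keep_alive field of the API, which overrides the server default for that model. The sketch below sends a request without a prompt, which loads the model without generating anything. keep_alive_payload and set_keep_alive are hypothetical names.

```python
import json
import urllib.request


def keep_alive_payload(model, keep_alive):
    """Body for a prompt-less /api/generate call that only loads or unloads a model."""
    return {"model": model, "keep_alive": keep_alive, "stream": False}


def set_keep_alive(model, keep_alive, url="http://localhost:11434/api/generate"):
    """Send the request and return the decoded JSON response."""
    data = json.dumps(keep_alive_payload(model, keep_alive)).encode()
    req = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

set_keep_alive("qwen3.5", "30m") preloads the model and keeps it for 30 minutes, and set_keep_alive("qwen3.5", 0) unloads it immediately.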

Network access

By default Ollama only listens on localhost. Open it for network access (Ollama has no built-in authentication, so anyone who can reach the port can use the API):

OLLAMA_HOST=0.0.0.0:11434 ollama serve

Use Tailscale to expose your GPU server securely on your private network without opening it to the internet:

# On the GPU server:
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# From your laptop via Tailscale:
OLLAMA_HOST=gpu-box:11434 ollama run qwen3.5
# or directly via API:
curl http://gpu-box:11434/api/generate -d '{"model":"gemma4:12b","prompt":"..."}'
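
From Python, a quick way to confirm that the remote server is reachable is to list its models through the /api/tags endpoint. This is a sketch under stated assumptions: base_url, model_names, and list_remote_models are hypothetical helpers, gpu-box is the example hostname used above, and the host value is assumed to include the port.

```python
import json
import os
import urllib.request


def base_url(host=None):
    """Turn an OLLAMA_HOST-style value such as 'gpu-box:11434' into a base URL."""
    host = host or os.environ.get("OLLAMA_HOST") or "localhost:11434"
    if not host.startswith(("http://", "https://")):
        host = "http://" + host
    return host.rstrip("/")


def model_names(tags):
    """Extract model names from a decoded /api/tags response."""
    return [m["name"] for m in tags.get("models", [])]


def list_remote_models(host=None, timeout=5):
    """Ask the server which models it has; raises URLError if it is unreachable."""
    with urllib.request.urlopen(base_url(host) + "/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))
```

list_remote_models("gpu-box:11434") returns the same names that ollama list shows on the server. The short timeout makes a missing Tailscale connection fail quickly instead of hanging.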

On Linux with systemd, add the environment variables to a drop-in override for the service. Open the override in an editor:

sudo systemctl edit ollama

Add these lines and save:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"

Then restart the service:

sudo systemctl restart ollama

OpenAI-compatible API

Ollama supports OpenAI’s API format, so existing code that uses the openai package works once you point base_url at the local server and pass a placeholder API key:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client, ignored by Ollama
)

response = client.chat.completions.create(
    model="qwen3.5",
    messages=[{"role": "user", "content": "What is a linked list?"}]
)
print(response.choices[0].message.content)

What does it cost to run?

Nothing — except electricity. A 7B model on an RTX 4070 Ti (200W under load) costs roughly:

| Use | Energy | Cost (€0.30/kWh) |
|---|---|---|
| 1 hour of chat (~50 requests) | ~0.1 kWh | ~€0.03 |
| 1 day of batch processing | ~0.5 kWh | ~€0.15 |
| GPT-4o via API (same volume) | n/a | ~€5–20 |

On an M1 Pro (25W), an hour under full load uses about 0.025 kWh, which costs less than €0.01 at the same rate.
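
The arithmetic behind these figures is energy (kWh) = watts × hours / 1000, multiplied by the price per kWh. A small sketch, with electricity_cost as a hypothetical helper. The 30 minutes of actual GPU load per hour of chat is an assumption chosen to match the ~0.1 kWh figure, since the GPU idles between requests.

```python
def electricity_cost(watts, hours, price_per_kwh=0.30):
    """Return (energy in kWh, cost) for a device drawing `watts` for `hours`."""
    kwh = watts * hours / 1000
    return kwh, kwh * price_per_kwh


for label, watts, hours in [
    ("RTX 4070 Ti, 30 min of GPU load", 200, 0.5),
    ("M1 Pro, 1 hour of load", 25, 1.0),
]:
    kwh, cost = electricity_cost(watts, hours)
    print(f"{label}: {kwh:.3f} kWh, about €{cost:.4f}")
# RTX 4070 Ti, 30 min of GPU load: 0.100 kWh, about €0.0300
# M1 Pro, 1 hour of load: 0.025 kWh, about €0.0075
```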


See also M1 Pro vs RTX 4070 Ti — GPU and LLM benchmarks for concrete numbers on what the two platforms deliver.