Run Your Own AI Code Assistant with Ollama

Most "AI coding assistant" tutorials assume you're fine piping your entire codebase through someone else's API. I'm not. If you're building anything with proprietary logic, client data, or just a healthy distrust of SaaS pricing cliffs, that's a problem.

Ollama lets you run large language models locally — on your own machine or a self-hosted box — with a dead-simple HTTP API that mimics OpenAI's. That last part matters: it means every editor plugin, every script, every tool already wired to OpenAI can point at your local instance instead. Zero vendor lock-in, zero per-token billing, and your code never leaves your network.

This tutorial walks through spinning up Ollama, picking the right model for code tasks, wiring it into VS Code via Continue.dev, and setting up a tiny shell helper for one-off questions from the terminal. I'm running this on a self-hosted Ubuntu 22.04 box with an RTX 3080 (10 GB VRAM), but I'll note where a CPU-only or Apple Silicon path diverges.

Why Ollama Over the Alternatives

There are a few ways to run local LLMs: llama.cpp directly, LM Studio (GUI, macOS/Windows only), Jan.ai, or Ollama. I've tried all of them. Ollama wins for a server/headless setup because:

It ships a proper HTTP server on port 11434 out of the box
Model management is ollama pull <model> — no manual GGUF wrangling
The API is a near-drop-in for OpenAI's /v1/chat/completions
It handles GPU offloading automatically without you touching CUDA flags

LM Studio is fine if you want a pretty GUI on a laptop. For a self-hosted dev environment that starts on boot and serves multiple clients, Ollama is the right tool.

Installing Ollama on Ubuntu (and macOS)

Ubuntu / Debian (with NVIDIA GPU):

First, make sure you have the NVIDIA drivers and CUDA toolkit installed. Then:

curl -fsSL https://ollama.com/install.sh | sh

Yes, curl-pipe-sh. I know. Check the script first if that bothers you — it's readable. As of Ollama 0.1.38 (May 2025), the installer sets up a systemd service automatically.

Verify it's running:

systemctl status ollama
# should show: active (running)

macOS (Apple Silicon):

Download the .dmg from ollama.com or:

brew install ollama
brew services start ollama

On M1/M2/M3, Ollama uses Metal for GPU acceleration. A 7B model runs comfortably on 8 GB unified memory.

Expose Ollama to your local network (optional):

By default Ollama binds to 127.0.0.1. If you want other machines on your LAN (or a VM) to hit it:

# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"

Then systemctl daemon-reload && systemctl restart ollama.

Picking the Right Model for Code

Not all models are equal for coding tasks. Here's what I've actually tested:

Model	Size on disk	VRAM needed	Code quality	Speed (RTX 3080)
`qwen2.5-coder:7b`	~4.7 GB	~5 GB	★★★★☆	~35 tok/s
`qwen2.5-coder:14b`	~9 GB	~10 GB	★★★★★	~18 tok/s
`deepseek-coder-v2:16b`	~10 GB	~11 GB	★★★★★	~15 tok/s
`codellama:7b`	~3.8 GB	~5 GB	★★★☆☆	~40 tok/s
`llama3.1:8b`	~4.7 GB	~5 GB	★★★☆☆	~35 tok/s

My daily driver is qwen2.5-coder:14b. It fits exactly in 10 GB VRAM and the code completions are genuinely useful — not just plausible-looking nonsense. On CPU-only hardware, drop to qwen2.5-coder:7b; it's slower but still coherent.

Pull it:

ollama pull qwen2.5-coder:14b

Test it immediately:

ollama run qwen2.5-coder:14b "Write a Python function that retries a failed HTTP request with exponential backoff."

If you get a coherent function back in under 30 seconds, your setup is working.

Wiring Ollama into VS Code with Continue.dev

Continue (continue.dev) is an open-source VS Code and JetBrains extension that gives you a ChatGPT-style sidebar plus inline completions. It supports any OpenAI-compatible endpoint — which is exactly what Ollama exposes.

Install the extension:

In VS Code: Ctrl+Shift+X → search "Continue" → install.

Configure it to use your Ollama instance:

Open ~/.continue/config.json (Continue creates this on first launch). Replace the models array:

{
  "models": [
    {
      "title": "Qwen2.5 Coder 14B (local)",
      "provider": "ollama",
      "model": "qwen2.5-coder:14b",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5 Coder 7B (autocomplete)",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b",
    "apiBase": "http://localhost:11434"
  },
  "allowAnonymousTelemetry": false
}

I use the 14B model for chat (where latency is fine) and the 7B for inline autocomplete (where you need tokens fast). The allowAnonymousTelemetry: false is worth setting explicitly.

Once saved, the Continue sidebar (Cmd/Ctrl+L) will show your local model. Highlight a function, hit Cmd+L, and ask it to refactor — it works exactly like GitHub Copilot Chat, except the request never leaves your machine.

Tip: If Ollama is running on a different machine on your LAN, change apiBase to http://192.168.1.x:11434. That's it. One config line.

A Terminal Helper That Actually Gets Used

The VS Code integration is great, but I spend a lot of time in the terminal. I wanted ai "what does this awk command do" to just work. Here's the shell function I've been using for months:

# Add to ~/.bashrc or ~/.zshrc
ai() {
  local prompt="$*"
  if [ -z "$prompt" ]; then
    echo "Usage: ai <question>"
    return 1
  fi
  curl -s http://localhost:11434/api/chat \
    -H "Content-Type: application/json" \
    -d "$(jq -n \
      --arg content "$prompt" \
      '{model: "qwen2.5-coder:14b", messages: [{role: "user", content: $content}], stream: false}'\
    )" | jq -r '.message.content'
}

Source your shell config and try it:

source ~/.zshrc
ai "explain this bash one-liner: awk 'NR%2==0' file.txt"

You need jq installed (apt install jq / brew install jq). The stream: false flag makes it wait for the full response before printing — cleaner for terminal output.

For longer sessions where you want to pipe in a file:

ai-file() {
  local file="$1"
  shift
  local question="$*"
  local content
  content=$(cat "$file")
  ai "Given this file:\n\n$content\n\n$question"
}

# Usage:
ai-file src/auth.py "What are the security issues in this code?"

This is where the self-hosted angle really pays off. Piping a real auth module to OpenAI is the kind of thing that should make you nervous. Piping it to localhost is fine.

Making It Start on Boot and Stay Running

Ollama's systemd service handles this already on Linux. But there's one thing worth configuring: keeping your most-used model loaded in memory so the first request isn't slow.

Ollama unloads models from VRAM after 5 minutes of inactivity by default. You can override this with OLLAMA_KEEP_ALIVE:

# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=24h"

With this set, once you load a model it stays warm all day. On a dedicated dev box where you're not doing anything else GPU-intensive, this is the right call.

Check what's currently loaded:

curl http://localhost:11434/api/ps | jq '.models[].name'

Pre-warm the model at startup with a tiny cron job or systemd ExecStartPost:

# /etc/systemd/system/ollama-warmup.service
[Unit]
Description=Warm up Ollama model
After=ollama.service

[Service]
Type=oneshot
ExecStart=/usr/bin/curl -s -X POST http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5-coder:14b", "prompt": "hi", "stream": false}'

[Install]
WantedBy=multi-user.target

Enable it: systemctl enable --now ollama-warmup.service

First request of the day is now instant instead of a 10-second model load.

What This Setup Actually Costs

Let me be concrete, because "free" is relative.

Hardware path: My RTX 3080 box was already running as a home server. The GPU idles at ~~15W. Running inference peaks around 200W. At US average electricity rates (~~$0.16/kWh), a full 8-hour day of heavy use costs about $0.26. Compare that to GPT-4o at roughly $15 per million output tokens — a few hours of active coding assistance can easily hit $5-10/day on OpenAI.

No GPU path: A Mac Mini M4 (starts at $599 as of early 2025) runs the 7B model at a perfectly usable 20-25 tokens/second. If you're buying new hardware specifically for this, that's the recommendation. It's also dead quiet and uses 10-20W at load.

CPU-only on a Linux box: Painful. A 7B model on a modern Ryzen 9 does about 3-5 tokens/second. Usable for non-interactive tasks (batch analysis, file review), but too slow for inline autocomplete.

If you're already running a home server or have a Mac with decent unified memory, the marginal cost of adding a self-hosted AI code assistant is essentially zero.

Conclusion: Set This Up This Weekend

Self-hosted AI code assistants aren't a compromise anymore. qwen2.5-coder:14b through Ollama is genuinely competitive with Copilot for the tasks I use it for — refactoring, explaining unfamiliar code, writing tests. It's not GPT-4o, but it's running on my hardware, it's always available offline, and it's never going to surprise me with a 3x price increase.

Here's what to do tomorrow: install Ollama, pull qwen2.5-coder:7b (if you're on CPU) or qwen2.5-coder:14b (if you have a GPU), install Continue in VS Code, and point it at localhost:11434. The whole setup takes under 20 minutes. Run it for a week and see if you still feel like you need the cloud version.

If you're already running a self-hosted dev environment, this fits right in — check out how I set up a local dev environment with Docker Compose for the broader context on keeping your whole stack off vendor infrastructure.