February 12, 2026 · AI & DevOps · 11 min read

Self-Hosting Your AI Assistant on a Linux Server

I run my own AI assistant — Ellie — on a Linux box in my apartment. Not on AWS, not on Azure, not through an API. A physical machine under my desk, connected to my router, running 24/7. Here's why, and more importantly, here's how.

Why Self-Host?

Three reasons: privacy, cost, and control.

Privacy: everything Ellie processes stays on my network. My emails, calendar events, personal messages — none of it touches a third-party server for LLM processing. When I ask Ellie to summarize my inbox, the email content never leaves my machine.

Cost: running a 7B model locally costs electricity. About €15/month for the GPU running at ~200W average. Compare that to GPT-4 API costs for the same volume of requests — I'd be spending €100+/month easily, and I make thousands of requests per day between automated checks and manual queries.

Control: when OpenAI changes their API, my assistant doesn't break. When Claude has an outage, my local model keeps working. I control the model version, the system prompt, the rate limits — everything.

Hardware Setup

My server is a desktop PC repurposed as a headless server:

CPU: AMD Ryzen 7 5800X (8 cores, useful for CPU-bound preprocessing)
GPU: NVIDIA RTX 3080 Ti (12 GB VRAM — runs 7B-13B models comfortably)
RAM: 32 GB DDR4-3200 (plenty for the OS, model loading, and concurrent services)
Storage: 1 TB NVMe for the OS and models, plus a RAID array for bulk storage
OS: Ubuntu 24.04 LTS (boring and reliable, exactly what a server needs)

Total cost for this hardware: about €800, bought mostly second-hand. The GPU was the expensive part — got it for €450 from someone upgrading to a 4090.

Step 1: Install Ollama

Ollama is the easiest way to run LLMs locally. One command to install, one command to run a model:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5:7b
ollama run qwen2.5:7b "Hello, what can you do?"

That's it. Ollama handles model downloading, GGUF format conversion, GPU detection, and serves an OpenAI-compatible API on port 11434. For a self-hosted setup, this is the foundation.

I run Ollama as a systemd service so it starts on boot and restarts on crash:

# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama LLM Server
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/ollama serve
Restart=always
RestartSec=5
Environment="OLLAMA_HOST=0.0.0.0"

[Install]
WantedBy=multi-user.target

Setting OLLAMA_HOST=0.0.0.0 makes it accessible from other machines on my network. My Windows desktop can use the same model server, which is useful for development.

Step 2: Choose Your Model

Model selection depends on your VRAM. Here's my cheat sheet:

8 GB VRAM: Qwen 2.5 7B Q4_K_M (~5 GB), Llama 3.1 8B Q4_K_M (~5.5 GB)
12 GB VRAM: Qwen 2.5 14B Q4_K_M (~9 GB), Mistral Nemo 12B Q5_K_M (~9.5 GB)
24 GB VRAM: Qwen 2.5 32B Q4_K_M (~19 GB), Llama 3.1 70B Q2_K (~27 GB with some CPU offload)

I primarily use Qwen 2.5 7B for fast responses (about 80 tokens/second on my 3080 Ti) and Qwen 2.5 14B for tasks that need better reasoning (about 35 tokens/second). The 14B fits in 12 GB with Q4 quantization and leaves enough VRAM headroom for the KV cache.

Step 3: Build the Assistant Layer

Ollama gives you an LLM. But an LLM isn't an assistant. An assistant needs:

Memory: Persistence across conversations. I use a file-based system — daily logs in Markdown files, with a curated long-term memory file.
Tools: The ability to check email, read files, run commands, browse the web. I expose these as function calls that the LLM can invoke.
Proactivity: Heartbeat checks that run every 30 minutes — scan for new emails, upcoming calendar events, system alerts.
Multi-model routing: Use the local model for routine tasks, but route complex reasoning to Claude or GPT via API when needed.

The orchestration layer is a Node.js application that manages conversations, dispatches tool calls, and handles the heartbeat loop. It talks to Ollama via the HTTP API and to external models via their respective APIs.

Step 4: Connect Your Services

The assistant becomes useful when it can access your digital life. Here's what I've connected:

Gmail: Via Google Workspace CLI tool (gog), authenticated with OAuth. Ellie can search, read, and draft emails.
Google Calendar: Same tool. She checks upcoming events and sends me reminders.
File system: Direct access to my project directories. She can read code, check git status, review logs.
Web search: Via Brave Search API (free tier, 2000 queries/month). For quick fact-checking and research.
Messaging: WhatsApp and Discord integrations for sending me notifications.

Step 5: Make It Persistent

A server that needs babysitting isn't a server. Here's my reliability checklist:

systemd services: Both Ollama and the assistant app run as systemd units with Restart=always
Unattended upgrades: Security patches install automatically. I review and apply kernel updates manually.
UPS: A basic APC 700VA UPS gives me about 15 minutes of battery, enough to survive brief power outages (common in Portugal).
Monitoring: A simple cron job that pings the Ollama API every 5 minutes and sends me a WhatsApp message if it fails 3 times in a row.
Backups: Daily rsync of configuration and memory files to an external drive.

The Reality Check

Self-hosting an AI assistant is not for everyone. It requires Linux system administration skills, comfort with the command line, and willingness to debug issues at 2 AM when something crashes. The local models, while impressive, are noticeably less capable than GPT-4 or Claude Opus for complex reasoning tasks.

But for someone who values privacy, enjoys tinkering, and wants an always-on assistant that doesn't depend on external services — it's absolutely worth it. My electricity bill went up by €15/month. My API costs went down by €100/month. And I have full control over every aspect of the system.

If you have an old gaming PC gathering dust, you already have the hardware. Give it a shot.

💬 Comments

← Back to Blog