Self-Hosting Your AI Assistant on a Linux Server
I run my own AI assistant — Ellie — on a Linux box in my apartment. Not on AWS, not on Azure, not through an API. A physical machine under my desk, connected to my router, running 24/7. Here's why, and more importantly, here's how.
Why Self-Host?
Three reasons: privacy, cost, and control.
Privacy: everything Ellie processes locally stays on my network. My emails, calendar events, personal messages: none of it touches a third-party server for LLM processing unless I deliberately route a task to an external model (see multi-model routing in Step 3). When I ask Ellie to summarize my inbox, the email content never leaves my machine.
Cost: running a 7B model locally costs electricity. About €15/month for the GPU running at ~200W average. Compare that to GPT-4 API costs for the same volume of requests — I'd be spending €100+/month easily, and I make thousands of requests per day between automated checks and manual queries.
Control: when OpenAI changes their API, my assistant doesn't break. When Claude has an outage, my local model keeps working. I control the model version, the system prompt, the rate limits — everything.
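The electricity figure is easy to sanity-check. Here's a small Python sketch of the arithmetic, assuming a 30-day month and a constant 200 W draw. The tariff it prints is only what the two numbers above imply, not a quoted price.

```python
def monthly_kwh(avg_watts: float, days: int = 30) -> float:
    """Energy used in a month at a constant average draw."""
    return avg_watts * 24 * days / 1000


def implied_tariff(monthly_cost_eur: float, avg_watts: float, days: int = 30) -> float:
    """Price per kWh implied by a monthly cost and an average draw."""
    return monthly_cost_eur / monthly_kwh(avg_watts, days)


kwh = monthly_kwh(200)            # 144 kWh for a 30-day month
tariff = implied_tariff(15, 200)  # about 0.104 EUR/kWh
print(f"{kwh:.0f} kWh/month, implied tariff {tariff:.3f} EUR/kWh")
```

If your tariff is different, multiply it by the kWh figure: at 0.20 EUR/kWh, for example, the same draw would cost about 29 EUR/month.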
Hardware Setup
My server is a desktop PC repurposed as a headless server:
- CPU: AMD Ryzen 7 5800X (8 cores, useful for CPU-bound preprocessing)
- GPU: NVIDIA RTX 3080 Ti (12 GB VRAM, enough to run 7B-14B models at 4-bit quantization)
- RAM: 32 GB DDR4-3200 (plenty for the OS, model loading, and concurrent services)
- Storage: 1 TB NVMe for the OS and models, plus a RAID array for bulk storage
- OS: Ubuntu 24.04 LTS (boring and reliable, exactly what a server needs)
Total cost for this hardware: about €800, bought mostly second-hand. The GPU was the expensive part — got it for €450 from someone upgrading to a 4090.
Step 1: Install Ollama
Ollama is the easiest way to run LLMs locally. One command to install, one command to run a model:
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5:7b
ollama run qwen2.5:7b "Hello, what can you do?"
That's it. Ollama handles model downloading, GGUF model storage, and GPU detection, and serves its own REST API plus OpenAI-compatible endpoints (under /v1) on port 11434. For a self-hosted setup, this is the foundation.
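Once the server is up, anything that speaks HTTP can use it. Here's a minimal Python sketch that calls Ollama's native /api/generate endpoint with streaming turned off. The helper names (build_generate_request, generate) and the 120-second timeout are illustrative choices, and it assumes the server is on localhost with the default port.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama port; adjust for a remote host


def build_generate_request(prompt: str, model: str = "qwen2.5:7b") -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "qwen2.5:7b", timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running and the model pulled, generate("Hello, what can you do?") should return the model's reply as a string.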
I run Ollama as a systemd service so it starts on boot and restarts on crash:
# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama LLM Server
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/ollama serve
Restart=always
RestartSec=5
Environment="OLLAMA_HOST=0.0.0.0"

[Install]
WantedBy=multi-user.target
Setting OLLAMA_HOST=0.0.0.0 makes it accessible from other machines on my network. My Windows desktop can use the same model server, which is useful for development.
Step 2: Choose Your Model
Model selection depends on your VRAM. Here's my cheat sheet:
- 8 GB VRAM: Qwen 2.5 7B Q4_K_M (~5 GB), Llama 3.1 8B Q4_K_M (~5.5 GB)
- 12 GB VRAM: Qwen 2.5 14B Q4_K_M (~9 GB), Mistral Nemo 12B Q5_K_M (~9.5 GB)
- 24 GB VRAM: Qwen 2.5 32B Q4_K_M (~19 GB), Llama 3.1 70B Q2_K (~27 GB with some CPU offload)
I primarily use Qwen 2.5 7B for fast responses (about 80 tokens/second on my 3080 Ti) and Qwen 2.5 14B for tasks that need better reasoning (about 35 tokens/second). The 14B fits in 12 GB with Q4 quantization and leaves enough VRAM headroom for the KV cache.
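The cheat sheet boils down to "weights plus headroom must fit in VRAM". Here's a small Python sketch of that rule using the sizes listed above. The 2 GB headroom default is an assumption for illustration; how much the KV cache really needs depends on the model and the context length.

```python
# (model, quantization, approximate size in GB), copied from the cheat sheet above
CHEAT_SHEET = [
    ("Qwen 2.5 7B", "Q4_K_M", 5.0),
    ("Llama 3.1 8B", "Q4_K_M", 5.5),
    ("Qwen 2.5 14B", "Q4_K_M", 9.0),
    ("Mistral Nemo 12B", "Q5_K_M", 9.5),
    ("Qwen 2.5 32B", "Q4_K_M", 19.0),
    ("Llama 3.1 70B", "Q2_K", 27.0),
]


def models_that_fit(vram_gb: float, headroom_gb: float = 2.0) -> list:
    """Return models whose weights fit in VRAM with headroom left for the KV cache."""
    budget = vram_gb - headroom_gb
    return [f"{name} {quant}" for name, quant, size in CHEAT_SHEET if size <= budget]


for vram in (8, 12, 24):
    print(f"{vram} GB:", ", ".join(models_that_fit(vram)))
```

The 70B entry never passes this check on a 24 GB card, which matches the note that it only runs with some layers offloaded to the CPU.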
Step 3: Build the Assistant Layer
Ollama gives you an LLM. But an LLM isn't an assistant. An assistant needs:
- Memory: Persistence across conversations. I use a file-based system — daily logs in Markdown files, with a curated long-term memory file.
- Tools: The ability to check email, read files, run commands, browse the web. I expose these as function calls that the LLM can invoke.
- Proactivity: Heartbeat checks that run every 30 minutes — scan for new emails, upcoming calendar events, system alerts.
- Multi-model routing: Use the local model for routine tasks, but route complex reasoning to Claude or GPT via API when needed.
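To make the memory bullet concrete, here's a minimal Python sketch of file-based memory: one Markdown log per day plus a long-term file that gets prepended to the prompt. The file name MEMORY.md, the two-day window, and both function names are assumptions for illustration, not the layout Ellie actually uses.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def append_to_daily_log(memory_dir: Path, entry: str, now: Optional[datetime] = None) -> Path:
    """Append a timestamped bullet to today's Markdown log, creating it if needed."""
    now = now or datetime.now()
    memory_dir.mkdir(parents=True, exist_ok=True)
    log_path = memory_dir / f"{now:%Y-%m-%d}.md"
    if not log_path.exists():
        log_path.write_text(f"# {now:%Y-%m-%d}\n\n", encoding="utf-8")
    with log_path.open("a", encoding="utf-8") as f:
        f.write(f"- {now:%H:%M} {entry}\n")
    return log_path


def load_context(memory_dir: Path, long_term_name: str = "MEMORY.md", days: int = 2) -> str:
    """Concatenate the long-term file and the most recent daily logs for the prompt."""
    parts = []
    long_term = memory_dir / long_term_name
    if long_term.exists():
        parts.append(long_term.read_text(encoding="utf-8"))
    # Daily logs are named YYYY-MM-DD.md, so sorting by name sorts by date
    daily = sorted(memory_dir.glob("????-??-??.md"))
    parts.extend(p.read_text(encoding="utf-8") for p in daily[-days:])
    return "\n".join(parts)
```

Plain files keep the memory greppable and easy to back up, and curating the long-term file is just editing Markdown.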
The orchestration layer is a Node.js application that manages conversations, dispatches tool calls, and handles the heartbeat loop. It talks to Ollama via the HTTP API and to external models via their respective APIs.
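The tool-dispatch part of that loop is simple at its core. Here's a rough Python sketch (the real layer described here is Node.js, so treat this as an illustration of the idea only). It assumes the assistant message carries a tool_calls list shaped like the one Ollama's /api/chat returns when tools are supplied; read_file is a hypothetical example tool, and the reply shape may need extra fields for other APIs.

```python
import json
from typing import Any, Callable, Dict, List

TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function so the model can call it by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def read_file(path: str, max_chars: int = 2000) -> str:
    """Hypothetical example tool: return the start of a text file."""
    with open(path, encoding="utf-8") as f:
        return f.read(max_chars)


def dispatch_tool_calls(message: dict) -> List[dict]:
    """Run every tool call in an assistant message and return 'tool' role replies."""
    replies = []
    for call in message.get("tool_calls") or []:
        name = call["function"]["name"]
        args = call["function"].get("arguments") or {}
        if isinstance(args, str):  # some APIs send arguments as a JSON string
            args = json.loads(args)
        if name not in TOOLS:
            result = f"error: unknown tool {name!r}"
        else:
            try:
                result = str(TOOLS[name](**args))
            except Exception as exc:  # report failures back to the model
                result = f"error: {exc}"
        replies.append({"role": "tool", "content": result})
    return replies
```

The replies get appended to the conversation and sent back to the model, which then either calls another tool or answers. Returning errors as text instead of raising lets the model see what went wrong and try again.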
Step 4: Connect Your Services
The assistant becomes useful when it can access your digital life. Here's what I've connected:
- Gmail: Via Google Workspace CLI tool (gog), authenticated with OAuth. Ellie can search, read, and draft emails.
- Google Calendar: Same tool. She checks upcoming events and sends me reminders.
- File system: Direct access to my project directories. She can read code, check git status, review logs.
- Web search: Via Brave Search API (free tier, 2000 queries/month). For quick fact-checking and research.
- Messaging: WhatsApp and Discord integrations for sending me notifications.
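Of these integrations, web search is the easiest to show in a few lines. Here's a Python sketch that builds a Brave Search request and guards the 2000-query monthly allowance. The endpoint and the X-Subscription-Token header are as I understand Brave's public API, so check the current docs before relying on them. MonthlyQuota is a hypothetical in-memory helper; a real one would persist its counter across restarts.

```python
import urllib.parse
import urllib.request

BRAVE_ENDPOINT = "https://api.search.brave.com/res/v1/web/search"


def build_search_request(query: str, api_key: str, count: int = 5) -> urllib.request.Request:
    """Build a GET request for a web search with the API key in a header."""
    params = urllib.parse.urlencode({"q": query, "count": count})
    return urllib.request.Request(
        f"{BRAVE_ENDPOINT}?{params}",
        headers={"Accept": "application/json", "X-Subscription-Token": api_key},
    )


class MonthlyQuota:
    """In-memory counter that refuses calls once the monthly limit is reached."""

    def __init__(self, limit: int = 2000):
        self.limit = limit
        self.month = None
        self.used = 0

    def try_consume(self, year_month: str) -> bool:
        """Count one query for the given month (e.g. '2025-01'); False if over the limit."""
        if year_month != self.month:  # new month: reset the counter
            self.month, self.used = year_month, 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True
```

Calling try_consume before each search lets the assistant fall back to "I can't search right now" instead of hitting an API error when the free tier runs out.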
Step 5: Make It Persistent
A server that needs babysitting isn't a server. Here's my reliability checklist:
- systemd services: Both Ollama and the assistant app run as systemd units with Restart=always.
- Unattended upgrades: Security patches install automatically. I review and apply kernel updates manually.
- UPS: A basic APC 700VA UPS gives me about 15 minutes of battery, enough to survive brief power outages (common in Portugal).
- Monitoring: A simple cron job that pings the Ollama API every 5 minutes and sends me a WhatsApp message if it fails 3 times in a row.
- Backups: Daily rsync of configuration and memory files to an external drive.
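The monitoring item has one subtlety: cron starts a fresh process every 5 minutes, so the "3 times in a row" counter has to live on disk. Here's a Python sketch of that logic. The state file path is a hypothetical choice, send_alert stands in for whatever delivers the WhatsApp message, and /api/tags is used as a cheap endpoint to probe.

```python
import json
import urllib.request
from pathlib import Path
from typing import Callable, Tuple

STATE_FILE = Path("/var/tmp/ollama_health.json")  # hypothetical location
FAILURE_THRESHOLD = 3


def ollama_is_up(url: str = "http://localhost:11434/api/tags", timeout: float = 10.0) -> bool:
    """Return True if the Ollama API answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, refused connections, and timeouts
        return False


def next_state(failures: int, ok: bool, threshold: int = FAILURE_THRESHOLD) -> Tuple[int, bool]:
    """Return (new consecutive-failure count, whether to alert on this run)."""
    if ok:
        return 0, False
    failures += 1
    return failures, failures == threshold  # alert once, when the threshold is hit


def run_check(send_alert: Callable[[str], None], state_file: Path = STATE_FILE,
              check: Callable[[], bool] = ollama_is_up) -> None:
    """One cron invocation: load the counter, probe, save, and maybe alert."""
    try:
        failures = int(json.loads(state_file.read_text())["failures"])
    except (OSError, ValueError, KeyError, TypeError):
        failures = 0  # missing or corrupt state file: start from zero
    failures, alert = next_state(failures, check())
    state_file.write_text(json.dumps({"failures": failures}))
    if alert:
        send_alert(f"Ollama failed {failures} health checks in a row")
```

Alerting only when the count equals the threshold means one message per outage rather than one every 5 minutes. A crontab entry along the lines of */5 * * * * python3 /path/to/check.py (path hypothetical) would drive it, with the script calling run_check and passing in its notification function.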
The Reality Check
Self-hosting an AI assistant is not for everyone. It requires Linux system administration skills, comfort with the command line, and willingness to debug issues at 2 AM when something crashes. The local models, while impressive, are noticeably less capable than GPT-4 or Claude Opus for complex reasoning tasks.
But for someone who values privacy, enjoys tinkering, and wants an always-on assistant that doesn't depend on external services — it's absolutely worth it. My electricity bill went up by €15/month. My API costs went down by €100/month. And I have full control over every aspect of the system.
If you have an old gaming PC gathering dust, you already have the hardware. Give it a shot.