Running a capable LLM locally in 2026 is no longer a research project. The 70B-class open models (Llama 3 70B, Mistral Large, Qwen 2.5) match GPT-3.5 quality on most tasks, sometimes reach GPT-4 territory on specific ones, and run on a single consumer-grade workstation. For privacy-sensitive work such as legal review, medical-record summarisation, malware analysis, and security research, local AI is the right tool. This tutorial walks through the build end to end.
Step 1: Hardware
Llama 3 70B at 4-bit quantisation needs roughly 40 GB of memory. Three workable hardware paths:
Single GPU with enough VRAM: an RTX 4090 (24 GB) won’t fit a 70B model, and neither will an RTX 5090 (32 GB). You need an A6000 (48 GB, ~$4500 used) or two RTX 4090s with the model split across them.
Apple Silicon: an M2 Ultra Mac Studio with 128 GB unified memory (~$5000) runs 70B at usable speeds (10-15 tokens/sec). An M3 Max MacBook Pro with 128 GB also works for development. The unified memory architecture makes Apple hardware unusually well-suited, because the GPU draws on the same memory pool as the CPU.
CPU + system RAM: slow, but free if you already have 64+ GB of DDR5. Expect 2-4 tokens/sec, which is too slow for chat but fine for batch jobs.
For most readers the M2/M3 Ultra Mac Studio is the cleanest answer. For production-leaning setups, use dual RTX 4090s. Note that the RTX 4090 has no NVLink connector, so the two cards communicate over PCIe.
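The "roughly 40 GB" figure can be reproduced with back-of-envelope arithmetic. The sketch below is an estimate, not a measurement: the function name is illustrative, and the 15% allowance for KV cache and runtime buffers is an assumption that varies with context length and runtime.

```python
def estimate_model_memory_gb(params_billions: float, bits_per_weight: float,
                             overhead_fraction: float = 0.15) -> float:
    """Rough memory footprint in GB for a quantised model.

    overhead_fraction is an ASSUMED allowance for the KV cache and runtime
    buffers; it is not a figure published by any runtime.
    """
    # Weights: one value per parameter, bits_per_weight bits each.
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    total_bytes = weight_bytes * (1 + overhead_fraction)
    return total_bytes / 1e9


def fits(memory_gb: float, params_billions: float, bits_per_weight: float) -> bool:
    """True if the estimated footprint fits in the given memory budget."""
    return estimate_model_memory_gb(params_billions, bits_per_weight) <= memory_gb
```

For a 70B model at 4 bits the weights alone come to 35 GB, and the assumed overhead brings the estimate to about 40 GB. That is why `fits(24, 70, 4)` and `fits(32, 70, 4)` are false while `fits(48, 70, 4)` is true.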
Step 2: Install Ollama
Ollama is the cleanest local-LLM runtime in 2026: it handles model downloads, quantisation, and GPU acceleration, and serves a local API.
Linux: curl -fsSL https://ollama.com/install.sh | sh
macOS and Windows: download the installer from ollama.com.
Pull the model:
ollama pull llama3.3:70b
That downloads ~40 GB. Test it:
ollama run llama3.3:70b "Explain prompt injection in two sentences."
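Because Ollama also serves a local HTTP API, the same test can be scripted. Here is a minimal Python sketch, assuming the default address 127.0.0.1:11434 and the /api/generate endpoint with streaming turned off. The helper names and the 600-second timeout are illustrative choices, not part of Ollama.

```python
import json
import urllib.request

# Assumed default local address; change it if OLLAMA_HOST is set differently.
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def build_generate_request(prompt: str, model: str = "llama3.3:70b") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3.3:70b", timeout: float = 600.0) -> str:
    """Send the prompt and return the generated text (requires a running server)."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # With "stream": False the reply is a single JSON object whose
    # "response" field holds the full completion.
    return body["response"]
```

With the server running, `generate("Explain prompt injection in two sentences.")` should return the same kind of answer as the CLI test. The generous timeout allows for the first call, which has to load the model into memory.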
Step 3: Install Open WebUI for a chat interface
Ollama on its own is a CLI/API. For a ChatGPT-like web UI, install Open WebUI. The cleanest way is Docker:
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main
Browse to http://localhost:3000, create the admin account (the first user becomes admin), and you have a chat UI talking to your local Llama. Multi-user auth, conversation history, and document upload for RAG are all included.
Step 4: Bind correctly so it doesn’t leak
By default Ollama binds to 127.0.0.1, which is correct. If you want to access it from another machine on your local network, set OLLAMA_HOST=0.0.0.0:11434, but only do that on a network you trust, ideally behind a firewall and a VLAN that doesn’t reach the internet.
Do not expose Ollama or Open WebUI to the public internet. Ollama’s API has no authentication at all, Open WebUI is protected only by its own login page, and drive-by scanners find exposed instances within hours.
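A quick sanity check on the bind setting can be scripted. The sketch below classifies an OLLAMA_HOST-style value as loopback-only or not. It is a hypothetical helper, not part of Ollama, and it assumes plain "host" or "host:port" values (optionally with a scheme). Anything it cannot parse is treated as not loopback, so it fails safe.

```python
import ipaddress


def binds_loopback_only(ollama_host: str) -> bool:
    """Return True if an OLLAMA_HOST-style value points at a loopback address.

    Hypothetical helper: unparseable values (including hostnames other than
    "localhost") are reported as NOT loopback so the check fails safe.
    """
    host = ollama_host.strip()
    if "://" in host:
        # Drop an optional scheme such as "http://".
        host = host.split("://", 1)[1]
    if host.count(":") == 1:
        # Exactly one colon: treat it as "host:port" and drop the port.
        host = host.rsplit(":", 1)[0]
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False
```

Under these assumptions `binds_loopback_only("127.0.0.1:11434")` is true and `binds_loopback_only("0.0.0.0:11434")` is false. The second case is the one that needs a trusted network and a firewall in front of it.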
Step 5: Alternative tooling worth knowing
LM Studio: desktop GUI for browsing and running local models, no command line. Easier for non-technical users.
Jan: open-source ChatGPT alternative that runs entirely locally, with a polished UI.
Hugging Face: the underlying model marketplace. If you want to verify downloaded weights against published hashes, Hugging Face is the source of truth.
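Hash verification itself takes only a few lines. Here is a minimal sketch using Python’s standard hashlib, assuming the published value is a SHA-256 hex digest that you have copied from the model’s file listing. The function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks.

    Chunked reads keep memory use flat even for multi-gigabyte weight files.
    """
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, published_sha256: str) -> bool:
    """Compare a local file against a published SHA-256 hex digest."""
    return sha256_of_file(path) == published_sha256.strip().lower()
```

Call `matches_published_hash(Path("model.safetensors"), "<published digest>")`, where both the file name and the digest are placeholders for your own values. Do not load the weights if the result is false.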
Step 6: Use cases that justify the setup
Don’t run local AI for every prompt; cloud models are still better and cheaper for general use. Run local for:
- Reviewing leaked datasets, malware samples, or anything sensitive that shouldn’t leave your perimeter
- Privileged legal or medical document review where data residency is contractual
- Bulk processing where you’d hit cloud rate limits or rack up significant API spend
- Offline work: flights, conferences with hostile networks, sensitive client sites
For everything else, cloud is fine. The hybrid setup (Anthropic for general chat, local Llama for sensitive work) is what most practitioners actually run in 2026.
