Running large language models (LLMs) locally, with tools like Ollama, vLLM, or LM Studio, gives you private, offline AI with no API costs and complete control. But it has one hard ceiling: VRAM. A model's size (in parameters) and its quantization determine how much VRAM it needs. If the model doesn't fit in your GPU's memory, it either won't run at all or runs very slowly, because part of it spills over into much slower system RAM. This guide walks through building a local LLM inference rig, covering VRAM planning, the dual-3090-vs-single-4090 decision, and the quantization realities that make larger models fit.
It builds on our ML workstation guide and AI-ready workstation guide.
VRAM Is the Hard Ceiling
For LLM inference, your GPU's VRAM sets the absolute limit on which models you can run:
- Bigger models need more VRAM: a model's parameters must (largely) fit in VRAM to run well. A small model (7B–8B class, quantized to 4-bit) fits in 8–12GB; a 70B-class model needs about 35GB for its weights alone even at 4-bit.
- NVIDIA again: as with image generation, most local LLM tooling targets CUDA first, so NVIDIA GPUs are the practical choice.
- VRAM, then VRAM, then speed: whether a model runs at all is about VRAM capacity; how fast it runs is secondary. Buy capacity first.
Quantization: How Big Models Fit Smaller Cards
The crucial concept: quantization compresses a model's weights so it uses less VRAM (and usually runs faster) at a small quality cost. At 16-bit precision each parameter takes 2 bytes, so a 70B model needs about 140GB for its weights alone; quantized to 4-bit, each parameter takes about half a byte and the same weights drop to about 35GB. This is what lets enthusiasts run surprisingly large models on consumer cards. The trade-off: heavier quantization saves more VRAM but reduces output quality further. In practice, sensible quantization (like 4-bit) is the standard way to fit large models locally with minimal noticeable loss, so plan your VRAM around quantized model sizes, not full-precision ones.
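The arithmetic behind that planning rule can be sketched in a few lines of Python. This is a rough estimate, not a measurement: the function names are hypothetical, and the 20% overhead for context (KV cache) and runtime buffers is an assumption that varies with context length and software.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 10^9 bytes)."""
    # billions of parameters * bytes per parameter = gigabytes
    return params_billions * bits_per_weight / 8


def estimated_vram_gb(params_billions: float, bits_per_weight: float,
                      overhead_fraction: float = 0.2) -> float:
    """Weights plus an ASSUMED allowance for KV cache and runtime buffers."""
    return weight_memory_gb(params_billions, bits_per_weight) * (1 + overhead_fraction)


if __name__ == "__main__":
    # Compare full precision (16-bit) with 8-bit and 4-bit quantization.
    for params in (8, 32, 70):
        for bits in (16, 8, 4):
            print(f"{params}B @ {bits}-bit: "
                  f"weights ~{weight_memory_gb(params, bits):.1f} GB, "
                  f"with overhead ~{estimated_vram_gb(params, bits):.1f} GB")
```

Real quantized files are often a little larger than the pure 4-bit figure because some layers are kept at higher precision, so treat the result as a lower bound and check the actual file size of the model you plan to download.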
Dual-3090 vs Single-4090 (The Key Decision)
This is the central hardware choice, and it hinges on VRAM:
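- Dual RTX 3090 (2×24 = 48GB): two used 3090s give you 48GB in total. The cards don't merge into a single pool by themselves; inference software such as Ollama or vLLM splits the model across both. That is enough for very large models, and often cheaper than one top new card, which makes it the well-known value route for serious local LLMs. It needs a platform with enough PCIe lanes for two cards, plus the PSU capacity and cooling to match.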
- Single RTX 4090/5090 (24/32GB): one powerful new card — simpler, more efficient, faster per-token, warrantied, but capped at its single-card VRAM. Better for models that fit in 24–32GB and for simplicity.
- The decision: dual-3090 for maximum VRAM (largest models) on a budget; single new card for simplicity, efficiency, and models that fit. VRAM needs drive the choice.
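To make the decision concrete, here is a minimal sketch that checks which of the three configurations above could hold a given model. The helper names are hypothetical, and the 20% overhead on top of the weights is an assumption, not a measured figure. It also treats the dual-3090 setup as one 48GB budget, which is optimistic: a split model must still fit each card's 24GB share.

```python
# Total VRAM per configuration, taken from the comparison above.
RIGS_GB = {
    "1x RTX 4090 (24GB)": 24,
    "1x RTX 5090 (32GB)": 32,
    "2x RTX 3090 (48GB)": 48,
}


def estimated_vram_gb(params_billions, bits_per_weight, overhead_fraction=0.2):
    """Weights (params * bits / 8) plus an ASSUMED overhead allowance, in GB."""
    return params_billions * bits_per_weight / 8 * (1 + overhead_fraction)


def rigs_that_fit(params_billions, bits_per_weight=4):
    """Return the configurations whose total VRAM covers the estimate."""
    need = estimated_vram_gb(params_billions, bits_per_weight)
    return [name for name, vram in RIGS_GB.items() if need <= vram]


if __name__ == "__main__":
    for params in (8, 32, 70):
        print(f"{params}B at 4-bit fits on: {rigs_that_fit(params) or 'none of these'}")
```

Under these assumptions a 70B-class model at 4-bit only fits the dual-3090 option, while models in the 30B range fit any of the three, which is where the simpler single card starts to make sense.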
The Rest of the Rig
- Enough system RAM (32–64GB) to load and manage models before they go to VRAM.
- Fast, generous NVMe storage — LLM files are very large.
- A platform with enough PCIe lanes for dual GPUs if going that route, possibly a high-end desktop (HEDT) platform such as Threadripper.
- Substantial PSU and cooling — two 3090s draw heavy sustained power.
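For the power side, a back-of-the-envelope PSU estimate might look like the sketch below. The inputs are assumptions to replace with your own parts: 350W is the rated board power of a reference RTX 3090, while the CPU figure, the allowance for the rest of the system, and the 30% headroom are illustrative defaults rather than recommendations from any vendor.

```python
def recommended_psu_watts(gpu_watts, gpu_count, cpu_watts=150,
                          other_watts=100, headroom=0.3):
    """Sum the sustained draw of the main parts and add a safety margin.

    cpu_watts, other_watts and headroom are ASSUMED defaults for illustration.
    """
    sustained = gpu_watts * gpu_count + cpu_watts + other_watts
    return round(sustained * (1 + headroom))


if __name__ == "__main__":
    # Two RTX 3090s at 350W rated board power each.
    print("Dual 3090:", recommended_psu_watts(350, 2), "W")
    # One RTX 4090 at 450W rated board power.
    print("Single 4090:", recommended_psu_watts(450, 1), "W")
```

The headroom matters because GPUs can briefly spike above their rated power, and a PSU running near its limit for hours runs hotter and louder.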
The Nigeria Tax
Local LLMs are compelling in Nigeria: no dollar-priced per-token API costs, full privacy, and offline operation once models are downloaded. The dual-used-3090 route makes large-model inference affordable, but test the cards carefully (see our used GPU guide), and budget for the power draw and cooling of two cards in our heat. Clean, protected power is essential. Plan VRAM around quantized model sizes and buy capacity first.
Frequently Asked Questions
What determines which LLMs I can run locally? VRAM — a model's size and quantization set how much VRAM it needs, and it must (largely) fit in your GPU's memory to run well. Buy VRAM capacity first; speed is secondary to whether the model runs at all. NVIDIA is the practical choice for CUDA tooling.
What is quantization? Compressing a model to use less VRAM and run faster, at a small quality cost. Sensible quantization (like 4-bit) lets you run surprisingly large models on consumer cards with minimal noticeable loss — so plan VRAM around quantized sizes, not full precision.
Dual 3090 or single 4090 for local LLMs? Dual used 3090s give 48GB of combined VRAM for the largest models, often for less money, but they need the PCIe lanes, power, and cooling for two cards, and software that splits the model across both. A single 4090/5090 is simpler, more efficient, and faster per-token but capped at its own VRAM. VRAM needs decide it.
The One Thing to Remember
Local LLM inference is VRAM-bound and NVIDIA-based — your GPU's VRAM (and the model's quantization) decide which models run at all, so buy capacity first. Dual used 3090s pool to 48GB for the largest models on a budget; a single new card is simpler and faster for models that fit. Plan VRAM around quantized sizes, support it with ample RAM, storage, power, and cooling, and in Nigeria you get private, offline AI with no API costs.
Building a local LLM rig? Configure a build online → or talk to our team → and we'll plan the VRAM (dual-3090 or single card) for the models you want to run.