You have decided to put a language model into production. Maybe it is a customer support chatbot, a retrieval system that answers questions over your internal documents, or a copilot that drafts emails and summarises reports for your team. The model works in testing. Now you need it to run reliably, every day, fast enough that people do not abandon it, and ideally without sending your customers' data to a server you do not control. That is the job of an inference server, and it is a different machine from the one you trained or fine-tuned on.
This is a practical guide to specifying and running one in Nigeria. If you are still deciding which GPU class fits your workload, start with our guide to building an AI-ready workstation in Nigeria, then come back here for the production-serving angle.
Inference Is Not Training
The single most expensive assumption businesses make is that a serving machine needs the same hardware as a training rig. It does not. Training has to hold the model weights, the gradients, the optimiser states, and a large batch of activations all in memory at once, which is why training rigs are so memory-hungry and often need multiple high-end cards. Inference only has to hold the weights and a much smaller working set. The result is that a single well-chosen GPU often serves a whole team or a live product comfortably.
- Training: VRAM-bound, needs gradients and optimiser states, benefits from multiple GPUs, run in bursts.
- Inference: needs enough VRAM to fit the model plus its working cache, runs continuously, and is judged on throughput and latency rather than raw compute.
If your plan also includes occasional fine-tuning, that is a separate consideration covered in our dual-GPU training rig walkthrough. For pure serving, keep the two budgets apart in your head.
Sizing VRAM to the Model and the KV Cache
Two things consume GPU memory at inference time. First, the model weights, which depend on the parameter count and the precision you run at. A model quantised to a 4-bit format takes roughly half the memory of an 8-bit version and a quarter of full 16-bit, with a usually acceptable quality trade-off for business tasks. Second, the KV cache, which is the running memory of the conversation or document the model is currently processing. The KV cache grows with how long your prompts and outputs are and with how many requests you serve at the same time.
This second point catches people out. A model that fits comfortably for one user can run out of memory when twenty people hit it at once, because each concurrent request carries its own slice of KV cache. So you size VRAM as the model weights plus headroom for the number of simultaneous requests you expect at the longest context you allow. For a deeper treatment of capacity, see how much GPU VRAM you actually need and the plain-English explainer on what VRAM is and why it matters for AI work.
Choosing a Serving Stack
The software you wrap around the model matters as much as the GPU. You are not running the model in a notebook anymore; you need something built to handle many requests efficiently.
- vLLM: the default choice when you want throughput. It uses smart memory management for the KV cache and batches incoming requests together so the GPU stays busy. If you are serving a product or a busy internal tool, this is usually where you start.
- llama.cpp: lighter weight, excellent for smaller models, and able to offload part of the model to system RAM and CPU when it does not fully fit in VRAM. Useful for modest workloads, edge cases, or squeezing a slightly larger model onto a smaller card at the cost of some speed.
- A web layer such as FastAPI: wraps either engine in a clean HTTP interface so your applications call the model like any other internal service, with authentication and logging in front of it.
For most Nigerian businesses the pattern is vLLM behind FastAPI for the main service, with llama.cpp held in reserve for lighter or experimental tasks.
Batching, Concurrency, and the Latency Trade-off
There is one core tension in production inference: latency versus throughput. Latency is how long a single user waits for an answer. Throughput is how many answers the machine produces per second across everyone. These pull against each other. Batching requests together, processing several at once, raises throughput dramatically because the GPU is built for parallel work, but it can add a little to each individual user's wait while the batch fills.
A good serving engine manages this for you with continuous batching, slotting new requests into the work in flight rather than waiting for a fixed batch to complete. For a customer-facing chatbot you tune toward low latency so replies feel instant. For an overnight document-processing job you tune toward throughput and let batches run large. Knowing which one you are optimising for is half the configuration work. If the underlying difference between CPU and GPU work is still fuzzy, our CPU versus GPU explainer is a useful primer.
One GPU That Serves a Whole Team
For a great many business cases a single modern GPU is enough. A mid-to-high-end card with generous VRAM can serve a small or medium model to a department, or power a customer-facing feature with steady traffic, without ever touching a second GPU. This is the sweet spot we build toward most often: one capable card, fast system memory, a sensible CPU to feed it, and storage quick enough to load the model without a long cold start.
What pushes you to a second GPU is one of three things. You need to serve a model too large to fit on one card. Your concurrency has grown past what one card's KV cache can hold at acceptable latency. Or you want redundancy so a single failure does not take the service down. Until you hit one of those, a second GPU is money spent on capacity you are not using. Our local LLM inference rig build walkthrough shows a single-GPU server put together end to end.
Uptime Under Nigerian Power
A serving machine is only useful if it is up, and "up" is the genuinely hard part here. An inference server runs continuously, which means it must survive the grid. The non-negotiables are a UPS or inverter sized to ride through outages and to hand the machine off to your generator cleanly, plus a server configured to restart gracefully and reload the model automatically when power returns. A model that takes two minutes to load back into VRAM after every blip will quietly destroy your uptime numbers.
- Power protection: a UPS or inverter with enough runtime to bridge to generator power, and to shut down cleanly if it cannot.
- Auto-recovery: the service set to start on boot and reload the model without a human present.
- Cooling for the climate: a machine running flat out all day in Nigerian ambient temperatures needs cooling planned for sustained load, not bursts.
We treat this as core engineering, not an afterthought. See optimising a PC for Nigerian power conditions and our guide to the best UPS for extended workstation runtime for the specifics.
Why On-Prem Wins for Business
The case for keeping inference in your own building rests on two pillars. The first is data privacy. When you serve your own model, your customers' messages and your internal documents never leave your premises, which matters for sensitive sectors and for any business that wants to control exactly where its data lives. The second is predictable cost. Cloud inference is billed per token or per hour, in dollars, and a popular feature can produce a frightening bill. An on-prem server is a one-time capital cost in a known currency-exposed range, after which serving an extra thousand requests costs you essentially nothing beyond power. For a business with steady, ongoing inference demand, the on-prem machine usually pays for itself well inside its first year against cloud spend, and the savings only compound as usage grows.
Frequently Asked Questions
How much VRAM does an inference server need? Enough to fit the model at your chosen precision plus headroom for the KV cache of all the requests you serve at once. Quantising the model and capping the maximum context length both reduce the requirement substantially.
Do I need multiple GPUs to serve a model in production? Usually not. A single modern GPU with adequate VRAM serves most team-sized or product workloads. Move to a second GPU only when the model will not fit, concurrency outgrows one card, or you need redundancy.
Is on-prem inference cheaper than the cloud? For steady, ongoing demand, almost always. Cloud billing is per token in dollars and scales with use, while an on-prem server is a fixed upfront cost after which extra requests are nearly free, and it keeps your data in-house.
The Bottom Line
An inference server is a focused machine, not a scaled-down training rig. Size the VRAM to your served model plus its KV cache, pick a serving stack matched to whether you care more about latency or throughput, start with a single capable GPU, and engineer the power and cooling for continuous Nigerian operation. Get those four right and one machine will quietly serve your chatbots, RAG systems, and copilots for years at a cost you can predict.
Ready to put a model into production? Configure an inference server online → or talk to our team → and we will size the GPU, VRAM, and power protection to the exact workload you plan to serve.