VRAM & Hardware Simulator for On-Premise LLMs
Enter the model size, quantization, and context window to estimate the required GPU memory and compatible hardware configurations.
The main bottleneck for self-hosting an LLM is GPU memory (VRAM). This calculator estimates it from real model parameters: number of layers, number of KV heads (GQA), and context window.
Required VRAM: 5.8 GB
Compatible configurations:
RTX 4080 Super: Consumer, 16 GB
RTX 3090 / RTX 4090: Consumer, 24 GB
RTX A5000: Pro Workstation, 24 GB
RTX A6000 / L40S: Pro Workstation, 48 GB
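The compatibility list above amounts to a filter on VRAM capacity. The sketch below is a minimal illustration of that idea, not the simulator's actual code: the `GPUS` table copies the four configurations shown above, and `compatible_gpus` is a hypothetical helper name.

```python
# Configurations copied from the list above: (name, tier, VRAM in GB).
GPUS = [
    ("RTX 4080 Super", "Consumer", 16),
    ("RTX 3090 / RTX 4090", "Consumer", 24),
    ("RTX A5000", "Pro Workstation", 24),
    ("RTX A6000 / L40S", "Pro Workstation", 48),
]


def compatible_gpus(required_gb, gpus=GPUS):
    """Return the configurations whose VRAM is at least required_gb."""
    return [gpu for gpu in gpus if gpu[2] >= required_gb]


# Example: the 5.8 GB estimate shown above fits every listed configuration.
for name, tier, vram_gb in compatible_gpus(5.8):
    print(f"{name}: {tier}, {vram_gb} GB")
```

A requirement of 30 GB, by contrast, would leave only the 48 GB configuration.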
How to calculate VRAM for an LLM
Required GPU memory breaks down into two parts: model memory and KV Cache. Model memory depends directly on the number of parameters and the chosen numerical precision. A 7-billion-parameter model in FP16 (2 bytes per parameter) needs 14 GB for the weights alone, or around 16.8 GB once the 20% overhead margin is included. Quantization reduces this footprint: Q4_K_M cuts the size roughly fourfold compared to FP16, making inference possible on consumer GPUs like the RTX 4090 even for 13B to 34B models.
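The model-memory arithmetic can be sketched in a few lines. This is an illustration under stated assumptions rather than the simulator's implementation: the bytes-per-parameter values are the ones quoted on this page, the 20% margin is applied as a simple multiplier (which reproduces the 16.8 GB figure), and `model_memory_gb` is a hypothetical name. Real Q4_K_M files average somewhat more than 4 bits per weight, so the 0.5 value is an approximation.

```python
# Bytes per parameter as quoted on this page (Q4_K_M approximated as 4 bits).
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "Q4_K_M": 0.5}

# Assumption: the 20% margin is a flat multiplier on the weight size.
OVERHEAD = 1.20


def model_memory_gb(params_billion, precision, overhead=OVERHEAD):
    """Estimate weight memory in GB: parameters x bytes per parameter x margin."""
    return params_billion * BYTES_PER_PARAM[precision] * overhead


print(round(model_memory_gb(7, "FP16"), 1))     # 16.8
print(round(model_memory_gb(7, "Q4_K_M"), 1))   # 4.2
print(round(model_memory_gb(34, "Q4_K_M"), 1))  # 20.4, under the 24 GB of an RTX 4090
```

This covers the weights only; the KV Cache has to be added on top.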
Which GPU for Llama 3 and Mistral
Llama 3 8B quantized in Q4_K_M requires approximately 5 to 6 GB of VRAM for inference, making it compatible with an RTX 3080 or above. The 70B model requires a more serious configuration even when quantized: in Q4_K_M the weights alone take about 35 GB, so plan for at minimum an A100 80GB or two RTX 4090s using tensor parallelism over PCIe (NVLink was removed from the RTX 40 series). For enterprise use with long contexts (32k to 128k tokens), budget additional headroom for the KV Cache, which can reach several tens of gigabytes on the longest contexts.
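To see how the KV Cache grows with context, here is a minimal sketch of the usual sizing formula for a GQA transformer. It assumes a single sequence (batch size 1), an FP16 cache (2 bytes per value), and the published Llama 3 shapes (8B: 32 layers, 8 KV heads, head dimension 128; 70B: 80 layers, 8 KV heads, head dimension 128). `kv_cache_gb` is a hypothetical name, and the simulator may compute this differently.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Estimate KV cache size in GB for one sequence.

    The leading factor 2 accounts for storing one key and one value
    vector per layer, per KV head, per token.
    """
    total_bytes = (
        2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    )
    return total_bytes / 1e9


# 8B-style shape at an 8k context.
print(round(kv_cache_gb(32, 8, 128, 8_192), 2))    # 1.07

# 70B-style shape at 32k and 128k contexts.
print(round(kv_cache_gb(80, 8, 128, 32_768), 1))   # 10.7
print(round(kv_cache_gb(80, 8, 128, 131_072), 1))  # 42.9
```

The cache scales linearly with context length and with the number of concurrent sequences, which is why long-context serving needs far more headroom than the weights alone suggest.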
Hosting an LLM on-premise in your company
On-premise LLM hosting offers concrete advantages: no data leaves your infrastructure, there are no per-request API fees, and you maintain full control over your AI pipeline. From a GDPR perspective, it is the most robust solution for processing sensitive data (contracts, patient data, financial information). A typical SMB configuration consists of a server with one or two RTX 4090s, sufficient to run a quantized 13B to 34B model in production.
Frequently asked questions
What is the difference between FP16, INT8, and Q4_K_M?
FP16 uses 2 bytes per parameter for maximum precision. INT8 reduces this to 1 byte with a slight quality loss. Q4_K_M uses about 4 bits per parameter (roughly 0.5 bytes, slightly more in practice because it is a mixed-precision scheme): it is the standard trade-off from the llama.cpp community, with barely perceptible quality loss for most tasks.
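One way to use these figures is to pick the highest precision whose weights fit a given VRAM budget. The sketch below assumes the bytes-per-parameter values quoted above and a flat 20% margin, and it ignores the KV Cache, so treat the result as an upper bound on what fits. `best_precision` is a hypothetical helper.

```python
# Formats ordered from highest to lowest precision, with the
# bytes-per-parameter values quoted above (Q4_K_M approximated as 0.5).
FORMATS = [("FP16", 2.0), ("INT8", 1.0), ("Q4_K_M", 0.5)]


def best_precision(params_billion, vram_gb, overhead=1.20):
    """Return the highest-precision format whose weights fit in vram_gb.

    Weights only: the KV Cache is not counted. Returns None if even
    the smallest format does not fit.
    """
    for name, bytes_per_param in FORMATS:
        if params_billion * bytes_per_param * overhead <= vram_gb:
            return name
    return None


print(best_precision(13, 24))  # INT8
print(best_precision(34, 24))  # Q4_K_M
print(best_precision(70, 24))  # None
```

On a single 24 GB card, a 13B model fits in INT8 and a 34B model only in Q4_K_M, while a 70B model needs more than one such card.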
Can the KV Cache be offloaded to CPU RAM?
Yes, some frameworks (llama.cpp, ExLlamaV2) allow offloading part of the KV Cache to system RAM, but at the cost of significantly higher latency. For smooth inference in production, the entire KV Cache should fit in VRAM.
Are these estimates accurate for all models?
This simulator assumes standard Llama/Mistral/Qwen-type architectures. Models with different architectures (Falcon, Mamba, etc.) may have different requirements; Mamba, for example, is a state-space model with no conventional KV Cache. Estimates include a 20% margin and are intentionally conservative.
Can Apple Silicon Macs run an LLM in production?
Yes, thanks to unified memory, M2/M3 Ultra Macs can run 70B models. Memory bandwidth remains lower than NVIDIA datacenter GPUs, but the performance-to-cost ratio is excellent for lightweight on-premise deployment or prototyping.