Name: Pierre Kasparian - Intégration IA freelance
Rating: 5

Question 1

What is the difference between FP16, INT8, and Q4_K_M?

Accepted Answer

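FP16 uses 2 bytes per parameter and is the usual full-quality baseline for inference. INT8 reduces this to 1 byte with a slight quality loss. Q4_K_M uses nominally 4 bits per parameter (0.5 bytes; real files run slightly larger, because the format stores per-block scales and keeps some tensors at higher precision). It is the standard trade-off from the llama.cpp community, with barely perceptible quality loss for most tasks.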
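As a rough illustration of the byte counts above, the weight footprint can be estimated by multiplying the parameter count by the bytes per parameter. This is a minimal sketch, not the simulator's actual code: the function name `weight_memory_gb` and the table `BYTES_PER_PARAM` are hypothetical, and Q4_K_M is taken at its nominal 0.5 bytes.

```python
# Approximate bytes per parameter for the formats discussed above.
# Q4_K_M is taken at its nominal 4 bits (assumption: metadata overhead ignored).
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "Q4_K_M": 0.5}


def weight_memory_gb(n_params_billion: float, fmt: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    """
    return n_params_billion * BYTES_PER_PARAM[fmt]


if __name__ == "__main__":
    # Illustrative 7B-parameter model in each format.
    for fmt in BYTES_PER_PARAM:
        print(f"7B model in {fmt}: about {weight_memory_gb(7, fmt):.1f} GB of weights")
```

This covers the weights only; the KV Cache and runtime buffers come on top.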

Question 2

Can the KV Cache be offloaded to CPU RAM?

Accepted Answer

Yes. Some frameworks (llama.cpp, ExLlamaV2) allow offloading part of the KV Cache to system RAM, but at the cost of significantly higher latency: the cache is read at every generated token, and system RAM is far slower to reach than VRAM. For smooth inference in production, the entire KV Cache should fit in VRAM.
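To judge whether the KV Cache will fit in VRAM, its size can be approximated from the model shape and the context length. The sketch below uses the common formula for transformer attention (one key and one value tensor per layer); the function name `kv_cache_gb` is hypothetical, the example shape (32 layers, 8 KV heads, head dimension 128) is an illustrative Llama-3-8B-like configuration, and an FP16 cache (2 bytes per value) is assumed.

```python
def kv_cache_gb(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    batch_size: int = 1,
    bytes_per_value: float = 2.0,  # assumption: FP16 cache
) -> float:
    """Approximate KV Cache size in GB (1 GB = 1e9 bytes)."""
    # Leading factor 2: one key tensor plus one value tensor per layer.
    total_bytes = (
        2 * n_layers * n_kv_heads * head_dim
        * context_len * batch_size * bytes_per_value
    )
    return total_bytes / 1e9


if __name__ == "__main__":
    # Illustrative shape, 8192-token context, single sequence.
    size = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)
    print(f"KV Cache: about {size:.2f} GB")
```

The cache grows linearly with both context length and batch size, so long contexts or many concurrent users can make it rival the weights themselves.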

Question 3

Are these estimates accurate for all models?

Accepted Answer

This simulator assumes standard Llama/Mistral/Qwen-type transformer architectures. Models with other architectures (Falcon, Mamba, etc.) may have different requirements; Mamba, for example, is a state-space model and does not keep a conventional KV Cache. Estimates include a 20% margin and are intentionally conservative.
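One plausible way to apply such a margin is to add weights and KV Cache, then scale the sum by 1.2. This is only a sketch under that assumption; the simulator's exact formula is not given here, and the names `estimate_vram_gb` and `fits_in_vram` are hypothetical.

```python
def estimate_vram_gb(weights_gb: float, kv_cache_gb: float, margin: float = 0.20) -> float:
    """Weights plus KV Cache, inflated by a safety margin (assumed 20%)."""
    return (weights_gb + kv_cache_gb) * (1.0 + margin)


def fits_in_vram(vram_gb: float, weights_gb: float, kv_cache_gb: float) -> bool:
    """True if the conservative estimate fits in the available VRAM."""
    return estimate_vram_gb(weights_gb, kv_cache_gb) <= vram_gb


if __name__ == "__main__":
    # Illustrative inputs: 3.5 GB of weights and a 1.0 GB KV Cache.
    need = estimate_vram_gb(weights_gb=3.5, kv_cache_gb=1.0)
    print(f"Estimated requirement: about {need:.1f} GB")
    print("Fits in 8 GB:", fits_in_vram(8.0, 3.5, 1.0))
```

The margin is meant to absorb activation buffers, framework overhead, and memory fragmentation, which vary by runtime.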

Question 4

Can Apple Silicon Macs run an LLM in production?

Accepted Answer

Yes. Thanks to unified memory, M2/M3 Ultra Macs can run 70B models. Memory bandwidth remains lower than that of NVIDIA datacenter GPUs, but the performance-to-cost ratio is excellent for lightweight on-premise deployment or prototyping.
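The bandwidth remark can be made concrete with a common rule of thumb: when generating one token at a time, every weight is read once per token, so memory bandwidth divided by model size gives a rough upper bound on tokens per second. The sketch below assumes that simplification (it ignores the KV Cache, compute time, and batching); the function name `max_decode_tokens_per_s` is hypothetical, and the bandwidth figures (800 GB/s for an M2 Ultra, 3350 GB/s for an H100 SXM) are vendor headline numbers that should be checked against current spec sheets.

```python
def max_decode_tokens_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough ceiling on single-stream decode speed.

    Assumes each generated token requires reading all weights once,
    so throughput cannot exceed bandwidth / model size.
    """
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    model_gb = 35.0  # illustrative: 70B parameters at a nominal 0.5 bytes each
    for name, bandwidth in [("M2 Ultra", 800.0), ("H100 SXM", 3350.0)]:
        ceiling = max_decode_tokens_per_s(bandwidth, model_gb)
        print(f"{name}: at most about {ceiling:.0f} tokens/s")
```

Real throughput is lower than this ceiling, but the ratio between two devices gives a reasonable first impression of their relative decode speed.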

VRAM & Hardware Simulator for On-Premise LLMs

How to calculate VRAM for an LLM

Which GPU for Llama 3 and Mistral

Hosting an LLM on-premise in your company

Frequently asked questions