Name: Pierre Kasparian - Intégration IA freelance

An LLM that responds in 800 ms can cost several times less to operate than one that takes 4 seconds at the same volume, because each request occupies the hardware for about a fifth of the time. Yet most teams integrating language models treat inference as a black box: call the API, wait, display the result.

Direct answer: LLM inference splits into two phases with opposite physical constraints: prefill (compute-bound) and decode (memory-bandwidth-bound). Understanding this distinction lets you choose the right optimizations, estimate real costs, and decide when self-hosting becomes economically and legally justified, especially for organizations under GDPR obligations.

Why inference engineering changes the equations

For years, running AI in production meant calling an API from a US-based provider. Three converging forces are changing this picture.

Open-source models have caught up with proprietary ones. Hugging Face now hosts over two million open models, roughly 25 times what existed five years ago. Models like DeepSeek V3, Mistral Large, or Qwen 2.5 deliver near-frontier performance on most enterprise tasks. Self-hosting is no longer a quality compromise.

GDPR creates a real constraint. Article 44 of the GDPR prohibits personal data transfers outside the EU without adequate guarantees. The US CLOUD Act of 2018 allows American authorities to access data held by US companies, even on their European infrastructure. Sending customer documents, emails, or contract data to an API hosted outside the EU creates a real compliance risk that a DPA alone rarely covers.

Self-hosting has become economically viable. At sufficient volume, costs drop by around 80% compared to public APIs. Cursor built Composer 2.0 on an open model with autocomplete latency below what cloud APIs deliver, precisely because dedicated infrastructure can be optimized for a specific traffic profile.

How LLM inference works: prefill vs decode

When a request arrives at an LLM, two operations run in sequence on the same GPU, each with radically different physical constraints.

The prefill phase processes the entire input prompt in parallel. The GPU runs all input tokens simultaneously through every layer of the model. It produces two outputs: the first response token and a KV cache, the attention keys and values of every prompt token, stored so that later tokens do not have to recompute them. This phase is compute-bound: more raw FLOPS means faster prefill. The key metric is TTFT (Time To First Token).

The decode phase generates each subsequent token one at a time, in sequence. Each token requires a full model forward pass, reading weights from memory. This phase is memory-bandwidth-bound: the GPU spends most of its cycles moving data, not computing. The key metric is TPS (Tokens Per Second).

This asymmetry is the foundation of inference engineering: a technique that accelerates prefill does not necessarily affect decode, and vice versa.
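To make the asymmetry concrete, here is a back-of-the-envelope sketch in Python. It assumes the common approximation of about 2 floating-point operations per parameter per token for a forward pass, and that decode reads every weight once per generated token. The `Hardware` class, the function names, and all figures are hypothetical, chosen only for illustration; real systems land above these lower bounds.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    peak_flops: float     # floating-point operations per second (assumed)
    mem_bandwidth: float  # bytes per second (assumed)


def prefill_seconds(n_params: float, prompt_tokens: int, hw: Hardware) -> float:
    """Compute-bound lower bound: ~2 FLOPs per parameter per prompt token."""
    return 2.0 * n_params * prompt_tokens / hw.peak_flops


def decode_seconds_per_token(n_params: float, bytes_per_param: float,
                             hw: Hardware) -> float:
    """Bandwidth-bound lower bound: all weights are read once per token."""
    return n_params * bytes_per_param / hw.mem_bandwidth


# Hypothetical accelerator and model size, not measurements.
hw = Hardware(peak_flops=1.0e15, mem_bandwidth=2.0e12)
ttft = prefill_seconds(n_params=8e9, prompt_tokens=2000, hw=hw)
per_token = decode_seconds_per_token(n_params=8e9, bytes_per_param=2, hw=hw)
print(f"prefill ~ {ttft * 1000:.0f} ms, decode ~ {per_token * 1000:.0f} ms/token")
```

With these made-up numbers, a 2,000-token prompt needs about 32 ms of compute, while each generated token needs about 8 ms of memory traffic. Doubling `peak_flops` halves the first figure and leaves the second unchanged, which is the whole point of treating the two phases separately.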

The 6 core optimization techniques

Batching

Batching processes multiple requests simultaneously on the same GPU, token by token. Throughput increases significantly because the GPU runs at full utilization. The trade-off: higher per-user latency. Public APIs maximize batching for economy of scale; a dedicated deployment can tune the ratio to match specific needs (latency vs throughput).
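The throughput/latency trade-off can be sketched with a deliberately simple cost model: each decode step pays a fixed cost to read the weights, plus a small per-request cost. Both constants below are hypothetical, and real engines behave less linearly, so treat this as an illustration of the shape of the curve rather than a predictor.

```python
def decode_step_seconds(batch_size: int,
                        weight_read_s: float = 0.008,
                        per_request_s: float = 0.0005) -> float:
    """Time for one decode step: fixed weight read plus per-request work."""
    return weight_read_s + batch_size * per_request_s


def throughput_tps(batch_size: int) -> float:
    """Tokens per second across all requests in the batch."""
    return batch_size / decode_step_seconds(batch_size)


for batch_size in (1, 8, 32, 128):
    step_ms = decode_step_seconds(batch_size) * 1000
    print(f"batch={batch_size:>3}  per-token latency={step_ms:5.1f} ms  "
          f"throughput={throughput_tps(batch_size):7.0f} tok/s")
```

In this model the weight read is shared by every request in the batch, so total throughput climbs quickly, while each individual user waits a little longer for every token.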

Prefix caching

When two requests share a common prefix (a long system prompt repeated across thousands of users), the inference engine computes that prefix once and reuses the KV cache. This is why providers charge less for cached tokens. In practice: put shared content early in the prompt, user-specific variables at the end, to maximize cache hits.
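The prompt-ordering advice can be illustrated with a toy cache. This is a minimal sketch, not how any particular engine works: production engines typically match on blocks of tokens rather than on a hash of the whole prefix string, and `PrefixCache`, `fake_prefill`, and `SYSTEM_PROMPT` are hypothetical names.

```python
import hashlib


class PrefixCache:
    """Toy cache keyed by a hash of the shared prefix text."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, prefix, compute):
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = compute(prefix)
        return self._store[key]


def fake_prefill(text):
    # Stand-in for the expensive prefill; returns a placeholder "KV cache".
    return [len(word) for word in text.split()]


SYSTEM_PROMPT = "You are a contract assistant. Answer in formal English."


def build_prompt(user_question):
    # Shared content first, user-specific content last.
    return SYSTEM_PROMPT, user_question


cache = PrefixCache()
for question in ["Summarize clause 4.", "List the termination terms."]:
    prefix, suffix = build_prompt(question)
    kv_prefix = cache.get_or_compute(prefix, fake_prefill)
    # Only `suffix` would still need a real prefill at this point.

print(f"hits={cache.hits} misses={cache.misses}")
```

If the user question were placed before the system prompt, every request would start with a different prefix and the cache would never hit.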

Quantization

Quantization reduces the numerical precision of model weights, for example from FP16 to INT8 or INT4. Weights occupy less memory, load faster, and arithmetic runs quicker on hardware that supports the lower-precision format. Typical gain: 30 to 50% better performance. The trade-off: slight quality degradation in certain layers. In practice, attention layers are often kept at full precision (errors accumulate there), and the rest is quantized.
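As an illustration, the sketch below applies symmetric per-tensor 8-bit quantization to a handful of weights and estimates the memory footprint at different precisions. It is one simple scheme among many (production toolchains usually quantize per channel or per group), and the weight values and the 7-billion-parameter size are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers."""
    return [v * scale for v in q]


def weight_memory_gb(n_params, bits):
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9


weights = [0.12, -0.5, 0.33, 0.01, -0.27]  # illustrative values only
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"max rounding error: {max_error:.4f}")

for bits in (16, 8, 4):
    print(f"{bits}-bit weights, 7e9 params: {weight_memory_gb(7e9, bits):.1f} GB")
```

The rounding error per weight is bounded by half the scale, which is why a single large outlier in a tensor hurts: it inflates the scale for every other weight.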

Speculative decoding

Generating tokens one at a time is expensive, because each one needs its own forward pass through the main model. Verifying several candidate tokens at once is much cheaper per token, since the main model can check them all in a single pass. Speculative decoding exploits this asymmetry: a small draft model predicts several tokens ahead, and the main model verifies all of them in a single forward pass. Result: multiple tokens emitted per pass instead of one whenever the draft guesses correctly. The gain is on TPS, not TTFT. The technique is typically disabled at high batch loads, when the GPU is already saturated.
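The draft-then-verify loop can be sketched with two toy "models". This is the greedy variant, where a drafted token is accepted only if it equals what the main model would have produced; sampling-based implementations use a probabilistic acceptance rule instead. `draft_next`, `target_next`, and the sentence are hypothetical stand-ins.

```python
SENTENCE = "the contract ends on the last day of the month".split()


def target_next(context):
    """Toy main model: always continues the reference sentence."""
    return SENTENCE[len(context)] if len(context) < len(SENTENCE) else "<eos>"


def draft_next(context):
    """Toy draft model: agrees with the main model except at one position."""
    return "week" if len(context) == 6 else target_next(context)


def speculative_step(context, draft_next, target_next, k=4):
    """One round: draft k tokens, keep the prefix the main model agrees with."""
    proposed = []
    for _ in range(k):
        proposed.append(draft_next(context + proposed))

    accepted = []
    for token in proposed:
        # In a real engine, these k checks come from ONE batched forward pass.
        expected = target_next(context + accepted)
        if token != expected:
            accepted.append(expected)  # main model's correction ends the round
            return accepted
        accepted.append(token)
    accepted.append(target_next(context + accepted))  # bonus token
    return accepted


output, rounds = [], 0
while "<eos>" not in output:
    output += speculative_step(output, draft_next, target_next, k=4)
    rounds += 1
output = output[: output.index("<eos>")]

print(" ".join(output), "| main-model passes:", rounds)
```

The output is identical to what the main model would have generated alone, but here the ten tokens need three verification rounds instead of one pass per token. The benefit depends entirely on how often the draft model guesses right.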

Parallelism

For large models that do not fit on a single GPU, two strategies dominate: tensor parallelism (each layer split across multiple GPUs, suited for dense models, requires high-bandwidth interconnects) and expert parallelism (for MoE models, each expert on a different GPU). Most production deployments combine both.
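The idea behind tensor parallelism can be shown on a single matrix-vector product: split the weight matrix by columns, let each shard compute its slice of the output, then concatenate the slices. The sketch below runs the shards in a plain loop on one machine; in a real deployment each shard would live on its own GPU and the final concatenation would be a collective operation over the interconnect. Function names are hypothetical, and the column count is assumed to divide evenly by the number of shards.

```python
def matvec(x, weight):
    """x: length-n vector; weight: n x m matrix as a list of rows."""
    n_cols = len(weight[0])
    return [sum(x[i] * weight[i][j] for i in range(len(x)))
            for j in range(n_cols)]


def split_columns(weight, n_shards):
    """Split the matrix into n_shards column blocks of equal width."""
    width = len(weight[0]) // n_shards
    return [[row[s * width:(s + 1) * width] for row in weight]
            for s in range(n_shards)]


def tensor_parallel_matvec(x, weight, n_shards):
    shards = split_columns(weight, n_shards)
    partials = [matvec(x, shard) for shard in shards]  # one shard per "GPU"
    return [value for part in partials for value in part]  # gather the slices


x = [1, 2]
weight = [[1, 2, 3, 4],
          [5, 6, 7, 8]]
print(matvec(x, weight))
print(tensor_parallel_matvec(x, weight, n_shards=2))
```

Both calls return the same vector. Each shard holds only half of the weights, which is what lets a layer that is too large for one GPU fit across two, at the price of a communication step after every split layer.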

Disaggregation

Disaggregation takes the prefill/decode split to its logical conclusion: running the two phases on separate machines. The prefill engine processes prompts and sends the KV cache over the network to the decode engine. Each set of machines can be sized and optimized independently. This is the most architecturally complex step, reserved for high-scale deployments.
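What disaggregation costs in network terms can be estimated from the size of the KV cache. The sketch below uses the standard sizing (one key and one value tensor per layer, per KV head, per token) with hypothetical model dimensions and a hypothetical 100 Gbit/s link; it ignores protocol overhead and any compression.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Size of the KV cache for one request.

    The leading 2 accounts for one key tensor and one value tensor per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


def transfer_seconds(n_bytes, link_gbit_per_s):
    """Ideal transfer time over a link of the given bandwidth."""
    return n_bytes * 8 / (link_gbit_per_s * 1e9)


# Hypothetical model shape and prompt length, for illustration only.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=4000)
delay = transfer_seconds(size, link_gbit_per_s=100)
print(f"KV cache: {size / 1e6:.0f} MB, transfer: {delay * 1000:.0f} ms")
```

With these assumptions, a 4,000-token prompt produces roughly half a gigabyte of cache and adds tens of milliseconds to TTFT, which helps explain why this architecture pays off mainly at high scale and with fast interconnects.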

When to invest in self-hosting

The question is not "is inference engineering better?" but "at what product stage is the investment justified?"

Three signals indicate the equation has shifted:

Signal	Description
API cost	The API bill has become a significant line in the P&L
Latency	Public APIs cannot meet the target SLA
Reliability	Provider SLAs are no longer sufficient for customer commitments

For GDPR-bound organizations, a fourth signal applies: the nature of the data processed. If LLM requests include personal data (names, emails, contract data), self-hosting on EU infrastructure is the most robust architecture for avoiding transfers outside the EU and CLOUD Act exposure.
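The API cost signal can be turned into a rough break-even estimate. The sketch below compares a per-token API price with a flat monthly cost for rented GPUs. Every price is a placeholder to replace with real quotes, and the model assumes the rented GPUs can actually serve the break-even volume, which has to be checked separately against measured throughput.

```python
def api_cost_per_month(tokens_per_month, price_per_million_tokens):
    """Monthly API bill for a given token volume."""
    return tokens_per_month / 1e6 * price_per_million_tokens


def self_host_cost_per_month(gpu_hourly_rate, n_gpus, hours=730,
                             fixed_ops_cost=0.0):
    """Flat monthly cost: GPU rental plus engineering/operations overhead."""
    return gpu_hourly_rate * n_gpus * hours + fixed_ops_cost


def break_even_tokens(price_per_million_tokens, gpu_hourly_rate, n_gpus,
                      hours=730, fixed_ops_cost=0.0):
    """Monthly token volume above which self-hosting is cheaper than the API."""
    monthly = self_host_cost_per_month(gpu_hourly_rate, n_gpus, hours,
                                       fixed_ops_cost)
    return monthly / price_per_million_tokens * 1e6


# Placeholder prices, not quotes from any provider.
threshold = break_even_tokens(price_per_million_tokens=2.0,
                              gpu_hourly_rate=3.0, n_gpus=2,
                              fixed_ops_cost=2000.0)
print(f"break-even at about {threshold / 1e9:.2f} billion tokens per month")
```

Below the threshold the API remains cheaper. Above it, the flat cost of self-hosting is spread over more tokens and the gap widens with volume.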

Concrete options:

Mistral AI (La Plateforme): cloud API, French company, EU-hosted infrastructure, outside CLOUD Act reach. A practical intermediate step before full self-hosting.
Open model + OVHcloud/Scaleway: Llama 3, Mistral 7B, Qwen 2.5 on a GPU rented in France or Germany. Full control, cost scales with volume.
Local model: for development teams or edge deployments, Ollama + Gemma 4 runs on a good laptop with no data leaving the machine.

TL;DR

LLM inference splits into prefill (compute-bound) and decode (memory-bandwidth-bound). This asymmetry explains why each optimization technique targets a specific phase. Early in a product, cloud APIs are the right default. When costs, latency, or GDPR requirements justify it, self-hosting on EU infrastructure with an open-source model is the next step: full data control, predictable pricing, and outside CLOUD Act reach.

Evaluating an inference architecture for your use cases? Let's discuss your project's constraints.

LLM Inference Engineering: Optimize Latency and Costs