
2026
LLM Platform Reliability - Pretto
Improvement of an unstable AI platform to remove service outages and make the teams' day-to-day reliable.
Diagnosis and refactoring of Pretto's internal LLM platform, which predated my arrival and was experiencing recurring stability issues. Initial finding: overloaded workers were causing regular service degradation. Code analysis revealed unnecessary processing and architectural inefficiencies consuming resources without added value.
Work done:
- Full audit of the existing codebase and root cause identification.
- Targeted refactoring to eliminate redundant processing.
- Improved uptime and overall platform resilience.
Detailed case study
When I joined Pretto, the internal LLM platform was quietly degrading production: workers ran out of RAM, latency doubled with no obvious cause, and unbounded retries cascaded into client timeouts. The platform had been running for months, looked fine on paper, but burned far more resources than it should have for the work it actually did.
Short answer: two structural issues caused most of the incidents. First, the pipeline downloaded each document and re-encoded it as base64 before sending it to the LLM providers, even though Anthropic, OpenAI and Mistral now accept direct URLs. Second, every module instantiated its own third-party clients (OpenAI, Redis, S3), with no pooling or reuse. Fixing those two points cut worker RAM usage and stabilised latency. This article walks through the refactor and lists 8 other common degradation causes to audit on any LLM platform.
Why does an LLM platform quietly degrade in production?
An LLM platform almost never falls over in one go. It degrades in steps, and the diagnosis is tricky for three reasons.
First, LLMs mask the real costs. When a request takes 4 seconds, the instinct is to blame the model. That is often wrong: the wasted time sits before or after the LLM call, in serialisation, downloads or payload shaping.
Second, retries amplify incidents. A poorly bounded retry strategy turns a 5% error spike into a full worker-pool overload. The system thinks it is self-healing while it is actually self-attacking.
Third, workers survive a long time before dying. A memory leak or a badly recycled third-party client only explodes after hours. Periodic restarts mask the issue until traffic spikes.
At Pretto, the audit showed most of the wasted load came from two very specific places.
Cause #1: why avoid base64 when providers accept URLs?
The platform handled a lot of documents: PDFs, scanned proofs, screenshots uploaded by users. For every document, the legacy pipeline did the following:
- Download from object storage
- Read into worker memory
- Encode as base64
- Embed inside the JSON payload sent to the LLM provider
Base64 inflates payload size by roughly 33%, pins worker RAM for the full duration of the LLM call, and adds non-trivial CPU cost on large documents. All of that for a result strictly equivalent to passing a URL.
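The 33% figure follows from base64 mapping every 3 raw bytes to 4 ASCII characters. A minimal sketch to check it on synthetic data (the sizes and the `base64_overhead` helper are illustrative, not taken from the platform):

```python
import base64
import os


def base64_overhead(raw_size: int) -> tuple[int, float]:
    """Return (encoded size in bytes, encoded/raw ratio) for raw_size random bytes."""
    raw = os.urandom(raw_size)  # stand-in for a real document
    encoded = base64.standard_b64encode(raw)
    # While both buffers are alive, the worker holds raw + encoded in memory,
    # i.e. at least ~2.33x the document size.
    return len(encoded), len(encoded) / raw_size


if __name__ == "__main__":
    for size_mb in (1, 3, 10):
        encoded_size, ratio = base64_overhead(size_mb * 1024 * 1024)
        print(f"{size_mb} MB raw -> {encoded_size / 1024 / 1024:.2f} MB encoded (x{ratio:.2f})")
```

The ratio is the same at every size, so the absolute cost grows linearly with document size and with the number of concurrent calls per worker.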
Since 2024-2025, the major providers all accept direct URLs for images and PDFs: Anthropic Claude (URL-based vision), OpenAI (file inputs and image URLs), Mistral on multimodal models. The provider fetches the file itself, with no detour through our worker.
Before: base64 encoding inside the worker
import base64

import requests
from anthropic import Anthropic

client = Anthropic()

def analyze_document(document_url: str, prompt: str) -> str:
    # Pulls the file into worker RAM
    raw = requests.get(document_url, timeout=30).content
    # Base64 encodes it (size x1.33, non-trivial CPU)
    encoded = base64.standard_b64encode(raw).decode("utf-8")
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": encoded,
                    },
                },
                {"type": "text", "text": prompt},
            ],
        }],
    )
    return response.content[0].text
After: direct URL, the worker never touches the file
from anthropic import Anthropic

client = Anthropic()

def analyze_document(document_url: str, prompt: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    # Image input; for a PDF the block type is "document",
                    # with the same URL source.
                    "type": "image",
                    "source": {"type": "url", "url": document_url},
                },
                {"type": "text", "text": prompt},
            ],
        }],
    )
    return response.content[0].text
The worker no longer downloads, encodes or holds the file in RAM during the LLM call. On multi-megabyte documents, the savings are immediate, and latency drops because we removed a full network hop (object storage to worker).
Prerequisite: URLs must be reachable from the outside (short-lived signed URLs, or a provider-readable bucket). That is an acceptable tradeoff, but it has to be confirmed explicitly.
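One way to make that confirmation explicit is a cheap pre-flight check before the URL is handed to the provider. The sketch below is a heuristic only: `looks_provider_reachable` is a hypothetical helper, it does no DNS resolution, and it cannot prove that the provider will actually be able to fetch the file.

```python
import ipaddress
from urllib.parse import urlsplit


def looks_provider_reachable(url: str) -> bool:
    """Heuristic: reject URLs that obviously only resolve inside our network."""
    parts = urlsplit(url)
    host = parts.hostname
    if parts.scheme != "https" or not host:
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None
    if ip is not None:
        # Literal IPs must be publicly routable (rejects 10.x, 127.x, ::1, ...).
        return ip.is_global
    # Single-label hosts ("minio", "localhost") only resolve inside the cluster.
    # Any dotted DNS name is assumed public, which is an assumption to verify.
    return host != "localhost" and "." in host
```

A check like this catches the common mistake of passing an internal object-storage URL; signed-URL expiry still has to be long enough to cover the provider's fetch.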
Cause #2: why centralise third-party clients behind a Factory?
The second problem was subtler. Every module that called a third-party service contained a pattern like client = OpenAI() or redis = Redis(host=...), created inside the function, sometimes on every request. The result:
- No connection pooling: each instance opened its own HTTP pool, with no coordination
- No global rate limiting: impossible to cap concurrent calls to a provider
- No centralised monitoring: metrics were scattered across modules
- GC pressure: clients created on the fly were never reused
The fix is a Factory pattern, applied strictly: one creation point per client type, reusable instances, and a process-wide cache.
Before: ad hoc clients everywhere
# module_a.py
from openai import OpenAI
import redis

def process_invoice(data):
    client = OpenAI()  # brand new HTTP pool every call
    cache = redis.Redis(host="redis", port=6379)  # new TCP connection
    ...

# module_b.py
from openai import OpenAI

def classify_email(text):
    client = OpenAI()  # another pool, still not shared
    ...
After: centralised Factory with reused clients
# clients/factory.py
from functools import lru_cache

import httpx
import redis
from anthropic import Anthropic
from openai import OpenAI

@lru_cache(maxsize=1)
def get_openai() -> OpenAI:
    # Explicit connection pooling + bounded timeouts
    http = httpx.Client(
        limits=httpx.Limits(max_connections=100, max_keepalive_connections=20),
        timeout=httpx.Timeout(30.0, connect=5.0),
    )
    return OpenAI(http_client=http)

@lru_cache(maxsize=1)
def get_anthropic() -> Anthropic:
    return Anthropic(max_retries=2, timeout=30.0)

@lru_cache(maxsize=1)
def get_redis() -> redis.Redis:
    pool = redis.ConnectionPool(
        host="redis", port=6379,
        max_connections=50,
        socket_timeout=2.0,
    )
    return redis.Redis(connection_pool=pool)

# module_a.py
from clients.factory import get_openai, get_redis

def process_invoice(data):
    client = get_openai()  # same instance everywhere in the worker
    cache = get_redis()
    ...
This pattern delivers three wins at once: real HTTP/TCP connection pooling (see httpx connection pooling docs), a single hook for rate limiting and logging, and a measurable drop in GC pressure.
At Pretto, that change stabilised P95 latencies that until then fluctuated based on how many client instances happened to be alive in memory.
8 other common causes of LLM platform slowness
The audit surfaced a series of more classic issues. Use this as a reference table when auditing any existing LLM platform.
| # | Cause | Symptom | Fix |
|---|---|---|---|
| 1 | Unbounded retries | Error spike that never decays | Exponential backoff + per-request retry budget + circuit breaker |
| 2 | Missing timeouts | Workers blocked indefinitely on an LLM call | Explicit timeouts at every layer (HTTP, LLM, DB) |
| 3 | No circuit breaker | One provider down takes the whole chain down | Per-provider circuit breaker, fallback or graceful degradation |
| 4 | Heavy JSON serialisation | CPU at 100% on multi-MB payloads | orjson, streaming, or switch to binary format (protobuf) |
| 5 | Blocking synchronous calls | Idle workers waiting on network | Async architecture (asyncio, httpx.AsyncClient) |
| 6 | Blocking synchronous logs | Latency tracks log throughput | Async logging, batched writes, no logs on the hot path |
| 7 | Duplicate tokenizers in memory | RAM grows linearly with the number of modules | Singleton tokenizer, loaded once per process |
| 8 | No response cache | Same prompts billed many times | Key-value cache on prompt hash + model version |
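For row 1, a bounded retry can be sketched as follows: exponential backoff with full jitter, a hard cap on attempts, and a per-request time budget. `call_with_backoff`, its default values and the `RetryBudgetExceeded` exception are illustrative assumptions, not the platform's actual implementation. A circuit breaker would sit on top of this and is not shown.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryBudgetExceeded(Exception):
    """Raised when a call still fails after its allowed attempts or time budget."""


def call_with_backoff(
    fn: Callable[[], T],
    *,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    budget_seconds: float = 20.0,
    retry_on: tuple[type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    deadline = time.monotonic() + budget_seconds
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on as exc:
            # Full jitter: random delay in [0, min(max_delay, base_delay * 2^(attempt-1))].
            delay = random.uniform(0.0, min(max_delay, base_delay * 2 ** (attempt - 1)))
            out_of_attempts = attempt == max_attempts
            out_of_time = time.monotonic() + delay > deadline
            if out_of_attempts or out_of_time:
                raise RetryBudgetExceeded(f"gave up after {attempt} attempt(s)") from exc
            sleep(delay)
    raise RuntimeError("unreachable: the loop always returns or raises")
```

Errors outside `retry_on` propagate immediately, so a validation error from the provider is never retried.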
On top of that list: poor IO vs CPU separation (heavy PDF parsing on the same event loop as LLM calls, for example), and missing batching on embeddings, where the API is called once per document instead of grouped.
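The embedding batching point can be sketched without tying it to a specific SDK. Here `embed_batch` is a hypothetical callable wrapping whichever provider endpoint is in use, assumed to accept a list of texts and return one vector per text in the same order; the batch size of 64 is arbitrary.

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[list[float]]],
    batch_size: int = 64,
) -> list[list[float]]:
    """Embed every text with one API call per batch instead of one per document."""
    vectors: list[list[float]] = []
    for chunk in batched(texts, batch_size):
        vectors.extend(embed_batch(chunk))
    return vectors
```

With 130 documents and a batch size of 64, this makes 3 calls instead of 130.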
How do you audit an existing LLM platform?
The method I applied at Pretto is reproducible. Four steps.
1. Instrument the workers before touching the code. RAM, CPU, queue depth, per-endpoint latency. Without a measured baseline, every optimisation is speculation. Prometheus plus Grafana is enough.
2. Profile a worker under real load. With py-spy or scalene on a live production worker (or a replay), you see within minutes where CPU and memory go. That is often where you discover 40% of the time goes into serialisation or base64.
3. Trace inbound and outbound payloads. Log the size and shape of payloads sent to providers. A 4 MB payload going to Anthropic is almost always a signal that you are encoding something that should be passed as a URL.
4. List every place where third-party clients are created. A simple grep on OpenAI(, Anthropic(, Redis(, boto3.client( across the repo. Anything that lives outside a dedicated factory module is technical debt to repay.
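Step 4 can also be scripted so that it runs in CI. This is a rough equivalent of the grep, under the assumption that the factory lives in a top-level `clients` directory; `scan_source` and `find_adhoc_clients` are hypothetical names, and the regex will also flag matches inside comments or strings.

```python
import re
from pathlib import Path

# Constructors that should only appear inside the factory module.
CLIENT_CALL = re.compile(
    r"\b(?:Async)?(?:OpenAI|Anthropic)\(|\bRedis\(|\bboto3\.client\("
)


def scan_source(source: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for each line that creates a client."""
    return [
        (lineno, line.strip())
        for lineno, line in enumerate(source.splitlines(), start=1)
        if CLIENT_CALL.search(line)
    ]


def find_adhoc_clients(repo_root: str, factory_dir: str = "clients") -> list[tuple[str, int, str]]:
    """Scan every .py file outside factory_dir for direct client construction."""
    root = Path(repo_root)
    hits: list[tuple[str, int, str]] = []
    for path in sorted(root.rglob("*.py")):
        relative = path.relative_to(root)
        if relative.parts[0] == factory_dir:
            continue  # the factory is the one place allowed to build clients
        text = path.read_text(encoding="utf-8", errors="replace")
        hits.extend((relative.as_posix(), lineno, line) for lineno, line in scan_source(text))
    return hits
```

Each hit is a candidate for migration to the factory; an empty result is a reasonable invariant to enforce once the refactor is done.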
From that snapshot, prioritise by impact-over-effort. At Pretto, removing base64 was the highest-ratio action, followed by centralising clients.
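The payload tracing in step 3 needs very little code. A minimal sketch, assuming the request body is available as a plain dict before it reaches the SDK: the 1 MB threshold and the `check_payload` name are illustrative, and the measured size approximates what the SDK actually sends on the wire.

```python
import json
import logging

logger = logging.getLogger("llm.payload")

# Hypothetical threshold; derive the real one from your measured baseline.
MAX_PAYLOAD_BYTES = 1_000_000


def payload_size_bytes(payload: dict) -> int:
    """Approximate wire size of a JSON payload, in bytes."""
    return len(json.dumps(payload, separators=(",", ":")).encode("utf-8"))


def check_payload(provider: str, payload: dict) -> int:
    """Log the payload size, at WARNING level when it exceeds the threshold."""
    size = payload_size_bytes(payload)
    level = logging.WARNING if size > MAX_PAYLOAD_BYTES else logging.DEBUG
    logger.log(level, "provider=%s payload_bytes=%d", provider, size)
    return size
```

Oversized payloads then show up directly in the logs, alongside the provider they were sent to.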
Conclusion: TL;DR
An LLM platform in production rarely degrades because of the model itself. It degrades because of everything around it: useless encodings, mismanaged third-party clients, unbounded retries, missing timeouts.
Three key takeaways:
- Pass URLs to providers whenever you can. It is free, supported by Anthropic, OpenAI and Mistral, and it lifts a huge load off your workers.
- Centralise every third-party client behind a Factory pattern. One place for pooling, timeouts, retries and monitoring.
- Audit with metrics, not intuitions. Instrument first, refactor second.
For more on production LLM architectures, see also the Pretto batch inference pipeline or the broader AI integration services.
If you run an LLM platform that quietly degrades in production, or you want a focused audit on reliability and AI integration cost for SMBs, reach out through the contact form.
Client testimonial
“Following his internship, we continued working with Pierre as a freelancer while he pursued his studies in parallel. He is hardworking, efficient, precise and reliable. Thank you again Pierre for all the great work, see you very soon :)”