Evaluating an AI Agent in Production: the Semantic Judge
June 16, 2026 · 7 min read · Guides
AI Engineer — UTT 4th year · LLM, RAG & GDPR compliance specialist · 15+ client projects
Deploying an AI agent in production without an automated evaluation system is flying blind. Classic metrics (latency, HTTP error rates) say nothing about response quality. But calling a frontier LLM (GPT-5.5, Claude Opus) to judge every production interaction quickly becomes prohibitively expensive.
Direct answer: a semantic judge is a model fine-tuned specifically to evaluate the quality of an agent's traces. LangChain and Fireworks have demonstrated that a fine-tuned Qwen-3.5-35B, trained on a few hundred annotated examples, matches or exceeds frontier models on this task at 10 to 100 times lower cost. Here is the method.
Why classic metrics are not enough
An AI agent produces natural language responses, executes tools, and maintains multi-turn context. Infrastructure metrics (uptime, p95 latency, tokens used) do not capture whether the agent actually answered the user's real question.
The most common quality problems in production agents:
- User-perceived errors: the agent answers off-target, confuses two entities, or forgets a constraint mentioned earlier in the conversation.
- Looping corrections: the user rephrases the same request multiple times, implicitly signaling the previous response was insufficient.
- Unjustified refusals: the agent refuses to perform a legitimate action without a clear reason.
These signals are present in conversation traces, but extracting them manually at scale is not viable. This is the problem an automatic semantic judge solves.
What is a "perceived error"?
LangSmith, LangChain's agent tracing platform, processes billions of tokens per day from production traces. To mine quality signals from every trace, the team defined a central concept: perceived error.
A perceived error occurs when the user thinks the assistant made a mistake or produced something that needed correction. It is neither a judgment on the objective correctness of the response nor a measure of overall user satisfaction. Example: a factually correct answer can still frustrate the user; that is dissatisfaction, not a perceived error, because the user does not believe the agent got something wrong.
The judge's output format is simple and extractable:
```json
{"perceived_error": true, "reason": "The user corrects the meeting date the assistant used."}
```

This binary + justification format is sufficient to sort traces, build improvement datasets, and detect regressions.
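Because the format is this small, it can be validated strictly before a verdict enters any dashboard or dataset. The following is a minimal sketch, not part of the LangChain/Fireworks work: `Verdict`, `parse_verdict`, and `perceived_error_rate` are hypothetical names introduced here for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    perceived_error: bool
    reason: str


def parse_verdict(raw: str) -> Verdict:
    """Parse one judge output; raise ValueError if it does not match the format."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    flag = data.get("perceived_error")
    reason = data.get("reason")
    # Reject truthy strings such as "true": the flag must be a real boolean.
    if not isinstance(flag, bool) or not isinstance(reason, str):
        raise ValueError("expected a boolean 'perceived_error' and a string 'reason'")
    return Verdict(flag, reason)


def perceived_error_rate(verdicts: list[Verdict]) -> float:
    """Share of traces flagged with a perceived error (0.0 for an empty list)."""
    if not verdicts:
        return 0.0
    return sum(v.perceived_error for v in verdicts) / len(verdicts)
```

Rejecting malformed outputs instead of coercing them keeps a badly formatted judge response from silently shifting the error rate.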
How LangChain and Fireworks built a 100x cheaper judge
The dataset
The team used two internal trace datasets:
- chat-langchain: a Q&A agent over LangChain documentation, multi-turn technical conversations.
- Fleet: a no-code agent tool for varied tasks (writing, research, etc.).
Total: 885 + 911 traces, annotated via a panel of frontier models with human reconciliation for disagreements. Result: approximately 24% of traces with a perceived error on chat-langchain, 18% on Fleet.
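The source does not detail how panel votes were combined, so the sketch below makes an assumption: unanimous votes are accepted as the label, and any disagreement is routed to a human reviewer. `reconcile` and the judge names are hypothetical.

```python
from collections import Counter
from typing import Optional


def reconcile(panel_labels: dict[str, bool]) -> tuple[Optional[bool], bool]:
    """Return (label, needs_human) for one trace.

    panel_labels maps a judge name to its perceived_error vote.
    Assumption: unanimous votes are accepted; any disagreement goes to a human.
    """
    if not panel_labels:
        raise ValueError("at least one panel vote is required")
    votes = Counter(panel_labels.values())
    if len(votes) == 1:
        # Every judge agreed, so the single distinct vote is the label.
        return next(iter(votes)), False
    # Disagreement: leave the label empty until a human decides.
    return None, True


# Usage with three hypothetical panel members:
label, needs_human = reconcile({"judge_a": True, "judge_b": False, "judge_c": True})
```

A stricter or looser rule (for example, majority vote with human review only on ties) is equally possible; the point is that only the divergent cases consume human time.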
The model and fine-tuning
After testing several models, the team chose Qwen-3.5-35B as the base. Smaller models had too many errors on complex multi-turn traces. Training used LoRA SFT on Fireworks infrastructure, trained only on chat-langchain data.
| Model | chat-langchain accuracy | Fleet accuracy |
|---|---|---|
| Qwen-3.5-35B (base) | 90.5% | 83.2% |
| Qwen SFT chat-langchain | 96.1% | 90.8% |
| Qwen SFT Fleet | 92.7% | 91.3% |
| Claude Opus | 91.6% | 90.2% |
| GPT-5.5 | 98.9% | 89.1% |
The key result: the model fine-tuned on chat-langchain outperforms all frontier models on Fleet without having been trained on a single Fleet trace. The judge generalizes well across domains.
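The accuracy figures in the table compare judge verdicts against reference labels. Since roughly 24% of chat-langchain traces and 18% of Fleet traces carry a perceived error, a judge that always answers "no error" would already reach about 76% and 82% accuracy, so it can help to report precision and recall alongside accuracy when evaluating your own judge. A minimal sketch, with `judge_metrics` as a hypothetical helper (the source reports accuracy only):

```python
def judge_metrics(predicted: list[bool], reference: list[bool]) -> dict[str, float]:
    """Accuracy, precision, and recall of judge verdicts against reference labels."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("predicted and reference must be non-empty and the same length")
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)        # error flagged, error present
    fp = sum(1 for p, r in pairs if p and not r)    # error flagged, none present
    fn = sum(1 for p, r in pairs if not p and r)    # error missed
    correct = sum(1 for p, r in pairs if p == r)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```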
The cost
A fine-tuned judge is an open-source model served on dedicated inference, which makes each evaluation 10 to 100 times cheaper than a Claude Opus or GPT-5.5 call, depending on trace volume. At billions of tokens per day, the cost difference is decisive.
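To see why the ratio matters at this volume, the arithmetic can be sketched as follows. Every number below is a hypothetical placeholder chosen only to illustrate the calculation; none comes from the source or from any provider's price list.

```python
def daily_judging_cost(tokens_per_day: int, usd_per_million_tokens: float) -> float:
    """Daily cost in USD of sending tokens_per_day through a judge."""
    return tokens_per_day / 1_000_000 * usd_per_million_tokens


# Hypothetical figures, for illustration only.
TOKENS_PER_DAY = 2_000_000_000   # "billions of tokens per day"
FRONTIER_USD_PER_M = 5.00        # placeholder frontier-model price
JUDGE_USD_PER_M = 0.25           # placeholder dedicated-inference price

frontier_cost = daily_judging_cost(TOKENS_PER_DAY, FRONTIER_USD_PER_M)
judge_cost = daily_judging_cost(TOKENS_PER_DAY, JUDGE_USD_PER_M)
cost_ratio = frontier_cost / judge_cost
```

With these placeholder prices the fine-tuned judge comes out 20 times cheaper, inside the 10 to 100 range quoted above; substitute your own volume and prices to get a real estimate.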
How to implement your own trace judge
Step 1: annotate a minimal dataset
Start with 200 to 500 production traces from your agent. Use a panel of frontier models to generate initial labels, with a human for divergent cases.
```python
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def annotate_trace(trace: str) -> dict:
    """Ask a frontier model whether the trace contains a user-perceived error."""
    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"""Analyze this agent-user conversation trace.
Determine if there is a user-perceived error.
Trace:
{trace}
Respond in JSON: {{"perceived_error": bool, "reason": "..."}}""",
        }],
    )
    # Assumes the model returns bare JSON; json.loads raises ValueError otherwise.
    return json.loads(response.content[0].text)
```

Step 2: fine-tune an open-source model
With a dataset of 500+ annotated examples, a LoRA SFT on Qwen-3.5-35B or an equivalent model is sufficient to reach frontier-level performance. Platforms like Fireworks or Together AI, or a GPU rented from OVHcloud, let you launch the training run without dedicated infrastructure.
Critical points for data preparation:
- Include only Human/AI messages, not tool calls (they add noise without improving perception signals).
- Keep full conversations without truncation: perception judgment requires the complete history.
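The two preparation rules above can be sketched as a small conversion function. This is an illustration under stated assumptions: traces are lists of `{"role", "content"}` dicts with string content, the role names are `user` and `assistant`, and the training platform accepts chat-style records with a `messages` key. Check your platform's dataset documentation before relying on that shape; `to_sft_example` is a hypothetical helper.

```python
import json
from typing import Any

KEPT_ROLES = {"user", "assistant"}  # assumption: role names used in your traces


def to_sft_example(trace: list[dict[str, Any]], label: dict[str, Any]) -> dict[str, Any]:
    """Build one chat-style SFT record from a full trace and its annotated label."""
    # Keep every Human/AI turn, in order and untruncated; drop tool calls.
    turns = [m for m in trace if m.get("role") in KEPT_ROLES and m.get("content")]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in turns)
    return {
        "messages": [
            {"role": "user", "content": f"Trace:\n{transcript}"},
            # The target completion is the judge's JSON verdict.
            {"role": "assistant", "content": json.dumps(label)},
        ]
    }
```

Writing one such record per line with `json.dumps` gives a JSONL file, the format most fine-tuning services expect.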
Step 3: integrate the judge into the production pipeline
The judge runs after each complete trace, asynchronously, so it adds no latency to the agent:
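```python
import asyncio
from typing import Any

# Keep strong references so pending evaluation tasks are not garbage-collected.
_background_tasks: set[asyncio.Task] = set()


async def evaluate_trace_async(trace: list[dict[str, Any]]) -> dict:
    # Call to the fine-tuned judge via dedicated endpoint
    # Returns {"perceived_error": bool, "reason": str}
    ...


async def agent_pipeline(user_message: str, history: list) -> str:
    # run_agent is your existing agent entry point (not defined here)
    response = await run_agent(user_message, history)
    # Async evaluation, non-blocking
    task = asyncio.create_task(
        evaluate_trace_async(history + [{"role": "assistant", "content": response}])
    )
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return response
```

Implications for businesses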
An evaluation system of this kind changes how you manage an agent in production.
Regression detection. Every system prompt update or model change can be compared against the baseline via the perceived error rate. A regression becomes visible before users start complaining.
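One way to make "compared against the baseline" concrete is a one-sided two-proportion z-test on the perceived error rate. This is a sketch of a standard statistical check, not something the source describes; `error_rate_regression` is a hypothetical name, and the default threshold of 1.645 corresponds to a one-sided 5% significance level.

```python
import math


def error_rate_regression(
    base_errors: int,
    base_total: int,
    new_errors: int,
    new_total: int,
    z_threshold: float = 1.645,
) -> tuple[float, bool]:
    """Return (z, regressed) from a one-sided two-proportion z-test.

    regressed is True when the new perceived error rate is significantly
    higher than the baseline rate.
    """
    if base_total <= 0 or new_total <= 0:
        raise ValueError("both samples must contain at least one trace")
    p_base = base_errors / base_total
    p_new = new_errors / new_total
    # Pooled rate under the null hypothesis that both rates are equal.
    pooled = (base_errors + new_errors) / (base_total + new_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / new_total))
    if se == 0.0:
        # All traces agree (all errors or none): no measurable difference.
        return 0.0, False
    z = (p_new - p_base) / se
    return z, z > z_threshold
```

For example, going from 240 flagged traces out of 1,000 to 300 out of 1,000 after a prompt change is flagged as a regression, while 240 to 245 is not.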
Continuous improvement dataset. Traces with perceived errors automatically form a dataset for iterative improvement: prompt engineering, fine-tuning, or agent logic modifications.
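Collecting those traces can be as simple as filtering on the judge's verdict. The sketch below assumes each evaluated item is a dict with a `trace` and a `verdict` in the judge's output format; that structure and the `flagged_records` name are assumptions made for illustration.

```python
import json
from typing import Any, Iterable


def flagged_records(evaluated: Iterable[dict[str, Any]]) -> list[str]:
    """Return one JSON line per trace whose verdict reports a perceived error."""
    lines = []
    for item in evaluated:
        verdict = item["verdict"]
        if verdict.get("perceived_error") is True:
            record = {"trace": item["trace"], "reason": verdict["reason"]}
            lines.append(json.dumps(record, ensure_ascii=False))
    return lines
```

Joining the returned lines with newlines yields a JSONL file that can be reviewed by hand or reused as input for prompt changes and further fine-tuning.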
Compliance and audit. Organizations subject to GDPR can use these logs to demonstrate that the system monitors quality and addresses errors systematically.
The semantic judge is the missing piece between "the agent works in development" and "the agent is reliable in production."
TL;DR
A fine-tuned semantic judge detects perceived errors in AI agent traces at 10 to 100 times lower cost than a frontier LLM. The LangChain + Fireworks method: annotate 500 traces, fine-tune Qwen-3.5-35B with LoRA, integrate asynchronously post-trace. The model fine-tuned on one domain generalizes well to others. It is the missing component for moving from an agent that "works" to an agent whose quality can be systematically measured and improved.
Deploying an AI agent and looking to build an evaluation system? Describe your use case.
About the author
Pierre Kasparian
4th-year engineering student at UTT (University of Technology of Troyes) and AI integration freelancer. He deploys LLMs, RAG pipelines, and AI agents for French and European companies, with strong expertise in GDPR compliance and European hosting. 15+ client projects, including Pretto and LiveSession.