AI Agent Harness: How to Make It Self-Repairing
June 12, 2026 · 7 min read · Guides
AI Engineer — UTT 4th year · LLM, RAG & GDPR compliance specialist · 15+ client projects
An AI agent fails in production. You open the traces, find which step broke, form a hypothesis, apply a patch, redeploy. Two weeks later, a model update breaks something different, and the loop starts over.
Direct answer: the problem is not the observability tooling, which is usually fine. The problem is that traces, tests, and debugging are separate silos. None of them closes the loop between "something broke in production" and "this thing won't break again." Opik connects these layers into a single self-repairing loop.
The Core Problem: Observability Without Repair
Most production AI agent stacks give you visibility into what happened: latency, tokens consumed, tool calls, LangChain or CrewAI traces. That is useful. But it does not say why something broke, and it does nothing to prevent it from breaking again.
The classic cycle when an agent fails:
- Alert or ticket
- Manual dive into traces
- Hypothesis about the cause
- Patch
- Hope the regression does not return
This cycle repeats with every model upgrade, every new tool, every prompt change. The larger the harness grows, the heavier the maintenance burden.
The 4 Layers of Opik
Opik is an open-source platform (19,000+ GitHub stars, self-hostable with 3 commands) that connects observability, automated diagnosis, testing, and sandboxing into one loop. It supports LangGraph, CrewAI, and around fifty other frameworks.
Layer 1: Automatic Tracing
A simple decorator on your agent function captures everything:
import opik

@opik.track
def run_agent(user_input: str) -> str:
    # All agent logic goes here. `agent` is your framework's agent object
    # (LangGraph, CrewAI, ...), assumed to be defined elsewhere.
    result = agent.invoke({"input": user_input})
    return result["output"]

Every LLM call, every tool invocation, every retrieval step is recorded. The active configuration at execution time is associated with the trace, making failing inputs fully reproducible.
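To make concrete what a tracing decorator captures, here is a minimal standard-library sketch. It is not Opik's implementation: `track_sketch`, `TRACES`, and `ACTIVE_CONFIG` are hypothetical names, and a real tracer would also record nested LLM and tool calls and ship them to a backend.

```python
import functools
import time

# Hypothetical in-memory store and config (assumptions for this sketch).
TRACES = []
ACTIVE_CONFIG = {"model": "example-model", "prompt_version": 3}

def track_sketch(func):
    """Record inputs, output or error, duration, and a config snapshot per call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        record = {
            "name": func.__name__,
            "input": {"args": args, "kwargs": kwargs},
            # Copy the config so later changes do not alter this record.
            "config": dict(ACTIVE_CONFIG),
        }
        try:
            record["output"] = func(*args, **kwargs)
            return record["output"]
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            # Runs on both success and failure, so every call leaves a trace.
            record["duration_s"] = time.perf_counter() - start
            TRACES.append(record)
    return wrapper

@track_sketch
def run_agent(user_input: str) -> str:
    # Placeholder for real agent logic.
    return user_input.upper()

run_agent("hello")
```

Storing the configuration snapshot next to the input is what makes a failing call replayable later: the same input can be rerun under the same settings.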
Layer 2: Ollie, the Diagnosis Agent
Ollie is an embedded coding agent that reads failing traces and source code, identifies the problematic lines, generates a diff, and waits for developer approval before applying the fix.
The full workflow:
- Ollie reads the failing trace and source code
- It identifies problematic spans in the trace tree
- It generates a diff for approval
- Once approved, it reruns the agent in the sandbox against the original failing input
- It produces a before/after trace comparison
- It automatically registers the original failure as a regression test
What distinguishes Ollie from a simple code reviewer: it reasons about trace spans, not just code lines. It can therefore diagnose behavioral errors (bad routing decision, tool hallucination) that code alone does not reveal.
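As a rough illustration of what "reasoning about trace spans" can mean, the sketch below walks a span tree and reports the deepest spans that carry an error. The `Span` structure and `failing_leaves` helper are assumptions made for this example; they do not describe Ollie's internals or Opik's trace schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Span:
    name: str
    kind: str  # e.g. "llm", "tool", "retrieval"
    error: Optional[str] = None
    children: List["Span"] = field(default_factory=list)

def failing_leaves(span, path=()):
    """Yield (path, span) for the deepest spans that carry an error."""
    here = path + (span.name,)
    failing_children = [
        pair for child in span.children for pair in failing_leaves(child, here)
    ]
    if failing_children:
        # A failing descendant is a more precise culprit than its parent.
        yield from failing_children
    elif span.error is not None:
        yield here, span

# Hypothetical trace: the run failed because a tool got a hallucinated argument.
trace = Span("run_agent", "agent", error="bad final answer", children=[
    Span("router", "llm"),
    Span("search_tool", "tool", error="called with unknown argument 'deal_ref'"),
    Span("answer", "llm"),
])

culprits = list(failing_leaves(trace))
```

Here `culprits` points at `run_agent -> search_tool` rather than at the top-level failure, which is the kind of behavioral localization that reading the source code alone would not provide.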
Layer 3: Plain-English Test Suites
Instead of numerical metrics on labeled datasets, you write clear assertions:
"The response must include specific deal details, not just a count"
"The response must never reveal unauthorized information"
"The agent must call the search tool before answering on recent data"
Opik converts these assertions into LLM-as-a-judge checks (pass/fail). The key point: failing production traces are automatically added as new test cases. The suite grows organically with each incident. The harness becomes progressively harder to break over time.
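The mechanics of a plain-English suite can be sketched in a few lines. Everything here is hypothetical (`Case`, `Suite`, `stub_judge`), and the judge is a keyword rule standing in for a real LLM call; it only shows how pass/fail assertions and captured production failures fit together, not how Opik implements them.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Case:
    user_input: str
    assertions: List[str]

@dataclass
class Suite:
    judge: Callable[[str, str], bool]  # (assertion, response) -> pass/fail
    cases: List[Case] = field(default_factory=list)

    def add_production_failure(self, user_input, assertions):
        """Turn a failing production input into a permanent test case."""
        self.cases.append(Case(user_input, list(assertions)))

    def run(self, agent):
        """Return (input, assertion, passed) for every assertion of every case."""
        results = []
        for case in self.cases:
            response = agent(case.user_input)
            for assertion in case.assertions:
                results.append(
                    (case.user_input, assertion, self.judge(assertion, response))
                )
        return results

def stub_judge(assertion: str, response: str) -> bool:
    # Stand-in for an LLM judge: a real one would prompt a model with both texts.
    if "deal details" in assertion:
        return "Acme" in response
    return True

def broken_agent(user_input: str) -> str:
    return "You have 3 open deals."

def fixed_agent(user_input: str) -> str:
    return "You have 3 open deals, including Acme (renewal, due Friday)."

suite = Suite(judge=stub_judge)
suite.add_production_failure(
    "List my open deals",
    ["The response must include specific deal details, not just a count"],
)
before = suite.run(broken_agent)
after = suite.run(fixed_agent)
```

Because the failing input stays in `suite.cases`, every later change to the agent is checked against it again.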
Layer 4: Full Agent Sandbox
Unlike prompt playgrounds that only test a single LLM call in isolation, the Opik sandbox runs the full agent end-to-end inside the UI. You can swap models, change prompts, or add tools and observe the effects across the entire agent graph.
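The difference from a single-call playground can be shown with a small sketch that reruns a whole (toy) agent under several configurations. `run_variants`, `build_agent`, and the config keys are illustrative assumptions, not part of Opik's sandbox.

```python
def run_variants(build_agent, base_config, variants, user_input):
    """Run the full agent once per config variant and collect outputs by name."""
    outputs = {}
    for name, overrides in variants.items():
        config = {**base_config, **overrides}  # variant values override the base
        outputs[name] = build_agent(config)(user_input)
    return outputs

def build_agent(config):
    # Hypothetical two-step agent: a retrieval step, then an answer step.
    def agent(user_input):
        context = f"docs for '{user_input}'" if config["use_search"] else "no context"
        return f"[{config['model']}] {config['prompt']} | {context}"
    return agent

base = {"model": "model-a", "prompt": "Answer briefly.", "use_search": True}
results = run_variants(
    build_agent,
    base,
    {
        "baseline": {},
        "other_model": {"model": "model-b"},
        "no_search": {"use_search": False},
    },
    "open deals",
)
```

Each variant exercises every step of the agent, so a change in one place (for example disabling the search tool) shows its effect on the final output.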
The Complete Self-Repair Loop
This is the central insight of Avi Chawla's original article: the power is not in any individual layer, it is in their integration.
Production incident
↓
Ollie reads trace + code, proposes a fix
↓
Developer approves
↓
Ollie replays the agent in the sandbox against the original failing input
↓
Fix validated → saved as agent blueprint
↓
Environment pointer moved to staging
↓
Original failure registered as permanent regression test
Each resolved incident reinforces the harness. The system repairs itself and hardens over time instead of eroding.
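The loop above can be condensed into a short sketch. All names (`Incident`, `Harness`, `repair`) are hypothetical, the check is a stand-in for an LLM judge, and the blueprint and environment-pointer steps are reduced to swapping `self.agent`; treat it as an illustration of the control flow rather than Opik's mechanism.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Incident:
    failing_input: str
    expectation: str  # plain-English assertion the output violated

@dataclass
class Harness:
    agent: Callable[[str], str]
    check: Callable[[str, str], bool]  # (expectation, output) -> pass/fail
    regression_tests: List[Incident] = field(default_factory=list)

    def repair(self, incident, propose_fix, approve) -> bool:
        candidate = propose_fix(incident)           # diagnosis proposes a patched agent
        if not approve(candidate):                  # human approval gate
            return False
        output = candidate(incident.failing_input)  # replay the original failing input
        if not self.check(incident.expectation, output):
            return False                            # fix not validated, nothing changes
        self.agent = candidate                      # promote the validated fix
        self.regression_tests.append(incident)      # failure becomes a permanent test
        return True

    def run_regressions(self) -> bool:
        return all(
            self.check(t.expectation, self.agent(t.failing_input))
            for t in self.regression_tests
        )

def check(expectation: str, output: str) -> bool:
    # Stand-in for an LLM-as-a-judge call.
    return "Acme" in output

def broken_agent(user_input: str) -> str:
    return "You have 3 open deals."

def fixed_agent(user_input: str) -> str:
    return "3 open deals: Acme, Globex, Initech."

incident = Incident(
    "List my open deals",
    "The response must include specific deal details, not just a count",
)
harness = Harness(agent=broken_agent, check=check)
repaired = harness.repair(
    incident,
    propose_fix=lambda inc: fixed_agent,
    approve=lambda candidate: True,
)
```

A fix is promoted only after it passes on the input that originally failed, and the incident is registered in the same step, so the two cannot drift apart.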
Installation and Self-Hosting
Self-hosting matters for teams with GDPR or data sovereignty constraints: agent traces often contain user data (queries, conversation context, outputs). Keeping them on your own infrastructure avoids any transfer to an external provider.
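# Local startup with Docker
# (assumes a compose file at the repository root; if this fails,
# check the repository README for the current startup instructions)
git clone https://github.com/comet-ml/opik
cd opik
docker compose up -d

# Or via pip for Python integration
pip install opik

Minimal configuration to connect your agent: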
import opik

opik.configure(
    url="http://localhost:5173",  # your self-hosted instance
    api_key="local"
)

Opik also supports a cloud mode (app.comet.com/opik) if self-hosting is not a constraint.
When Is It Worth Setting Up Opik?
A few signals that the manual debug-and-patch loop is getting expensive:
- The same types of failures return after each model update
- Agent PR reviews take longer than regular code reviews because behavior is hard to predict
- You have no regression tests for cases that already failed in production
- Every prompt change requires manual testing on a sample of conversations
A harness that covers 20 plain-English test scenarios and automatically captures production failures can save weeks of reactive debugging.
TL;DR
Opik solves the problem of observability tools that do not close the loop: automatic tracing, AI-powered diagnosis (Ollie), plain-English test suites, full agent sandbox. The core insight: every production incident becomes a permanent regression test. The harness repairs and solidifies over time. Self-hostable, compatible with LangGraph and CrewAI, 19,000+ GitHub stars.
About the author
Pierre Kasparian
4th-year engineering student at UTT (University of Technology of Troyes) and AI integration freelancer. He deploys LLMs, RAG pipelines, and AI agents for French and European companies, with strong expertise in GDPR compliance and European hosting. 15+ client projects, including Pretto and LiveSession.