Name: Pierre Kasparian - Intégration IA freelance

An AI agent fails in production. You open the traces, find which step broke, form a hypothesis, apply a patch, redeploy. Two weeks later, a model update breaks something different, and the loop starts over.

Direct answer: the problem is not the observability tooling, which is usually fine. The problem is that traces, tests, and debugging are separate silos. None of them closes the loop between "something broke in production" and "this thing won't break again." Opik connects these layers into a single self-repairing loop.

The Core Problem: Observability Without Repair

Most production AI agent stacks give you visibility into what happened: latency, tokens consumed, tool calls, LangChain or CrewAI traces. That is useful. But it does not say why something broke, and it does nothing to prevent it from breaking again.

The classic cycle when an agent fails:

Alert or ticket
Manual dive into traces
Hypothesis about the cause
Patch
Hope the regression does not return

This cycle repeats with every model upgrade, every new tool, every prompt change. The larger the harness grows, the heavier the maintenance burden.

The 4 Layers of Opik

Opik is an open-source platform (19,000+ GitHub stars, self-hostable with 3 commands) that connects observability, automated diagnosis, testing, and sandboxing into one loop. It supports LangGraph, CrewAI, and around fifty other frameworks.

Layer 1: Automatic Tracing

A simple decorator on your agent function captures everything:

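import opik

# `agent` is assumed to be your existing agent object (for example a LangChain
# or LangGraph runnable); it is not defined in this snippet.
@opik.track
def run_agent(user_input: str) -> str:
    # All agent logic runs inside the tracked function
    result = agent.invoke({"input": user_input})
    return result["output"]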

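The decorated call is recorded as a trace with its inputs and outputs. LLM calls, tool invocations, and retrieval steps show up as nested spans when they are themselves tracked or go through one of Opik's framework integrations. The active configuration at execution time is associated with the trace, which makes failing inputs reproducible.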
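To make the idea of nested spans concrete, here is a minimal standard-library sketch of what a tracing decorator does conceptually. It is not Opik's implementation: `Span`, `track`, `ROOT_SPANS`, and `search_tool` are hypothetical names used only for illustration.

```python
import functools
import time
from contextvars import ContextVar
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class Span:
    name: str
    inputs: dict
    output: Any = None
    error: Optional[str] = None
    duration_s: float = 0.0
    children: List["Span"] = field(default_factory=list)


# The span currently being recorded, so nested calls attach to their parent.
_current_span: ContextVar[Optional[Span]] = ContextVar("current_span", default=None)
ROOT_SPANS: List[Span] = []  # top-level traces, one per outermost tracked call


def track(fn: Callable) -> Callable:
    """Record each call of `fn` as a span nested under the active span."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        span = Span(name=fn.__name__, inputs={"args": args, "kwargs": kwargs})
        parent = _current_span.get()
        if parent is not None:
            parent.children.append(span)
        else:
            ROOT_SPANS.append(span)
        token = _current_span.set(span)
        start = time.perf_counter()
        try:
            span.output = fn(*args, **kwargs)
            return span.output
        except Exception as exc:
            span.error = repr(exc)  # keep the failure on the span, then re-raise
            raise
        finally:
            span.duration_s = time.perf_counter() - start
            _current_span.reset(token)

    return wrapper


@track
def search_tool(query: str) -> str:
    return f"results for {query}"


@track
def run_agent(user_input: str) -> str:
    return search_tool(user_input).upper()


run_agent("q3 deals")
```

After the call, `ROOT_SPANS[0]` holds the `run_agent` span with `search_tool` as its only child, which is the tree shape the later layers reason about.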

Layer 2: Ollie, the Diagnosis Agent

Ollie is an embedded coding agent that reads failing traces and source code, identifies the problematic lines, generates a diff, and waits for developer approval before applying the fix.

The full workflow:

Ollie reads the failing trace and source code
It identifies problematic spans in the trace tree
It generates a diff for approval
Once approved, it reruns the agent in the sandbox against the original failing input
It produces a before/after trace comparison
It automatically registers the original failure as a regression test

What distinguishes Ollie from a simple code reviewer: it reasons about trace spans, not just code lines. It can therefore diagnose behavioral errors (bad routing decision, tool hallucination) that code alone does not reveal.
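As a rough illustration of span-level reasoning, the sketch below walks a trace tree and flags two things that source code alone would not show: the span where an error originated, and a call to a tool that is not registered. This is a deliberately simple rule-based stand-in, not how Ollie works internally; the trace layout, `KNOWN_TOOLS`, and `suspicious_spans` are assumptions made for the example.

```python
from typing import Iterator, Set, Tuple

KNOWN_TOOLS: Set[str] = {"search", "crm_lookup"}  # hypothetical tool registry

Finding = Tuple[Tuple[str, ...], str]  # (path to the span, reason)


def suspicious_spans(span: dict, path: Tuple[str, ...] = ()) -> Iterator[Finding]:
    """Yield (path, reason) for spans that look like the source of a failure."""
    here = path + (span["name"],)
    children = span.get("children", [])

    # Behavioral check: the agent called a tool that does not exist.
    if span.get("type") == "tool" and span["name"] not in KNOWN_TOOLS:
        yield here, "call to unregistered tool (possible hallucination)"

    # Error check: report only the deepest span carrying the error,
    # not every parent that it propagated through.
    if span.get("error") and not any(c.get("error") for c in children):
        yield here, f"error originated here: {span['error']}"

    for child in children:
        yield from suspicious_spans(child, here)


trace = {
    "name": "run_agent",
    "type": "general",
    "error": "ToolError",
    "children": [
        {"name": "router", "type": "llm"},
        {"name": "search", "type": "tool", "error": "ToolError"},
        {"name": "send_email", "type": "tool"},
    ],
}

findings = list(suspicious_spans(trace))
```

Here `findings` points at `run_agent > search` for the error and at `run_agent > send_email` for the unregistered tool call, while the root span is skipped because its error came from a child.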

Layer 3: Plain-English Test Suites

Instead of numerical metrics on labeled datasets, you write clear assertions:

"The response must include specific deal details, not just a count"
"The response must never reveal unauthorized information"
"The agent must call the search tool before answering on recent data"

Opik converts these assertions into LLM-as-a-judge checks (pass/fail). The key point: failing production traces are automatically added as new test cases. The suite grows organically with each incident. The harness becomes progressively harder to break over time.
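The conversion from a plain-English assertion to a pass/fail check can be pictured roughly as follows. This is a generic LLM-as-a-judge sketch, not Opik's API: `build_prompt`, `evaluate`, and `Verdict` are hypothetical names, and `stub_judge` stands in for a real model call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List

Judge = Callable[[str], str]  # takes a grading prompt, returns the raw reply


@dataclass
class Verdict:
    assertion: str
    passed: bool


def build_prompt(assertion: str, user_input: str, response: str) -> str:
    return (
        "You are grading an AI agent's response.\n"
        f"Requirement: {assertion}\n"
        f"User input: {user_input}\n"
        f"Agent response: {response}\n"
        "Reply with exactly PASS or FAIL."
    )


def evaluate(assertions: List[str], user_input: str, response: str, judge: Judge) -> List[Verdict]:
    """Run one judge call per assertion; anything other than PASS counts as a failure."""
    verdicts = []
    for assertion in assertions:
        reply = judge(build_prompt(assertion, user_input, response)).strip().upper()
        verdicts.append(Verdict(assertion, reply.startswith("PASS")))
    return verdicts


def stub_judge(prompt: str) -> str:
    # Stand-in for a real LLM call: passes only if the response names a deal.
    response_line = next(
        line for line in prompt.splitlines() if line.startswith("Agent response:")
    )
    return "PASS" if "Acme" in response_line else "FAIL"


assertions = ["The response must include specific deal details, not just a count"]
good = evaluate(assertions, "Which deals closed?", "Acme renewal, $40k, closed May 2", stub_judge)
bad = evaluate(assertions, "Which deals closed?", "3 deals closed.", stub_judge)
```

Treating every reply other than PASS as a failure is a conservative design choice: a judge that answers ambiguously never lets a test slip through.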

Layer 4: Full Agent Sandbox

Unlike prompt playgrounds that only test a single LLM call in isolation, the Opik sandbox runs the full agent end-to-end inside the UI. You can swap models, change prompts, or add tools and observe the effects across the entire agent graph.
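The same idea can be approximated in code: run the whole agent against one input under several configuration variants and compare the outputs side by side. The sketch below is only an illustration of that pattern, not the Opik sandbox; `AgentConfig`, `compare_variants`, and `stub_agent` are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class AgentConfig:
    model: str
    system_prompt: str
    tools: Tuple[str, ...]


AgentFn = Callable[[AgentConfig, str], str]


def compare_variants(
    run_agent: AgentFn,
    base: AgentConfig,
    overrides: Dict[str, dict],
    user_input: str,
) -> Dict[str, str]:
    """Run the full agent once per variant and collect outputs by label."""
    results = {"baseline": run_agent(base, user_input)}
    for label, changes in overrides.items():
        # replace() copies the frozen config with only the named fields changed
        results[label] = run_agent(replace(base, **changes), user_input)
    return results


def stub_agent(config: AgentConfig, user_input: str) -> str:
    # Stand-in for an end-to-end agent run
    return f"{config.model}|{len(config.tools)} tools|{user_input}"


base = AgentConfig(model="model-a", system_prompt="Be concise.", tools=("search",))
results = compare_variants(
    stub_agent,
    base,
    {
        "new-model": {"model": "model-b"},
        "extra-tool": {"tools": ("search", "crm_lookup")},
    },
    "q3 deals",
)
```

Keeping the configuration immutable means each variant differs from the baseline only in the fields named in its override, so any change in behavior can be attributed to that change.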

The Complete Self-Repair Loop

This is the central insight of Avi Chawla's original article: the power is not in any individual layer, it is in their integration.

Production incident
        ↓
Ollie reads trace + code, proposes a fix
        ↓
Developer approves
        ↓
Ollie replays in sandbox against original failing input
        ↓
Fix validated → saved as agent blueprint
        ↓
Environment pointer moved to staging
        ↓
Original failure registered as permanent regression test

Each resolved incident reinforces the harness. The system repairs itself and hardens over time instead of eroding.
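A minimal sketch of that loop, under simplifying assumptions: a fix is accepted only if it was approved and the patched agent passes on the original failing input, and an accepted incident joins the regression suite. The blueprint and environment-pointer steps are left out. `Incident`, `Harness`, and the stub `check` are hypothetical and do not reflect Opik's API.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Agent = Callable[[str], str]
Check = Callable[[str, str], bool]  # (expectation, output) -> passed?


@dataclass
class Incident:
    failing_input: str
    expectation: str  # plain-English assertion the output violated


@dataclass
class Harness:
    regression_suite: List[Incident] = field(default_factory=list)

    def resolve(self, incident: Incident, candidate: Agent, approved: bool, check: Check) -> bool:
        """Replay the failing input on the candidate fix; register it if it passes."""
        if not approved:
            return False
        output = candidate(incident.failing_input)
        if not check(incident.expectation, output):
            return False
        self.regression_suite.append(incident)
        return True

    def run_suite(self, agent: Agent, check: Check) -> List[Incident]:
        """Return every past incident that the given agent would reintroduce."""
        return [
            i for i in self.regression_suite
            if not check(i.expectation, agent(i.failing_input))
        ]


def check(expectation: str, output: str) -> bool:
    # Stand-in for an LLM-as-a-judge call
    return "acme" in output.lower()


def old_agent(question: str) -> str:
    return "3 deals closed."


def patched_agent(question: str) -> str:
    return "Acme renewal closed."


harness = Harness()
incident = Incident(
    "Which deals closed?",
    "The response must include specific deal details, not just a count",
)
rejected = harness.resolve(incident, old_agent, approved=True, check=check)
accepted = harness.resolve(incident, patched_agent, approved=True, check=check)
regressions = harness.run_suite(old_agent, check)
```

Running the suite against `old_agent` surfaces the original incident again, which is the property that keeps a later model or prompt change from silently reintroducing it.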

Installation and Self-Hosting

Self-hosting is an important point for teams with GDPR or data sovereignty constraints: your agent traces often contain user data (queries, conversation context, outputs). Keeping them on your infrastructure avoids any transfer to an external provider.

# Local startup with Docker (the repository's helper script wraps Docker Compose)
git clone https://github.com/comet-ml/opik.git
cd opik
./opik.sh

# Python SDK for instrumenting your agent
pip install opik

Minimal configuration to connect your agent:

import opik

# Point the SDK at a self-hosted instance; no API key is needed in local mode
opik.configure(
    use_local=True,
    url="http://localhost:5173",  # your self-hosted instance
)

Opik also offers a cloud mode (app.comet.com/opik) if self-hosting is not a requirement.

When Is It Worth Setting Up Opik?

A few signals that the manual loop is getting expensive:

The same types of failures return after each model update
Agent PR reviews take longer than regular code reviews because behavior is hard to predict
You have no regression tests for cases that already failed in production
Every prompt change requires manual testing on a sample of conversations

A harness that covers around 20 plain-English test scenarios and automatically captures production failures can save weeks of reactive debugging.

TL;DR

Opik addresses the problem of observability tools that do not close the loop by combining automatic tracing, AI-powered diagnosis (Ollie), plain-English test suites, and a full agent sandbox. The core insight: every production incident becomes a permanent regression test, so the harness repairs itself and hardens over time. Self-hostable, compatible with LangGraph and CrewAI, 19,000+ GitHub stars.

Building AI agents in production with reliability or compliance requirements? Let's talk about your architecture.

AI Agent Harness: How to Make It Self-Repairing