Name: Pierre Kasparian - Intégration IA freelance

Langfuse is an open-source observability platform for LLM applications: it lets teams version prompts, build annotated datasets (input / expected output), and orchestrate evaluations while tracking scores over time. At Pretto, Langfuse was already in place to manage prompts and datasets. The problem: evaluations themselves ran through a shared Google Colab notebook, and each prompt had its own Python file with its own evaluation logic. Adding a new use case meant filing a ticket with the Data team, waiting for a developer to write the script, then opening the Colab to run the evaluation — a flow that was too slow and too fragile.

I unified all of this into a single generic engine driven by configuration, and replaced the Colab with a Slack interface (Langfuse's native integration options did not connect well to our internal API, which was heavily customised for our specific use cases). The result: business teams can now evaluate their own prompts autonomously, with no Data team ticket required, and the platform went on to enable automatic prompt self-improvement in the next project.

Before: a platform that undermined its own adoption

The platform existed and worked — but its architecture discouraged anyone from using it.

One Python file per prompt. For every use case, the Data team had written a dedicated script: one for the chatbot, one for classification, one for data extraction. Reasonable at three use cases, unmanageable at fifteen.

The ticket-plus-Colab friction. When a business team wanted to test a new prompt, they had to file a ticket, wait for a developer to write the associated script, then go into a shared Google Colab to run the evaluation. Two to three days minimum for something that should take five minutes — and the Colab still required editing cells to change the prompt or dataset being evaluated.

No adoption. A platform that is expensive to use is a platform nobody really uses. Prompts were shipped to production without systematic evaluation; regressions were discovered late, often after an incident.

The goal: zero friction between "new prompt" and "evaluation running"

The refactoring had a single objective: let a business team create a prompt and dataset in Langfuse, then run the evaluation without ever opening a code editor — or a Colab.

Concretely:

One Python script driven by configuration (prompt ID, dataset, metrics)
Metrics selectable at runtime, not hardcoded into the script
A Slack interface to trigger evaluations (Langfuse's native integration options did not connect to our heavily customised internal API)
Langfuse as the sole entry point for managing prompts and datasets
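
To make the list above concrete, here is a minimal sketch of a configuration object with metrics resolved by name at runtime. The names `EvalConfig`, `METRIC_REGISTRY`, `parse_eval_command`, and the `key=value` command syntax are all assumptions made for illustration, not the Pretto implementation; the two metrics are simple stand-ins.

```python
from dataclasses import dataclass
from typing import Callable

# In this sketch a metric is any callable (input, output, expected) -> score in [0, 1].
Metric = Callable[[str, str, str], float]


def exact_match(_input: str, output: str, expected: str) -> float:
    """Return 1.0 when output and expected match after trimming whitespace."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def contains_expected(_input: str, output: str, expected: str) -> float:
    """Return 1.0 when the expected text appears somewhere in the output."""
    return 1.0 if expected.strip().lower() in output.lower() else 0.0


# Hypothetical registry: metrics are looked up by name, not hardcoded per script.
METRIC_REGISTRY: dict[str, Metric] = {
    "exact_match": exact_match,
    "contains_expected": contains_expected,
}


@dataclass(frozen=True)
class EvalConfig:
    prompt_name: str
    dataset_name: str
    metric_names: tuple[str, ...]
    model: str = "gpt-4o"

    def resolve_metrics(self) -> dict[str, Metric]:
        """Map the requested names to callables, rejecting unknown names."""
        unknown = [n for n in self.metric_names if n not in METRIC_REGISTRY]
        if unknown:
            raise ValueError(f"Unknown metrics: {', '.join(unknown)}")
        return {n: METRIC_REGISTRY[n] for n in self.metric_names}


def parse_eval_command(text: str) -> EvalConfig:
    """Parse 'prompt=... dataset=... metrics=a,b [model=...]' into a config."""
    fields = dict(part.split("=", 1) for part in text.split() if "=" in part)
    missing = {"prompt", "dataset", "metrics"} - fields.keys()
    if missing:
        raise ValueError(f"Missing fields: {', '.join(sorted(missing))}")
    return EvalConfig(
        prompt_name=fields["prompt"],
        dataset_name=fields["dataset"],
        metric_names=tuple(fields["metrics"].split(",")),
        model=fields.get("model", "gpt-4o"),
    )


config = parse_eval_command(
    "prompt=sms-chatbot dataset=sms-chatbot-v1 metrics=exact_match,contains_expected"
)
print(config.prompt_name, list(config.resolve_metrics()))
```

With this shape, adding a use case means adding a prompt and a dataset in Langfuse and passing their names; adding a metric means one new registry entry.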

The refactoring: a generic engine driven by config

The principle is straightforward: extract everything that varied across files (the prompt, the dataset, the metrics) and turn each into a parameter of a single function.

def run_evaluation(
    prompt_name: str,
    dataset_name: str,
    metrics: list[EvalMetric],
    model: str = "gpt-4o",
) -> EvaluationReport:
    # Prompt and dataset both come from Langfuse, looked up by name.
    prompt = langfuse.get_prompt(prompt_name)
    dataset = langfuse.get_dataset(dataset_name)

    results = []
    for item in dataset.items:
        # Run the prompt on the item, then apply every requested metric.
        output = run_prompt(prompt, item.input, model)
        scores = {m.name: m.score(item.input, output, item.expected_output) for m in metrics}
        results.append(EvaluationResult(item=item, output=output, scores=scores))

    return EvaluationReport(results=results)

Each EvalMetric implements a common interface: it takes the input, the produced output, and the expected output, and returns a score between 0 and 1. This makes it possible to mix deterministic metrics and LLM judges in the same evaluation run, depending on the task.
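
The interface described above might look like the following sketch. The exact shape of `EvalMetric` is an assumption inferred from how `run_evaluation` calls it (`m.name`, `m.score(...)`); `ExactMatch`, `JudgeMetric`, `score_all`, and `fake_judge` are hypothetical names. The judge here is a token-overlap stand-in so the example runs offline; a real LLM call would be injected in its place.

```python
from typing import Callable, Protocol


class EvalMetric(Protocol):
    """Assumed common interface: a name plus a score() returning a value in [0, 1]."""

    name: str

    def score(self, input_text: str, output: str, expected_output: str) -> float:
        ...


class ExactMatch:
    """Deterministic metric: 1.0 on an exact (whitespace-trimmed) match."""

    name = "exact_match"

    def score(self, input_text: str, output: str, expected_output: str) -> float:
        return 1.0 if output.strip() == expected_output.strip() else 0.0


class JudgeMetric:
    """Wraps any judge callable and clamps its raw score to [0, 1]."""

    name = "judge"

    def __init__(self, judge: Callable[[str, str, str], float]):
        self._judge = judge

    def score(self, input_text: str, output: str, expected_output: str) -> float:
        raw = self._judge(input_text, output, expected_output)
        return min(1.0, max(0.0, float(raw)))


def score_all(metrics, input_text, output, expected_output) -> dict[str, float]:
    """Apply every metric to the same (input, output, expected) triple."""
    return {m.name: m.score(input_text, output, expected_output) for m in metrics}


def fake_judge(input_text: str, output: str, expected_output: str) -> float:
    """Offline stand-in for an LLM judge: share of expected tokens found in the output."""
    out = set(output.lower().split())
    exp = set(expected_output.lower().split())
    return len(out & exp) / len(exp) if exp else 1.0


scores = score_all(
    [ExactMatch(), JudgeMetric(fake_judge)],
    input_text="When is my appointment?",
    output="Your appointment is on Tuesday",
    expected_output="Your appointment is on Tuesday at 10am",
)
print(scores)
```

Because both classes expose the same `name` and `score` members, the engine does not need to know which kind of metric it is running.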

Concrete example: evaluating an SMS chatbot with tool calling

The most technically interesting case was the customer chatbot. It communicates via SMS and can call several business tools: book an appointment, look up a broker's available slots, fetch information on a file.

The measurement problem. How do you automatically evaluate "the bot called the right tools at the right time"? An LLM judge is expensive and hard to reproduce. An exact comparison is too strict (the order of calls can vary without being an error).

The metric we adopted: coherence × Jaccard. After analysing the options, I proposed a two-component metric:

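def jaccard_tool_score(expected_tools: set[str], called_tools: set[str]) -> float:
    # Nothing expected and nothing called is a perfect match.
    if not expected_tools and not called_tools:
        return 1.0
    intersection = expected_tools & called_tools
    union = expected_tools | called_tools
    return len(intersection) / len(union)

def longest_common_subsequence(a: list[str], b: list[str]) -> int:
    # Helper assumed by coherence_score: LCS length via row-wise dynamic programming.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def coherence_score(expected_sequence: list[str], called_sequence: list[str]) -> float:
    # Two empty sequences are trivially coherent; without this guard the final
    # product would score "no tool expected, none called" as 0.
    if not expected_sequence and not called_sequence:
        return 1.0
    lcs_len = longest_common_subsequence(expected_sequence, called_sequence)
    return lcs_len / max(len(expected_sequence), len(called_sequence))

def tool_calling_score(item: DatasetItem, output: ChatOutput) -> float:
    jaccard = jaccard_tool_score(set(item.expected_tools), set(output.called_tools))
    coherence = coherence_score(item.expected_tools, output.called_tools)
    return jaccard * coherence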

Jaccard measures coverage: were the right tools called, even if the order differs?
Coherence measures the sequence: are tools called in a logical order?
The product doubly penalises serious errors (wrong tool + wrong order) without over-penalising a simple inversion.
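
A few worked cases help calibrate what the product yields. The sketch below re-implements the score on plain lists so it runs on its own; the tool names are hypothetical, `lcs_length` is one standard way to compute the longest common subsequence, and treating "nothing expected, nothing called" as a perfect score is an assumption of this sketch.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence (row-wise dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def tool_calling_score(expected: list[str], called: list[str]) -> float:
    """Jaccard (coverage) multiplied by normalised LCS (order)."""
    if not expected and not called:
        return 1.0  # assumption: no tool expected and none called is a perfect run
    e, c = set(expected), set(called)
    jaccard = len(e & c) / len(e | c)
    coherence = lcs_length(expected, called) / max(len(expected), len(called))
    return jaccard * coherence


expected = ["get_slots", "book_appointment"]  # hypothetical tool names

cases = {
    "identical": ["get_slots", "book_appointment"],                   # 1.0
    "two-tool inversion": ["book_appointment", "get_slots"],          # 0.5
    "one tool missing": ["get_slots"],                                # 0.25
    "extra tool": ["get_slots", "get_file_info", "book_appointment"],
    "wrong tool only": ["get_file_info"],                             # 0.0
}

for label, called in cases.items():
    print(f"{label}: {tool_calling_score(expected, called):.3f}")
```

Working through the arithmetic: an inversion keeps Jaccard at 1 and only lowers coherence (1/2 for two tools, 2/3 for a swap of the last two tools in a three-tool sequence), whereas a missing tool lowers both factors (1/2 × 1/2 = 0.25), and an extra tool gives 2/3 × 2/3 ≈ 0.444.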

This metric was adopted after a team discussion. It is computable without an LLM, deterministic, and easy to interpret: a score of 0.8 means something concrete.

Results

Business team adoption. Before the refactoring, evaluation was the exclusive domain of the Data team. Afterwards, non-developers started evaluating their own prompts autonomously. Some business team members took it up on their own, without being asked.

Fewer production regressions. Systematic evaluation before deployment caught silent regressions that manual testing missed. Production regressions fell by 80% over the period, a drop that coincides with the rollout of systematic evaluation.

An extensible platform. The generic engine made the next project possible: a prompt self-improvement pipeline that uses exactly the same evaluation infrastructure to score each improved version.

What this teaches about industrialising LLM evaluation

Four observations from this project, applicable to any team that wants to evaluate its prompts seriously.

Evaluation must cost less than not evaluating. If running an evaluation requires a ticket and two days, people will not use it. The investment in a generic engine pays off as soon as you have more than three active use cases.

Deterministic metrics first. For structured tasks (classification, tool calling, JSON extraction), a metric that needs no LLM is faster, cheaper, and more reproducible than an LLM judge. Reserve LLM-as-a-judge for genuinely qualitative cases.
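
As an illustration for the JSON extraction case, here is a minimal sketch of a deterministic metric: the fraction of expected fields reproduced exactly. The function name `json_field_accuracy` and the field names are hypothetical, and scoring unparseable output as 0 is a design choice of this sketch.

```python
import json


def json_field_accuracy(output: str, expected: dict) -> float:
    """Fraction of expected fields whose value is reproduced exactly."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0  # unparseable output counts as a complete miss
    if not isinstance(parsed, dict):
        return 0.0  # valid JSON, but not an object
    if not expected:
        return 1.0  # nothing to extract
    hits = sum(
        1 for key, value in expected.items()
        if key in parsed and parsed[key] == value
    )
    return hits / len(expected)


expected_fields = {"amount": 250000, "duration_years": 20, "city": "Lyon"}
model_output = '{"amount": 250000, "duration_years": 25, "city": "Lyon"}'

print(round(json_field_accuracy(model_output, expected_fields), 3))
```

Two of the three fields match here, so the score is 2/3, and the same input always produces the same score.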

A minimal dataset beats a perfect dataset that does not exist. Start with 20 hand-annotated cases. Imperfect evaluation on a small dataset is infinitely more useful than no evaluation while waiting for a 1000-case corpus.

Evaluation is a workflow, not a one-shot. It has value only when repeated at every prompt change, with results that are stored and comparable over time. That is precisely what Langfuse provides with its versioned datasets and run history.
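
Once run results are stored, comparing them across prompt versions is a small step. The sketch below assumes the per-item scores of each run have already been exported into a hypothetical `RunSummary` structure; it does not call Langfuse, and the 0.02 tolerance is an arbitrary illustrative value.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunSummary:
    prompt_version: int
    scores: list[float]  # one score per dataset item

    @property
    def mean_score(self) -> float:
        return mean(self.scores)


def find_regressions(
    history: list[RunSummary], tolerance: float = 0.02
) -> list[tuple[int, int, float]]:
    """Return (previous_version, version, delta) wherever the mean dropped by more than tolerance."""
    ordered = sorted(history, key=lambda run: run.prompt_version)
    regressions = []
    for previous, current in zip(ordered, ordered[1:]):
        delta = current.mean_score - previous.mean_score
        if delta < -tolerance:
            regressions.append((previous.prompt_version, current.prompt_version, delta))
    return regressions


history = [
    RunSummary(prompt_version=1, scores=[0.8, 0.9, 1.0]),
    RunSummary(prompt_version=2, scores=[0.9, 0.9, 1.0]),
    RunSummary(prompt_version=3, scores=[0.5, 0.9, 1.0]),
]

for prev_v, curr_v, delta in find_regressions(history):
    print(f"v{prev_v} -> v{curr_v}: mean score changed by {delta:+.3f}")
```

In this made-up history, only the step from version 2 to version 3 is flagged, since its mean falls from about 0.933 to 0.8.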

Pitfalls to avoid

One Python file per use case. That was the starting configuration here. It seems reasonable at 3 use cases but becomes unmanageable at 15. Start with a generic engine from day one, even an imperfect one.

A single universal metric. No score measures a conversational chatbot, a JSON extractor, and a classifier equally well. Three metrics tailored to their context beat one mediocre metric across the board.

LLM-as-a-judge using the same model you are evaluating. LLM judges tend to favour outputs from their own model family. If you are evaluating GPT-4o, use a judge from a different family, or a deterministic metric.

Not versioning datasets. A dataset that evolves without traceability makes historical comparisons useless. Langfuse handles this natively.

Conclusion: TL;DR

Refactoring an evaluation platform is not a code-quality project. It is an adoption project: making evaluation accessible to the teams that need it, with no friction. At Pretto, this meant autonomous business teams, fewer regressions, and infrastructure that enabled the next project.

Key takeaways:

A config-driven generic engine eliminates the friction between "new prompt" and "evaluation running"
Deterministic metrics (Jaccard, LCS) are faster, cheaper, and more reproducible than LLM judges for structured tasks
Evaluation only has value when it is systematic and its results are stored over time, not one-off
A solid evaluation platform is the prerequisite for prompt self-improvement

If you want to structure your prompt evaluation or give your AI team more autonomy, let's talk.
