Name: Pierre Kasparian - Intégration IA freelance

Your RAG pipeline retrieves 10 chunks, but the most relevant one often lands in 4th or 5th position. The bi-encoder encodes the query and each document separately and only compares the resulting embeddings, which is fast but not the most accurate way to measure relevance.

Direct answer: a cross-encoder reranker takes the (query, chunk) pair as input and scores them together, rather than comparing separate embeddings. It is used as a second stage: the bi-encoder retriever fetches the top-K candidates quickly, the cross-encoder reorders them with precision. This combination significantly improves response quality without changing your retrieval architecture.

Why Do Bi-Encoders Alone Miss Relevant Results?

Bi-encoders encode the query and each document separately, with no interaction between the two. Because each text is compressed into a single vector before any comparison happens, fine-grained matches between specific parts of the query and specific passages of the document can be lost. A cross-encoder analyses the (query, document) pair in a single model pass, capturing the interactions that bi-encoders cannot detect when comparing independently generated vectors.

Standard retrieval relies on bi-encoders: the query and each document are encoded separately into vectors, then the cosine similarity between them is measured.

This is fast and scalable. But it is approximate by design: both embeddings are produced independently, with no interaction between the query and the document content.
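
To make the "independent vectors" point concrete, here is a minimal sketch of bi-encoder scoring. The embeddings are made-up three-dimensional numbers chosen for illustration, not the output of a real model; the document ids are hypothetical too. The only thing the query and the documents share is the final cosine computation.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical document embeddings, computed once at indexing time.
doc_embeddings = {
    "refund_policy": [0.8, 0.1, 0.6],
    "shipping_times": [0.1, 0.9, 0.2],
}

# Hypothetical query embedding, computed without ever seeing the documents.
query_embedding = [0.7, 0.2, 0.7]

# Ranking is a pure vector comparison: no token of the query ever
# interacts with a token of the document.
ranking = sorted(
    doc_embeddings,
    key=lambda doc_id: cosine_similarity(query_embedding, doc_embeddings[doc_id]),
    reverse=True,
)
print(ranking)
```

Because the document vectors exist before the query arrives, this step can be served by a vector index, which is where the speed comes from.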

A concrete example: if the query is "refund timeline after cancellation" and a chunk contains "our cancellation policy provides a 14-business-day window for any refund", the bi-encoder may rank this chunk below less relevant ones. "Timeline" and "14-business-day window" share no words, and two independently produced vectors do not always capture that one answers the other.

A cross-encoder analyses the (query, chunk) pair together in a single model pass, capturing fine-grained semantic interactions between the two.

How Does a Cross-Encoder Work?

A cross-encoder concatenates the query and document with a separator token, then passes this pair through a transformer to produce a relevance score, usually normalised to between 0 and 1. Because the score depends on the query, document representations cannot be pre-computed, which rules the cross-encoder out for initial retrieval over thousands of documents but makes it ideal for precisely reordering a small set of candidates retrieved by the bi-encoder.

Instead of comparing vectors, the cross-encoder passes the concatenation [query] [SEP] [chunk] through a standard transformer and outputs a single relevance score, typically normalised to between 0 and 1.

Input  : "refund timeline after cancellation" [SEP] "our policy provides 14 days..."
Output : 0.94  ← relevance score

The main drawback is scalability: a cross-encoder cannot pre-encode documents, so it must process every (query, document) pair at query time. That makes it impractical for initial retrieval over thousands of documents.
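
A back-of-the-envelope calculation shows the scale of the problem. The sketch below assumes one forward pass per pair, scored sequentially, and a hypothetical cost of 5 ms per pair (roughly in line with the 200-500 ms this article reports for reranking 50 chunks with MiniLM on CPU). Real numbers depend on hardware, batching, and chunk length.

```python
def estimated_latency_ms(num_candidates: int, ms_per_pair: float) -> float:
    """One cross-encoder forward pass per (query, document) pair."""
    return num_candidates * ms_per_pair


MS_PER_PAIR = 5.0  # hypothetical figure, measure your own setup

# Scoring a whole corpus of 100,000 chunks for a single query.
full_corpus_ms = estimated_latency_ms(100_000, MS_PER_PAIR)

# Scoring only a shortlist of 50 candidates.
shortlist_ms = estimated_latency_ms(50, MS_PER_PAIR)

print(f"full corpus: {full_corpus_ms / 1000:.0f} s")  # full corpus: 500 s
print(f"shortlist:   {shortlist_ms:.0f} ms")          # shortlist:   250 ms
```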

Hence the two-stage architecture:

Fast retrieval (bi-encoder): fetch the top-50 or top-100 candidates via standard vector search.
Precise reranking (cross-encoder): reorder these candidates by relevance score, keep the top-5 or top-10 for the LLM.
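
The two stages can be sketched independently of any library. In the example below, `word_overlap` and `toy_precise_score` are hypothetical stand-ins for the bi-encoder and the cross-encoder respectively. Neither is a real model; they only exist to show how a cheap first pass and a more careful second pass can disagree on the order.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def two_stage_search(
    query: str,
    corpus: list[str],
    cheap_score: Scorer,
    precise_score: Scorer,
    top_k: int = 50,
    top_n: int = 5,
) -> list[str]:
    # Stage 1: score the whole corpus cheaply, keep the top_k candidates.
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:top_k]
    # Stage 2: rescore only those candidates, keep the top_n.
    reranked = sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)
    return reranked[:top_n]


def word_overlap(query: str, doc: str) -> float:
    """Stand-in for stage 1: number of distinct shared words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def toy_precise_score(query: str, doc: str) -> float:
    """Stand-in for stage 2: shared words as a fraction of the document."""
    doc_words = set(doc.lower().split())
    return len(set(query.lower().split()) & doc_words) / len(doc_words)


corpus = [
    "timeline of company history after refund scandal and cancellation of product launch event",
    "refund after cancellation takes 14 business days",
    "shipping timeline for international orders",
]
query = "refund timeline after cancellation"

best = two_stage_search(query, corpus, word_overlap, toy_precise_score, top_k=2, top_n=1)
print(best)
```

Here the first stage ranks the long, diluted chunk first because it shares four words with the query; the second stage promotes the focused chunk that actually answers the question.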

Option 1: Cohere Rerank (Cloud, GDPR-Compliant)

Cohere offers a production-ready reranking API. It is GDPR-compliant, offers EU hosting, and its reranking models are among the best-performing available via API.

import cohere
import os
 
co = cohere.Client(os.getenv("COHERE_API_KEY"))
 
def rerank_with_cohere(query: str, chunks: list[str], top_n: int = 5) -> list[dict]:
    response = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=chunks,
        top_n=top_n
    )
 
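    # Results come back sorted by relevance. result.index is the chunk's
    # position in the input list, i.e. its 0-based rank in the initial retrieval.
    return [
        {
            "chunk": chunks[result.index],
            "score": result.relevance_score,
            "original_rank": result.index
        }
        for result in response.results
    ]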
 
def rag_with_reranker(query: str, vector_store, llm, top_k=50, top_n=5):
    # Step 1: initial retrieval
    initial_results = vector_store.similarity_search(query, k=top_k)
    chunks = [doc.page_content for doc in initial_results]
 
    # Step 2: reranking
    reranked = rerank_with_cohere(query, chunks, top_n=top_n)
 
    # Step 3: generation with the best chunks
    context = "\n\n".join([r["chunk"] for r in reranked])
    return llm.invoke(f"Context:\n{context}\n\nQuestion: {query}")

Advantages: easy to integrate (a few lines of code), highly performant, maintainable without infrastructure. Cohere provides an EU hosting option for sensitive data.

Disadvantages: paid API (usage-based billing, per search rather than per token), dependency on an external provider. If your data is strictly confidential, sending it to a cloud API, even a GDPR-compliant one, may not be acceptable depending on your internal policy or DPO requirements.
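
One way to soften the dependency on an external provider is to fall back to the retriever's original order when the reranking call fails. The wrapper below is a sketch, not part of the Cohere SDK: `rerank_with_fallback` is a hypothetical helper, and it accepts any function with the same `(query, chunks, top_n)` signature as `rerank_with_cohere` above.

```python
from typing import Callable


def rerank_with_fallback(
    query: str,
    chunks: list[str],
    rerank_fn: Callable[[str, list[str], int], list[dict]],
    top_n: int = 5,
) -> list[dict]:
    try:
        return rerank_fn(query, chunks, top_n)
    except Exception:
        # Broad catch for the sake of the sketch; in production, catch the
        # provider's specific error types and log the failure.
        # Fallback: keep the bi-encoder order, with no reranker score.
        return [
            {"chunk": chunk, "score": None, "original_rank": rank}
            for rank, chunk in enumerate(chunks[:top_n])
        ]
```

The answer may be slightly less precise during an outage, but the pipeline keeps responding.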

Option 2: Local Model (Open Source, Sovereign Data)

If you cannot send your documents to an external API, you can host a cross-encoder locally using the sentence-transformers library.

from sentence_transformers import CrossEncoder
 
# Lightweight model, CPU sufficient
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
 
def rerank_local(query: str, chunks: list[str], top_n: int = 5) -> list[dict]:
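    # One [query, chunk] pair per candidate; the model scores each pair jointly.
    pairs = [[query, chunk] for chunk in chunks]
    # Use the scores for ranking only: depending on the model, they may be
    # raw logits rather than values between 0 and 1.
    scores = model.predict(pairs)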
 
    ranked = sorted(
        zip(chunks, scores),
        key=lambda x: x[1],
        reverse=True
    )
 
    return [
        {"chunk": chunk, "score": float(score)}
        for chunk, score in ranked[:top_n]
    ]
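
Depending on the model and the library version, `model.predict` may return unbounded logits rather than scores between 0 and 1. If you need comparable 0-1 scores, for example to apply a threshold, a sigmoid is a common choice. The helper below is a sketch that assumes the `{"chunk", "score"}` dictionaries returned by `rerank_local`; `normalise_scores` is a hypothetical name.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function, maps any real number into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def normalise_scores(results: list[dict]) -> list[dict]:
    """Return copies of the results with scores squashed into (0, 1).

    The sigmoid is monotonic, so the ranking order is unchanged.
    """
    return [{**r, "score": sigmoid(r["score"])} for r in results]
```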

Available models in the sentence-transformers ecosystem:

Model	Size (approx.)	Performance	Infrastructure
`cross-encoder/ms-marco-MiniLM-L-6-v2`	80 MB	Good	CPU sufficient
`cross-encoder/ms-marco-MiniLM-L-12-v2`	130 MB	Very good	CPU sufficient
`cross-encoder/ms-marco-electra-base`	430 MB	Excellent	GPU recommended
`BAAI/bge-reranker-large`	560M parameters (about 2.2 GB in fp32)	Excellent	GPU recommended

Advantages: sovereign data (nothing leaves your infrastructure), zero marginal cost, no external dependency. Lightweight models (MiniLM) run on standard CPU with no special infrastructure required.

Disadvantages: lower performance than the best cloud APIs, especially for highly specific domains. Requires managing model deployment and updates. Slightly higher latency on CPU for large volumes.

Cloud vs Local: How to Choose a Reranker?

Choose Cohere Rerank if your data can transit through an external API and you prioritize performance: Cohere is GDPR-compliant with EU hosting available. Choose a local sentence-transformers model if your data cannot leave your infrastructure under any circumstances, for example medical, legal, or contractual data. Local MiniLM models are sufficient for the vast majority of use cases.

Criterion	Cohere Rerank (cloud)	Local model
Performance	Excellent	Good to very good
Sensitive data	GDPR-compliant, but external API	100% sovereign
Cost	Paid (per search)	Free (own infra)
Integration	Few lines	Few lines
Infrastructure	None	Server with available RAM

The practical rule: if you can send your data to a third-party API (even GDPR-compliant), Cohere is the simplest and most performant choice. If your data cannot leave your infrastructure (medical, legal, sensitive internal data), a local MiniLM model is sufficient for the vast majority of use cases.
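
Expressed as configuration logic, the rule is a single branch. The function name and the returned dictionary shape are hypothetical; the model identifiers are the ones used earlier in this article.

```python
def choose_reranker(data_may_leave_infrastructure: bool) -> dict:
    """Encode the practical rule: cloud if data may leave, local otherwise."""
    if data_may_leave_infrastructure:
        return {"backend": "cohere", "model": "rerank-v3.5"}
    return {"backend": "local", "model": "cross-encoder/ms-marco-MiniLM-L-6-v2"}


# Example: medical records that must stay on-premises.
print(choose_reranker(data_may_leave_infrastructure=False))
```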

When Should You Add a Reranker to a RAG Pipeline?

Add a reranker when your queries are complex, your corpus is dense, or your RAG returns responses that are broadly correct but occasionally miss the best source. As a rule of thumb, a reranker improves retrieval precision by 15 to 30% for a latency overhead of 100 to 300 ms (both figures vary with the corpus and the model), which is acceptable for most enterprise document assistants and internal knowledge bases.

A reranker is particularly useful when:

Your queries are long or complex (multiple intents in a single question)
Your corpus is dense and documents are semantically similar to each other
Your RAG responses are correct but occasionally miss the best source
You have a high top-K (20+ candidate chunks) that you need to reduce to 3-5

A reranker is less useful when:

Your corpus is small and homogeneous (the bi-encoder is sufficient)
Latency is critical and you cannot afford 100-300 ms overhead
Queries are short and factual
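
The most reliable way to decide is to measure on your own data. The sketch below computes two standard retrieval metrics, precision@k and reciprocal rank, on a ranking before and after reranking. The chunk ids and relevance labels are invented for illustration; in practice they would come from a small hand-labelled set of queries.

```python
def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / position of the first relevant result, or 0.0 if there is none."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0


# Hypothetical labelled example for a single query.
relevant = {"c1", "c2"}
before_rerank = ["c3", "c7", "c1", "c9", "c2"]  # bi-encoder order
after_rerank = ["c1", "c2", "c3", "c7", "c9"]   # cross-encoder order

for name, ranking in [("before", before_rerank), ("after", after_rerank)]:
    p3 = precision_at_k(ranking, relevant, k=3)
    rr = reciprocal_rank(ranking, relevant)
    print(f"{name}: precision@3={p3:.2f} reciprocal_rank={rr:.2f}")
```

Averaging these metrics over a few dozen representative queries gives a more trustworthy picture than any generic percentage.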

Latency Impact

Solution	Added latency (top-50 reranked to 5)
Cohere Rerank API	100-200 ms
MiniLM local CPU	200-500 ms
Electra local GPU	50-150 ms

For most use cases (document assistant, enterprise chatbot), this latency is acceptable and the precision gain clearly justifies the overhead.
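
The figures above are indicative, so it is worth timing the reranking step in your own environment. A minimal sketch using only the standard library follows; `timed` and `median_latency_ms` are hypothetical helpers that wrap any callable, such as one of the rerank functions above.

```python
import statistics
import time
from typing import Any, Callable


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> tuple[Any, float]:
    """Call fn and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def median_latency_ms(fn: Callable[[], Any], runs: int = 10) -> float:
    """Median over several runs, less sensitive to one-off spikes than the mean."""
    return statistics.median(timed(fn)[1] for _ in range(runs))
```

For example, `median_latency_ms(lambda: rerank_local(query, chunks), runs=20)` would report the typical overhead for a given query and candidate list.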

TL;DR

A cross-encoder reranker is one of the most cost-effective improvements you can make to a standard RAG pipeline. It fits between retrieval and generation without rethinking your existing architecture.

For non-sensitive or semi-sensitive data: Cohere Rerank is the simplest and most performant solution, with solid GDPR compliance. For strictly confidential data: a local sentence-transformers model on CPU is sufficient in most cases.

Looking to add a reranker to your RAG pipeline or improve the precision of an existing system? Get in touch.

Boosting a RAG with a Cross-Encoder Reranker