Name: Pierre Kasparian - Intégration IA freelance

Most RAG pipelines ignore images. For technical documentation, that is a mistake: data tables, architecture diagrams, annotated screenshots, and comparison matrices often contain information that the surrounding text does not repeat.

Direct answer: the right approach is not to process images with a vision model at every query (too expensive, too slow), but to describe them once in natural language at indexing time, and store those descriptions as ordinary text chunks. Per-query overhead drops from 27-51% to 1-6%.

Two types of images in documentation

Not all images carry the same value in a RAG pipeline. You need to distinguish two kinds:

Illustrative images: they clarify adjacent text (a screenshot showing which button to click), but do not carry independent information. If you skip them, users can still get a correct answer from the text.

Load-bearing images: they contain essential information not redundant with the text. Examples: data tables rendered as images, architecture diagrams, comparison matrices, annotated screenshots. If you skip them, you lose real information.

Testing by kapa.ai on their technical documentation system shows that load-bearing images were cited in 10 to 64% of answers depending on the project, with a statistically significant improvement in answer quality.

Why query-time image processing does not scale

The naive approach: pass raw images to a vision model (GPT-4o, Claude) at every query. Three structural problems:

Cost. Per kapa.ai measurements:

Passing raw images increases GPT query costs by 27% per request
On Claude, the overhead reaches 51% per request

This is a recurring cost on every query, not a one-time cost.
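
To see why a recurring cost dominates a one-time cost, a back-of-the-envelope comparison helps. The sketch below uses the +27% figure quoted above; the base query cost, the per-caption cost, the image count, and the query volumes are hypothetical placeholders rather than measurements, and the small residual overhead of retrieving extra caption chunks is ignored.

```python
def query_time_extra_cost(base_query_cost: float, overhead: float, n_queries: int) -> float:
    """Extra spend from sending raw images with every query (grows with traffic)."""
    return base_query_cost * overhead * n_queries


def index_time_extra_cost(cost_per_caption: float, n_images: int) -> float:
    """Extra spend from captioning each image once (independent of traffic)."""
    return cost_per_caption * n_images


# Hypothetical inputs, for illustration only.
BASE_QUERY_COST = 0.01    # dollars per text-only query (assumed)
OVERHEAD = 0.27           # the +27% per-request figure quoted above
COST_PER_CAPTION = 0.002  # dollars per captioned image (assumed)
N_IMAGES = 5_000          # images in the documentation set (assumed)

for n_queries in (10_000, 100_000, 1_000_000):
    recurring = query_time_extra_cost(BASE_QUERY_COST, OVERHEAD, n_queries)
    one_time = index_time_extra_cost(COST_PER_CAPTION, N_IMAGES)
    print(f"{n_queries:>9} queries: query-time +${recurring:,.2f} vs index-time +${one_time:,.2f}")
```

Whatever the real prices are, the first figure scales with traffic and the second does not.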

Capacity. A typical retrieval returns 20-30 chunks. If those chunks include images, you quickly approach or exceed the context window of most models. You must choose between image coverage and text coverage.
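
The trade-off can be made concrete with a rough token budget. Every number in this sketch (context window, tokens per chunk, tokens per image, answer reserve) is an assumption chosen for illustration; real image token costs depend on the model and the image resolution.

```python
def max_images_in_context(
    context_window: int,
    n_text_chunks: int,
    tokens_per_chunk: int,
    tokens_per_image: int,
    reserved_for_answer: int = 0,
) -> int:
    """How many raw images fit once the text chunks and the answer reserve are counted."""
    remaining = context_window - reserved_for_answer - n_text_chunks * tokens_per_chunk
    return max(0, remaining // tokens_per_image)


# All numbers are hypothetical.
fits = max_images_in_context(
    context_window=32_000,
    n_text_chunks=25,
    tokens_per_chunk=800,
    tokens_per_image=1_000,
    reserved_for_answer=2_000,
)
print(fits)  # prints 10
```

Under these assumptions, 25 text chunks leave room for only 10 images; adding more images means dropping text chunks.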

Retrieval quality. CLIP-style multimodal embeddings "wash out exactly the fine detail that matters in charts, tables, and annotated screenshots." They are good for finding visually similar images, not for capturing textual or numeric content inside diagrams.

The right approach: index-time captioning

Generate a natural language description for each image once, at indexing time. Store the description as a separate text chunk. Retrieve it like any other text.

from openai import OpenAI
import base64
from pathlib import Path
 
client = OpenAI()
 
def describe_image(image_path: str, surrounding_text: str, page_title: str) -> str:
    """
    Generates an image description including surrounding text context.
    Context significantly improves description quality.
    """
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")
    
    # Lowercase the extension so ".JPG" and ".jpg" both map to image/jpeg.
    suffix = Path(image_path).suffix.lstrip(".").lower()
    mime = "image/jpeg" if suffix == "jpg" else f"image/{suffix}"
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # cheap model, sufficient for captioning
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"""You are a technical documentation assistant.
Page: {page_title}
Surrounding text: {surrounding_text[:500]}
 
Describe this image accurately and in a structured way.
If it is a table, transcribe the data in markdown.
If it is a diagram, describe the components and their relationships.
If it is a screenshot, describe the interface and relevant elements."""
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime};base64,{image_data}"}
                    }
                ]
            }
        ],
        max_tokens=500
    )
    
    # message.content can be None (for example on a refusal); return "" in that case.
    return response.choices[0].message.content or ""

Why include context? A screenshot of a settings panel means nothing without knowing which product or what the preceding paragraph explains. Context produces significantly more accurate descriptions.
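
The pipeline later in this article calls a `get_text_around` helper without defining it. A minimal sketch of what such a helper might look like follows, assuming the image position is a character offset into the page text (that representation is an assumption, not something the original specifies).

```python
def text_around(page_text: str, position: int, chars: int = 300) -> str:
    """Return up to `chars` characters on each side of `position`, whitespace-normalized."""
    start = max(0, position - chars)
    end = min(len(page_text), position + chars)
    # Collapse newlines and repeated spaces so the prompt stays compact.
    return " ".join(page_text[start:end].split())


page = "Open the Settings panel.\n\nThe table below lists rate limits per plan."
print(text_around(page, position=26, chars=20))
```

Cutting on character offsets can split a word at either end; snapping to sentence or paragraph boundaries would be a reasonable refinement.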

Why GPT-4o mini? Frontier models are not necessary for captioning. GPT-4o mini produces results nearly equivalent to GPT-4o on this task, at a fraction of the cost. And since this is an indexing cost (one-time per image), pricing matters less than for queries.
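
The cost is only "one-time per image" if re-indexing does not caption the same image again. One way to get that property, sketched here as an assumption rather than something the original pipeline does, is to cache captions keyed by a hash of the image bytes. The function name, the JSON cache file, and the `caption_fn` parameter are all hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_caption(
    image_path: str,
    caption_fn: Callable[[str], str],
    cache_file: str = "caption_cache.json",
) -> str:
    """Return the cached caption for this image's bytes, calling caption_fn only on a miss."""
    digest = hashlib.sha256(Path(image_path).read_bytes()).hexdigest()
    cache_path = Path(cache_file)
    cache = json.loads(cache_path.read_text(encoding="utf-8")) if cache_path.exists() else {}
    if digest not in cache:
        # Cache miss: pay for one captioning call and persist the result.
        cache[digest] = caption_fn(image_path)
        cache_path.write_text(json.dumps(cache), encoding="utf-8")
    return cache[digest]
```

With the function from this article it could be called as `cached_caption(path, lambda p: describe_image(p, surrounding, title))`. Note that the key ignores the surrounding text, so an image reused on two pages gets the caption generated for the first one; include the context in the hash if that matters.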

Step 1: filter junk images

Documentation sites are full of non-informational images: logos, decorative banners, icons, dividers. Passing these to a vision model wastes money and pollutes the index.

def is_informational_image(image_path: str) -> bool:
    """
    Zero-shot classifier to distinguish informative images from decorations.
    """
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")

    # Derive the MIME type from the file extension instead of assuming JPEG.
    suffix = Path(image_path).suffix.lstrip(".").lower()
    mime = "image/jpeg" if suffix == "jpg" else f"image/{suffix}"

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": """Does this image contain useful technical information?
Answer only YES or NO.

Answer YES if: data table, diagram, annotated screenshot, chart, architecture schema.
Answer NO if: logo, icon, decorative illustration, banner, generic photo."""
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime};base64,{image_data}"}
                    }
                ]
            }
        ],
        max_tokens=5
    )

    # Tolerate trailing punctuation or whitespace such as "Yes." in the reply.
    answer = (response.choices[0].message.content or "").strip().upper()
    return answer.startswith("YES")
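
Some decorations can be rejected before spending a classifier call at all. The sketch below is an optional, purely local pre-filter based on file name and file size; it is not part of the original approach, and the threshold and the list of name hints are hypothetical values that would need tuning on a real corpus.

```python
import re
from pathlib import Path

# Hypothetical threshold and name hints; tune them on your own documentation.
MIN_BYTES = 5_000
DECORATIVE_HINTS = frozenset({"logo", "icon", "favicon", "banner", "divider", "avatar"})


def passes_cheap_prefilter(image_path: str) -> bool:
    """Return False for files that are almost certainly decorative, True otherwise."""
    path = Path(image_path)
    # Compare whole name tokens so "silicon-arch.png" is not rejected for containing "icon".
    tokens = set(re.split(r"[^a-z0-9]+", path.stem.lower()))
    if tokens & DECORATIVE_HINTS:
        return False
    # Very small files are usually icons or spacers.
    return path.stat().st_size >= MIN_BYTES
```

Images that pass would still go through `is_informational_image`; the pre-filter only avoids paying for the obvious cases.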

Step 2: store descriptions as independent chunks

Descriptions should not be inlined into the parent text chunk. They must be stored as independent chunks with metadata linking back to the source page.

from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct
import uuid
 
def index_image_description(
    description: str,
    source_page: str,
    image_path: str,
    position_in_page: int,
    embedder
) -> str:
    """
    Indexes an image description as an independent chunk.
    Returns the ID of the created chunk.
    """
    chunk_id = str(uuid.uuid4())
    embedding = embedder.embed(description)
    
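    # Named `qdrant` so it does not shadow the module-level OpenAI `client`.
    # In a real pipeline, create this once and reuse it across calls.
    qdrant = QdrantClient(url="http://localhost:6333")
    qdrant.upsert(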
        collection_name="docs",
        points=[
            PointStruct(
                id=chunk_id,
                vector=embedding,
                payload={
                    "content": description,
                    "content_type": "image_description",
                    "source_page": source_page,
                    "image_path": image_path,
                    "position": position_in_page,
                }
            )
        ]
    )
    
    return chunk_id

Storing descriptions as separate chunks lets the retriever surface them on their own merit, even if the surrounding text was not retrieved.
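
On the query side, the retrieved payloads then need to be turned into prompt context. Here is a minimal sketch of that step, assuming the payload keys used in the indexing function above (`content`, `content_type`, `source_page`, `image_path`); the shape of the text-chunk payloads and the `format_context` name are assumptions.

```python
from typing import Any


def format_context(payloads: list[dict[str, Any]]) -> str:
    """Render retrieved payloads as one prompt context, labeling image descriptions."""
    blocks = []
    for i, payload in enumerate(payloads, start=1):
        if payload.get("content_type") == "image_description":
            header = f"[{i}] Image description from {payload['source_page']} ({payload['image_path']})"
        else:
            header = f"[{i}] Text from {payload.get('source_page', 'unknown source')}"
        blocks.append(f"{header}\n{payload['content']}")
    return "\n\n".join(blocks)


retrieved = [
    {"content": "Rate limits depend on the plan.", "content_type": "text", "source_page": "/docs/limits"},
    {
        "content": "| Plan | Requests/min |\n| Free | 60 |",
        "content_type": "image_description",
        "source_page": "/docs/limits",
        "image_path": "img/limits-table.png",
    },
]
print(format_context(retrieved))
```

Labeling the block as an image description, with its path, lets the answering model cite the image itself rather than the surrounding page text.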

Full indexing pipeline

def index_document_with_images(doc_path: str, embedder):
    """
    Full pipeline: parse document, filter images, generate descriptions,
    index text and images separately.
    """
    doc = parse_document(doc_path)  # returns text + images with positions
    
    # Standard text indexing
    for chunk in chunk_text(doc.text):
        embedding = embedder.embed(chunk.content)
        index_text_chunk(embedding, chunk)
    
    # Image indexing
    for image in doc.images:
        # Step 1: filter decorative images
        if not is_informational_image(image.path):
            continue
        
        # Step 2: generate description with context
        surrounding = doc.get_text_around(image.position, chars=300)
        description = describe_image(image.path, surrounding, doc.title)
        
        # Step 3: index as independent chunk
        index_image_description(
            description=description,
            source_page=doc.url,
            image_path=image.path,
            position_in_page=image.position,
            embedder=embedder
        )

Measured results

Across three different technical documentation projects, kapa.ai measured:

Metric	Query-time processing	Index-time captioning
Per-query overhead (GPT)	+27%	+1-2%
Per-query overhead (Claude)	+51%	+1-2%
Images cited in answers	N/A	10-64%
Answer quality improvement	N/A	Significant
Added latency	+300-800 ms	0 ms

TL;DR

To integrate images in a technical documentation RAG pipeline: caption at indexing time, not at query time. Filter decorative images before captioning. Include surrounding text context in the description prompt. Store descriptions as independent chunks.

GPT-4o mini is sufficient for captioning, and because captioning is a one-time cost per image rather than a per-query cost, captioning spend does not grow with query volume.


Indexing images in a RAG pipeline: production guide