Indexing images in a RAG pipeline: production guide
June 5, 2026 · 7 min read · Guides
AI Engineer — UTT 4th year · LLM, RAG & GDPR compliance specialist · 15+ client projects
Most RAG pipelines ignore images. This is a mistake for technical documentation: data tables, architecture diagrams, annotated screenshots, comparison matrices. These elements contain information that the surrounding text does not repeat.
Direct answer: the right approach is not to process images with a vision model at every query (too expensive, too slow), but to describe them once in natural language at indexing time, and store those descriptions as ordinary text chunks. Per-query overhead drops from 27-51% to 1-6%.
Two types of images in documentation
Not all images carry the same value in a RAG pipeline. Two kinds need to be distinguished:
Illustrative images: they clarify adjacent text (a screenshot showing which button to click), but do not carry independent information. If you skip them, users can still get a correct answer from the text.
Load-bearing images: they contain essential information not redundant with the text. Examples: data tables rendered as images, architecture diagrams, comparison matrices, annotated screenshots. If you skip them, you lose real information.
Testing by kapa.ai on their technical documentation system shows that load-bearing images were cited in 10 to 64% of answers depending on the project, with a statistically significant improvement in answer quality.
Why query-time image processing does not scale
The naive approach: pass raw images to a vision model (GPT-4o, Claude) at every query. Three structural problems:
Cost. Per kapa.ai measurements:
- Passing raw images increases GPT query costs by 27% per request
- On Claude, the overhead reaches 51% per request
This is a recurring cost on every query, not a one-time cost.
Capacity. A typical retrieval returns 20-30 chunks. If those chunks include images, you quickly approach or exceed the context window of most models. You must choose between image coverage and text coverage.
Retrieval quality. CLIP-style multimodal embeddings "wash out exactly the fine detail that matters in charts, tables, and annotated screenshots." They are good for finding visually similar images, not for capturing textual or numeric content inside diagrams.
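To make the cost argument concrete, the sketch below compares a recurring per-query overhead with a one-time captioning cost. The 27% and 51% overheads are the kapa.ai figures quoted above; the base query cost, image count, and per-caption price are placeholder assumptions, not measurements, and the helper names are hypothetical.

```python
def query_time_extra_cost(base_cost_per_query: float, overhead: float, n_queries: int) -> float:
    """Extra spend from sending raw images with every query (recurring)."""
    return base_cost_per_query * overhead * n_queries


def index_time_cost(n_images: int, cost_per_caption: float) -> float:
    """One-time spend for captioning every image at indexing time."""
    return n_images * cost_per_caption


def break_even_queries(
    base_cost_per_query: float,
    overhead: float,
    n_images: int,
    cost_per_caption: float,
) -> float:
    """Number of queries after which captioning once becomes the cheaper option."""
    return index_time_cost(n_images, cost_per_caption) / (base_cost_per_query * overhead)


# Placeholder assumptions (not from the source): $0.01 per query,
# 2,000 images to caption, $0.001 per caption.
BASE, IMAGES, CAPTION = 0.01, 2_000, 0.001

for name, overhead in [("GPT", 0.27), ("Claude", 0.51)]:
    n = break_even_queries(BASE, overhead, IMAGES, CAPTION)
    print(f"{name}: captioning pays for itself after ~{n:.0f} queries")
```

The comparison is deliberately simplified: it ignores the small residual per-query overhead of the extra text chunks, so the real break-even point would sit slightly later.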
The right approach: index-time captioning
Generate a natural language description for each image once, at indexing time. Store the description as a separate text chunk. Retrieve it like any other text.
```python
import base64
import mimetypes

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def describe_image(image_path: str, surrounding_text: str, page_title: str) -> str:
    """
    Generates an image description including surrounding text context.
    Context significantly improves description quality.
    """
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")

    # Derive the MIME type from the file extension (e.g. .jpg -> image/jpeg)
    mime = mimetypes.guess_type(image_path)[0] or "image/png"

    prompt = f"""You are a technical documentation assistant.
Page: {page_title}
Surrounding text: {surrounding_text[:500]}
Describe this image accurately and in a structured way.
If it is a table, transcribe the data in markdown.
If it is a diagram, describe the components and their relationships.
If it is a screenshot, describe the interface and relevant elements."""

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # cheap model, sufficient for captioning
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime};base64,{image_data}"},
                    },
                ],
            }
        ],
        max_tokens=500,
    )
    # content can be None in edge cases; fall back to an empty string
    return response.choices[0].message.content or ""
```

Why include context? A screenshot of a settings panel means little without knowing which product it belongs to or what the preceding paragraph explains. Context produces significantly more accurate descriptions.
Why GPT-4o mini? Frontier models are not necessary for captioning. GPT-4o mini produces results nearly equivalent to GPT-4o on this task, at a fraction of the cost. And since this is an indexing cost (one-time per image), pricing matters less than for queries.
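Since captioning is meant to be a one-time cost per image, one way to keep it that way across re-indexing runs is to cache descriptions by a hash of the image bytes. This is a minimal sketch and not part of the original pipeline: `caption_cache.json`, `image_digest`, `cached_description`, and the `describe` callable are hypothetical names. In practice `describe` would wrap `describe_image` from above.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def image_digest(image_path: str) -> str:
    """SHA-256 of the raw image bytes; identical files share a digest."""
    return hashlib.sha256(Path(image_path).read_bytes()).hexdigest()


def cached_description(
    image_path: str,
    describe: Callable[[str], str],
    cache_file: str = "caption_cache.json",  # hypothetical location
) -> str:
    """Return a stored description if this exact image was captioned before."""
    cache_path = Path(cache_file)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}

    key = image_digest(image_path)
    if key not in cache:
        cache[key] = describe(image_path)  # the only call that costs money
        cache_path.write_text(json.dumps(cache, indent=2))
    return cache[key]
```

One caveat with this design: the description also depends on the surrounding text, so if a page is rewritten around an unchanged image, the cached caption may be stale. Including a hash of the context in the key would be one way to handle that.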
Step 1: filter junk images
Documentation sites are full of non-informational images: logos, decorative banners, icons, dividers. Passing these to a vision model wastes money and pollutes the index.
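Before spending a vision call on classification, a cheap local pre-filter can discard the most obvious decorations. The sketch below is an assumption rather than part of the original pipeline: `is_obviously_decorative`, the filename hints, and the 2 KB size threshold are illustrative and would need tuning on a real documentation set.

```python
import re
from pathlib import Path

# Illustrative values, not from the source: tune on your own corpus.
DECORATIVE_NAME = re.compile(
    r"(?:^|[-_ ])(?:logo|icon|favicon|banner|divider|avatar)s?(?:$|[-_ \d])",
    re.IGNORECASE,
)
MIN_BYTES = 2_048


def is_obviously_decorative(image_path: str) -> bool:
    """Heuristic check that needs no model call.

    Returns True when the file name hints at decoration or the file is
    too small to plausibly hold a table, chart, or diagram.
    """
    path = Path(image_path)
    if DECORATIVE_NAME.search(path.stem):
        return True
    return path.stat().st_size < MIN_BYTES
```

Images that survive this check would then go to the model-based classifier. The size rule is a blunt instrument (a compact SVG diagram can be small and still informative), so it is worth checking what it rejects before relying on it.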
```python
import mimetypes


def is_informational_image(image_path: str) -> bool:
    """
    Zero-shot classifier to distinguish informative images from decorations.
    """
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")

    # Use the file's real MIME type rather than assuming every image is a JPEG
    mime = mimetypes.guess_type(image_path)[0] or "image/png"

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": """Does this image contain useful technical information?
Answer only YES or NO.
Answer YES if: data table, diagram, annotated screenshot, chart, architecture schema.
Answer NO if: logo, icon, decorative illustration, banner, generic photo.""",
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime};base64,{image_data}"},
                    },
                ],
            }
        ],
        max_tokens=5,
    )
    # Tolerate trailing punctuation such as "YES." in the model's answer
    answer = (response.choices[0].message.content or "").strip().upper()
    return answer.startswith("YES")
```

Step 2: store descriptions as independent chunks
Descriptions should not be inlined into the parent text chunk. They must be stored as independent chunks with metadata linking back to the source page.
```python
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

# Created once and reused; named `qdrant` to avoid confusion with the OpenAI `client`
qdrant = QdrantClient(url="http://localhost:6333")


def index_image_description(
    description: str,
    source_page: str,
    image_path: str,
    position_in_page: int,
    embedder,  # any object exposing embed(text) and returning a vector
) -> str:
    """
    Indexes an image description as an independent chunk.
    Returns the ID of the created chunk.
    """
    chunk_id = str(uuid.uuid4())
    embedding = embedder.embed(description)

    qdrant.upsert(
        collection_name="docs",  # the collection must already exist
        points=[
            PointStruct(
                id=chunk_id,
                vector=embedding,
                payload={
                    "content": description,
                    "content_type": "image_description",
                    "source_page": source_page,
                    "image_path": image_path,
                    "position": position_in_page,
                },
            )
        ],
    )
    return chunk_id
```

Storing descriptions as separate chunks lets the retriever surface them on their own merit, even if the surrounding text was not retrieved.
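At query time, image descriptions come back from the vector store like any other hit and can be labeled when the prompt context is assembled. The sketch below assumes each hit is a dict carrying the payload fields used in the indexing code above (`content`, `content_type`, `source_page`). The `build_context` helper, the label format, the character budget, and the sample hits are all hypothetical.

```python
def build_context(hits: list[dict], max_chars: int = 6000) -> str:
    """Concatenate retrieved payloads, marking image descriptions as such."""
    parts: list[str] = []
    used = 0
    for hit in hits:  # hits are assumed to be sorted by relevance
        label = (
            "[Image description]"
            if hit.get("content_type") == "image_description"
            else "[Text]"
        )
        block = f"{label} (source: {hit['source_page']})\n{hit['content']}"
        if used + len(block) > max_chars:
            break  # stop once the character budget is exhausted
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)


# Made-up hits for illustration only
hits = [
    {"content": "Set the timeout in config.yaml.", "content_type": "text",
     "source_page": "/docs/config"},
    {"content": "| Plan | Rate limit |\n|---|---|\n| Free | 10 rps |",
     "content_type": "image_description", "source_page": "/docs/limits"},
]
print(build_context(hits))
```

Labeling the chunk type is optional, but it lets the answering model know that a passage paraphrases a figure rather than quoting the page text.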
Full indexing pipeline
```python
def index_document_with_images(doc_path: str, embedder):
    """
    Full pipeline: parse document, filter images, generate descriptions,
    index text and images separately.
    """
    # parse_document, chunk_text and index_text_chunk are project-specific
    # helpers that are not shown in this article.
    doc = parse_document(doc_path)  # returns text + images with positions

    # Standard text indexing
    for chunk in chunk_text(doc.text):
        embedding = embedder.embed(chunk.content)
        index_text_chunk(embedding, chunk)

    # Image indexing
    for image in doc.images:
        # Step 1: filter decorative images
        if not is_informational_image(image.path):
            continue

        # Step 2: generate description with context
        surrounding = doc.get_text_around(image.position, chars=300)
        description = describe_image(image.path, surrounding, doc.title)

        # Step 3: index as independent chunk
        index_image_description(
            description=description,
            source_page=doc.url,
            image_path=image.path,
            position_in_page=image.position,
            embedder=embedder,
        )
```

Measured results
Across three different technical documentation projects, kapa.ai measured:
| Metric | Query-time processing | Index-time captioning |
|---|---|---|
| Per-query overhead (GPT) | +27% | +1-2% |
| Per-query overhead (Claude) | +51% | +1-2% |
| Images cited in answers | N/A | 10-64% |
| Answer quality improvement | N/A | Significant |
| Added latency | +300-800 ms | 0 ms |
TL;DR
To integrate images in a technical documentation RAG: caption at indexing time, not at query time. Filter decorative images before captioning. Include surrounding text context in the description prompt. Store descriptions as independent chunks.
GPT-4o mini is sufficient for captioning. Because captioning is a one-time cost per image rather than a cost per query, it does not grow with query volume.
About the author
Pierre Kasparian is a 4th-year engineering student at UTT (University of Technology of Troyes) and an AI integration freelancer. He deploys LLMs, RAG pipelines, and AI agents for French and European companies, with strong expertise in GDPR compliance and European hosting. 15+ client projects, including Pretto and LiveSession.