Name: Pierre Kasparian - Intégration IA freelance
Rating: 5

At Pretto, dozens of banks send weekly emails updating their mortgage rates. Processing these emails manually to keep the rate database current took hours. I automated the process with a four-step Python ETL pipeline, using an LLM for structured extraction and the internal batch service to handle the volume.

The short version: the pipeline cleans raw email content, filters out useless images (logos, signature banners), enriches the prompt with rates already in the database, and sends everything to the LLM in batch to extract changes as structured data.

What problem does this solve?

Pretto is a mortgage brokerage. Each partner bank regularly sends rate-change notifications by email: fixed rates over 15, 20, or 25 years, eligibility conditions, sometimes full rate grids.

These emails arrive in bulk, in heterogeneous formats, and rarely contain any machine-readable structure. A typical email looks like this:

"Following our latest decisions, we would like to inform you of a rate revision effective Monday 3 June. For a 20-year loan, the nominal rate moves to 3.45% (previously 3.55%)..."

Someone had to read each email, extract the new rates, compare them against the previous ones, and update the database. With ten or so banks sending several updates a week, this repetitive task pulled qualified analysts away from more meaningful work.

LLM-based extraction is well suited here: the input is free-form, but the output schema is defined (which rate, for which duration, effective date). That is exactly the kind of task LLMs handle well with a structured prompt and an output schema.

Architecture in four steps

The pipeline follows this sequential flow:

1. Formatting: clean raw email content (HTML, encoding, metadata)
2. Image filtering: detect and exclude logos and signature banners
3. Enrichment: inject current rates from the database into the prompt
4. Batch inference: structured extraction via the internal batch service

Each step makes the email more useful to the model. Without this preprocessing, the context sent to the LLM would be noisy and expensive in tokens, and the extractions would be less reliable.

Step 1: how to clean raw email content?

Received emails are multipart MIME messages. The useful text is often buried inside rich HTML: style tags, Outlook conditional comments, invisible tracking pixels.

This cleaning was done in SQL via dbt transformations. The data engineering goal was to push as many transformations as possible into SQL, both to avoid maintaining DAGs and to keep a clean data lineage.
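The production cleaning lived in dbt SQL and is not reproduced here. For readers who want to see the logic spelled out, here is a minimal Python sketch of the same idea using only the standard library. It is a substitute illustration, not the pipeline's code: the names `VisibleTextExtractor`, `html_to_text`, and `extract_body_text`, the tag sets, and the demo message are all assumptions.

```python
import re
from email import message_from_bytes, policy
from email.message import EmailMessage
from html.parser import HTMLParser

SKIPPED_TAGS = {"head", "style", "script"}  # content is never visible text
BLOCK_TAGS = {"p", "div", "br", "tr", "td", "th", "li", "table"}


class VisibleTextExtractor(HTMLParser):
    """Collects visible text. Comments (including Outlook conditional
    comments) are dropped because handle_comment is not overridden."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.chunks.append(" ")  # keep words from adjacent blocks apart

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()


def extract_body_text(raw_email: bytes) -> str:
    """Pick the HTML part of a MIME message (plain text as fallback)."""
    msg = message_from_bytes(raw_email, policy=policy.default)
    body = msg.get_body(preferencelist=("html", "plain"))
    if body is None:
        return ""
    content = body.get_content()
    if body.get_content_type() == "text/html":
        return html_to_text(content)
    return re.sub(r"\s+", " ", content).strip()


# Hypothetical demo message: style block, Outlook comment, tracking pixel.
demo = EmailMessage()
demo["Subject"] = "Rate revision"
demo.set_content("Plain-text fallback")
demo.add_alternative(
    "<html><head><style>p{color:red}</style></head><body>"
    "<!--[if mso]><p>outlook only</p><![endif]-->"
    "<p>Rate for 20 years: <b>3.45%</b></p>"
    '<img src="https://tracker.invalid/px.gif" width="1" height="1">'
    "<p>Effective Monday 3 June.</p></body></html>",
    subtype="html",
)
cleaned = extract_body_text(demo.as_bytes())
print(cleaned)
```

The style rules, the conditional comment, and the 1x1 pixel all disappear, leaving only the sentence content the model needs.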

Step 2: why filter email images before calling the LLM?

This step had the biggest impact on input quality. Banker emails systematically contain signature logos, commercial banners, and sometimes legal stamps. Encoding these as base64 and including them in the context wastes tokens and can confuse the extraction.

The filtering strategy uses three criteria:

The filename of the image attachment
The MIME type of the file
The dimensions

A signature logo typically matches at least two of these three criteria. The most robust signal is dimensions: logos are short in height or very elongated horizontally. A rate table sent as an image (a rare case) is larger and more square, so it passes the filter. This filtering was also done in dbt.
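The two-out-of-three rule can be sketched in a few lines. The real filter ran in dbt, so this Python version is only an illustration of the decision logic. Every constant below (filename keywords, MIME types, pixel thresholds) is a placeholder assumption, not a value from the actual pipeline.

```python
from dataclasses import dataclass

# All of these values are illustrative assumptions.
LOGO_NAME_HINTS = ("logo", "signature", "banner", "image00")
LOGO_MIME_TYPES = {"image/gif", "image/svg+xml"}
MAX_LOGO_HEIGHT_PX = 120
MIN_LOGO_ASPECT_RATIO = 3.0  # width / height


@dataclass(frozen=True)
class ImageAttachment:
    filename: str
    mime_type: str
    width: int
    height: int


def logo_signals(img: ImageAttachment) -> int:
    """Count how many of the three criteria point to a decorative image."""
    name_hit = any(hint in img.filename.lower() for hint in LOGO_NAME_HINTS)
    mime_hit = img.mime_type.lower() in LOGO_MIME_TYPES
    # Short images, or very wide ones, look like logos and banners.
    size_hit = (
        img.height <= MAX_LOGO_HEIGHT_PX
        or img.width / img.height >= MIN_LOGO_ASPECT_RATIO
    )
    return sum((name_hit, mime_hit, size_hit))


def is_decorative(img: ImageAttachment) -> bool:
    """Exclude an image when at least two of the three criteria match."""
    return logo_signals(img) >= 2


logo = ImageAttachment("logo_bank.png", "image/png", 300, 60)
rate_grid = ImageAttachment("grille_taux.png", "image/png", 900, 700)
kept = [img.filename for img in (logo, rate_grid) if not is_decorative(img)]
print(kept)
```

Requiring two signals rather than one keeps a large, roughly square rate table in the context even if its filename happens to contain a logo-like keyword.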

Step 3: how to enrich the prompt with internal data?

Before sending the email to the LLM, we inject the rates currently stored in the database for the sending bank into the prompt.

Why? The LLM needs to distinguish two cases: an update to an existing rate and the announcement of a new product not yet in the database. Without context, it cannot know what was there before. More critically, some emails express variations relatively ("a 10-basis-point cut") rather than absolutely. With current rates injected, the LLM can compute the final value.
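As a rough illustration, the enrichment step might look like the sketch below. The prompt wording, the `StoredRate` type, and the function names are assumptions made for this example, and the bank name is a placeholder. `apply_basis_points` only shows the arithmetic the model is expected to perform on a relative change (1 basis point is 0.01 percentage point), and it could also serve as an after-the-fact sanity check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredRate:
    duration_years: int
    rate_percent: float


def format_current_rates(bank: str, rates: list[StoredRate]) -> str:
    """Render the rates on file as a short block the model can refer to."""
    lines = [f"Current rates on file for {bank}:"]
    for rate in sorted(rates, key=lambda r: r.duration_years):
        lines.append(f"- {rate.duration_years} years: {rate.rate_percent:.2f}%")
    return "\n".join(lines)


def build_prompt(bank: str, rates: list[StoredRate], email_text: str) -> str:
    # Hypothetical instruction wording, not the prompt used in production.
    instructions = (
        "Extract every rate change from the email below. "
        "Mark a change as 'update' if the duration already has a rate on file, "
        "otherwise as 'new'. Resolve relative changes against the rates on file."
    )
    return "\n\n".join(
        [instructions, format_current_rates(bank, rates), "Email:\n" + email_text]
    )


def apply_basis_points(rate_percent: float, delta_bp: float) -> float:
    """Apply a change in basis points: 1 bp = 0.01 percentage point."""
    return round(rate_percent + delta_bp / 100, 4)


on_file = [StoredRate(25, 3.65), StoredRate(20, 3.55)]
prompt = build_prompt("Example Bank", on_file, "Our 20-year rate is cut by 10 bp.")
print(prompt)
print(apply_basis_points(3.55, -10))
```

With 3.55% on file for 20 years, a 10-basis-point cut resolves to 3.45%, which matches the sample email quoted earlier.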

The prompt was refined using Langfuse, with the goal of improving extraction quality by selecting the best prompt and LLM combination.

Step 4: batch inference and Pydantic validation

Once the emails are prepared, inference is handed off to the internal batch service. The call submits a batch of jobs, polls until completion, and retrieves the outputs. Provider batch APIs (OpenAI's and Anthropic's, for example) are priced roughly 50% below synchronous calls, which is a meaningful saving on weekly volumes.
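The internal batch service is not public, so the submit, poll, retrieve loop can only be sketched against a hypothetical interface. In the example below, `BatchClient`, its three methods, and the status strings are assumptions, and `FakeBatchClient` is an in-memory stand-in that exists only to make the sketch runnable.

```python
import time
from typing import Callable, Protocol


class BatchClient(Protocol):
    """Hypothetical interface; the real internal service may differ."""

    def submit(self, prompts: list[str]) -> str: ...
    def status(self, batch_id: str) -> str: ...
    def results(self, batch_id: str) -> list[str]: ...


def run_batch(
    client: BatchClient,
    prompts: list[str],
    poll_interval_s: float = 30.0,
    timeout_s: float = 24 * 3600,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Submit a batch, poll until it finishes, then return the outputs."""
    batch_id = client.submit(prompts)
    deadline = time.monotonic() + timeout_s
    while True:
        state = client.status(batch_id)
        if state == "completed":
            return client.results(batch_id)
        if state == "failed":
            raise RuntimeError(f"batch {batch_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"batch {batch_id} still running at timeout")
        sleep(poll_interval_s)


class FakeBatchClient:
    """In-memory stand-in: reports 'in_progress' a few times, then completes."""

    def __init__(self, polls_before_done: int = 2) -> None:
        self._remaining = polls_before_done
        self._prompts: list[str] = []

    def submit(self, prompts: list[str]) -> str:
        self._prompts = list(prompts)
        return "batch-1"

    def status(self, batch_id: str) -> str:
        if self._remaining > 0:
            self._remaining -= 1
            return "in_progress"
        return "completed"

    def results(self, batch_id: str) -> list[str]:
        return [f"output for: {p}" for p in self._prompts]


outputs = run_batch(
    FakeBatchClient(), ["email A", "email B"], poll_interval_s=0, sleep=lambda _: None
)
print(outputs)
```

Passing `sleep` as a parameter keeps the polling loop testable without waiting, and the timeout prevents a stuck batch from blocking the run indefinitely.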

The LLM output is validated by Pydantic before any write to the database:

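from datetime import date
from typing import Literal

from pydantic import BaseModel


class RateChange(BaseModel):
    duration: int  # loan term in years, e.g. 15, 20, 25
    new_rate: float  # nominal rate in percent, e.g. 3.45
    effective_date: date | None  # null when the email gives no date
    change_type: Literal["update", "new"]  # existing rate vs. new product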

If the returned JSON is invalid (format hallucination, missing field), Pydantic raises a validation error. The error is logged and nothing is written to the database. This is non-negotiable: a silently misextracted financial figure is worse than missing data.

Key takeaways

Image filtering is not optional. Without it, base64-encoded logos inflate the context and degrade extraction quality. It is a prerequisite, not a nice-to-have optimization.

Contextual enrichment changes the equation. Injecting current rates into the prompt significantly reduces errors on emails expressing relative variations. Without that context, the LLM can only guess the absolute value when the email says "a 10 bp cut."

Deferred batch processing is the right approach here. Rate extraction is not real-time-critical. Accumulating emails throughout the day and processing them in an evening batch run costs less, is simpler to operate, and makes the results available the next morning.

The same pattern applies in other contexts: extracting information from supplier emails, processing orders received by email, classifying incoming support tickets. The email preprocessing is always the same; only the prompt and the Pydantic schema change depending on the business domain.

If your company processes large volumes of email and needs to extract structured data from them, let's talk.
