
2026
Email Processing Pipeline - Pretto
Automatic analysis of emails exchanged between brokers and banks to extract up-to-date mortgage rates. Internal tools are then updated automatically.
Development of an automated pipeline to process emails exchanged between mortgage brokers and bank representatives. End goal: automatically extract key business information (particularly rate changes) from these emails using an LLM as the analysis engine. The pipeline consists of several steps:
1. LLM input formatting: cleaning and structuring raw email content.
2. Image filtering: automatic detection and exclusion of logo attachments, irrelevant to the analysis.
3. Enrichment: aggregation with existing internal company data before sending to the LLM.
4. Inference: use of the previously built batch service to handle production volumes.
Detailed case study
At Pretto, dozens of banks send weekly emails updating their mortgage rates. Processing these emails manually to keep the rate database current took hours. I automated the process with a four-step Python ETL pipeline, using an LLM for structured extraction and the internal batch service to handle the volume.
The short version: the pipeline cleans raw email content, filters out useless images (logos, signature banners), enriches the prompt with rates already in the database, and sends everything to the LLM in batch to extract changes as structured data.
What problem does this solve?
Pretto is a mortgage brokerage. Each partner bank regularly sends rate-change notifications by email: fixed rates over 15, 20, or 25 years, eligibility conditions, sometimes full rate grids.
These emails arrive in bulk, in heterogeneous formats, and rarely contain any machine-readable structure. A typical email looks like this:
"Following our latest decisions, we would like to inform you of a rate revision effective Monday 3 June. For a 20-year loan, the nominal rate moves to 3.45% (previously 3.55%)..."
Someone had to read, extract the new rates, compare them against the previous ones, and update the database. For ten or so banks with several updates a week, that is a repetitive task pulling qualified analysis time away from more meaningful work.
LLM-based extraction is well suited here: the input is free-form, but the output schema is defined (which rate, for which duration, effective date). That is exactly the kind of task LLMs handle well with a structured prompt and an output schema.
Architecture in four steps
The pipeline follows this sequential flow:
- Formatting: clean raw email content (HTML, encoding, metadata)
- Image filtering: detect and exclude logos and signature banners
- Enrichment: inject current rates from the database into the prompt
- Batch inference: structured extraction via the internal batch service
Each step makes the email more useful to the model. Without this preprocessing, the context sent to the LLM would be noisy, expensive in tokens, and extractions less reliable.
Step 1: how to clean raw email content?
Received emails are multipart MIME messages. The useful text is often buried inside rich HTML: style tags, Outlook conditional comments, invisible tracking pixels.
This cleaning was done in SQL via dbt transformations. The data engineering goal was to push as many transformations as possible into SQL, to avoid maintaining DAGs and to keep a clean data lineage.
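The production cleaning lives in SQL models that are not reproduced here. Purely to illustrate the kind of cleanup described, the following is a minimal Python sketch using only the standard library. The function names (`html_to_text`, `extract_body`) and the lists of skipped and block-level tags are assumptions made for this example, not the project's actual code.

```python
import re
from email import message_from_string, policy
from html.parser import HTMLParser

# Tags whose content is never visible text (assumed list, for illustration).
_SKIPPED = {"style", "script", "head"}
# Block-level tags: a space is emitted around them so adjacent cells do not fuse.
_BLOCK = {"p", "div", "br", "tr", "td", "th", "li", "table"}


class _TextExtractor(HTMLParser):
    """Collects visible text. HTML comments, including Outlook conditional
    comments, are dropped because handle_comment is not overridden."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in _SKIPPED:
            self._skip_depth += 1
        elif tag in _BLOCK:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in _SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in _BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Strip markup and collapse whitespace runs into single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


def extract_body(raw_email: str) -> str:
    """Return the cleaned text of the main body of a raw MIME message."""
    message = message_from_string(raw_email, policy=policy.default)
    body = message.get_body(preferencelist=("html", "plain"))
    if body is None:
        return ""
    content = body.get_content()
    if body.get_content_type() == "text/html":
        return html_to_text(content)
    return re.sub(r"\s+", " ", content).strip()
```

Tracking pixels disappear on their own in this sketch, since an `<img>` tag carries no text. Image attachments are a separate concern, handled in the next step.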
Step 2: why filter email images before calling the LLM?
This step had the biggest impact on input quality. Banker emails systematically contain signature logos, commercial banners, and sometimes legal stamps. Encoding these as base64 and including them in the context wastes tokens and can confuse the extraction.
The filtering strategy uses three criteria:
- The filename of the image attachment
- The MIME type of the file
- The dimensions
A signature logo typically matches at least two of these three criteria. The most robust signal is dimensions: logos are short in height or very elongated horizontally. A rate table sent as an image (a rare case) is larger and more square, so it passes the filter. This filtering was also done in dbt.
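The two-out-of-three rule can be sketched in Python as follows. This is an illustration of the heuristic, not the dbt model itself: the filename pattern, the set of MIME types treated as suspicious, and both dimension thresholds are hypothetical values chosen for the example.

```python
import re
from dataclasses import dataclass

# Every constant below is an illustrative assumption, not a production value.
LOGO_NAME_PATTERN = re.compile(r"(logo|signature|banner|image\d{3})", re.IGNORECASE)
LOGO_MIME_TYPES = {"image/gif", "image/svg+xml"}
MAX_LOGO_HEIGHT_PX = 120   # "short in height"
MIN_BANNER_RATIO = 4.0     # "very elongated horizontally" (width / height)


@dataclass(frozen=True)
class ImageAttachment:
    filename: str
    mime_type: str
    width: int
    height: int


def logo_signals(image: ImageAttachment) -> int:
    """Count how many of the three criteria point to a logo or banner."""
    name_hit = bool(LOGO_NAME_PATTERN.search(image.filename))
    mime_hit = image.mime_type.lower() in LOGO_MIME_TYPES
    size_hit = (
        image.height <= MAX_LOGO_HEIGHT_PX
        or image.width / image.height >= MIN_BANNER_RATIO
    )
    return sum((name_hit, mime_hit, size_hit))


def is_probable_logo(image: ImageAttachment) -> bool:
    """Exclude the image when at least two of the three criteria match."""
    return logo_signals(image) >= 2


attachments = [
    ImageAttachment("logo_bank.gif", "image/gif", 200, 60),
    ImageAttachment("rate_grid.png", "image/png", 900, 700),
]
kept = [a for a in attachments if not is_probable_logo(a)]
```

Requiring two signals rather than one keeps a large image with an unlucky filename or format in the pipeline, which matters for the rare rate table sent as an image.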
Step 3: how to enrich the prompt with internal data?
Before sending the email to the LLM, we inject the rates currently stored in the database for the sending bank into the prompt.
Why? The LLM needs to distinguish two cases: an update to an existing rate and the announcement of a new product not yet in the database. Without context, it cannot know what was there before. More critically, some emails express variations relatively ("a 10-basis-point cut") rather than absolutely. With current rates injected, the LLM can compute the final value.
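A minimal sketch of this enrichment, under stated assumptions: the `StoredRate` structure, the prompt wording, and the rule used to tell "update" from "new" are all hypothetical, and the real prompt is not shown in this case study. The helper `apply_basis_points` only spells out the arithmetic the model is expected to perform on a relative variation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredRate:
    """One rate currently in the database (hypothetical shape)."""
    duration_years: int
    rate_percent: float


def build_prompt(bank: str, current_rates: list[StoredRate], email_text: str) -> str:
    """Prepend the bank's stored rates to the cleaned email text."""
    rate_lines = "\n".join(
        f"- {r.duration_years} years: {r.rate_percent:.2f}%"
        for r in sorted(current_rates, key=lambda r: r.duration_years)
    ) or "- (no rate stored for this bank)"
    return (
        f"Rates currently stored for {bank}:\n{rate_lines}\n\n"
        "Extract every rate change announced in the email below. "
        'Mark a change as "update" when the duration already has a stored rate, '
        'and as "new" otherwise. When the email gives a relative variation, '
        "apply it to the stored rate and return the resulting absolute rate.\n\n"
        f"Email:\n{email_text}"
    )


def apply_basis_points(rate_percent: float, delta_bp: float) -> float:
    """Apply a variation in basis points (1 bp = 0.01 percentage point)."""
    return round(rate_percent + delta_bp / 100, 2)


prompt = build_prompt(
    "Example Bank",
    [StoredRate(25, 3.70), StoredRate(20, 3.55)],
    "A 10-basis-point cut on 20-year loans.",
)
```

With a stored 20-year rate of 3.55%, a 10-basis-point cut gives 3.55 - 0.10 = 3.45%, which is the value the extraction should return.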
The prompt was refined using Langfuse, with the goal of improving extraction quality by selecting the best prompt and LLM combination.
Step 4: batch inference and Pydantic validation
Once the emails are prepared, inference is handed off to the internal batch service. The call submits a batch of jobs, polls until completion, and retrieves the outputs. Provider batch APIs offer around 50% cost reduction versus synchronous calls, which on weekly volumes is a meaningful saving.
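The submit, poll, retrieve cycle can be sketched as below. The internal batch service's API is not described in this case study, so `BatchClient` and its three methods are a hypothetical interface, and `FakeBatchClient` is an in-memory stand-in that lets the example run without any network call.

```python
import time
from typing import Protocol


class BatchClient(Protocol):
    """Hypothetical interface; the real service's API may differ."""

    def submit(self, prompts: list[str]) -> str: ...
    def status(self, batch_id: str) -> str: ...
    def results(self, batch_id: str) -> list[str]: ...


def run_batch(
    client: BatchClient,
    prompts: list[str],
    poll_interval_s: float = 30.0,
    timeout_s: float = 24 * 3600,
) -> list[str]:
    """Submit a batch, poll until it finishes, then return the outputs."""
    batch_id = client.submit(prompts)
    deadline = time.monotonic() + timeout_s
    while True:
        state = client.status(batch_id)
        if state == "completed":
            return client.results(batch_id)
        if state == "failed":
            raise RuntimeError(f"batch {batch_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"batch {batch_id} still {state} after {timeout_s}s")
        time.sleep(poll_interval_s)


class FakeBatchClient:
    """In-memory stand-in: reports 'running' a few times, then 'completed'."""

    def __init__(self, polls_before_done: int = 2) -> None:
        self._remaining = polls_before_done
        self._prompts: list[str] = []

    def submit(self, prompts: list[str]) -> str:
        self._prompts = list(prompts)
        return "batch-1"

    def status(self, batch_id: str) -> str:
        if self._remaining > 0:
            self._remaining -= 1
            return "running"
        return "completed"

    def results(self, batch_id: str) -> list[str]:
        return [f"output for: {p}" for p in self._prompts]


outputs = run_batch(FakeBatchClient(), ["email 1", "email 2"], poll_interval_s=0.0)
```

The explicit deadline is a design choice of this sketch: a batch that never completes should surface as an error rather than block the nightly run indefinitely.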
The LLM output is validated by Pydantic before any write to the database:
from datetime import date

from pydantic import BaseModel


class RateChange(BaseModel):
    duration: int  # loan duration in years
    new_rate: float  # rate in percent, e.g. 3.45
    effective_date: date | None  # None when the email gives no date
    change_type: str  # "update" or "new"

If the returned JSON is invalid (format hallucination, missing field), Pydantic raises a validation error. The error is logged and nothing is written to the database. This is non-negotiable: a silently misextracted financial figure is worse than missing data.
Key takeaways
Image filtering is not optional. Without it, base64-encoded logos inflate the context and degrade extraction quality. It is not a comfort optimization.
Contextual enrichment changes the equation. Injecting current rates into the prompt significantly reduces errors on emails expressing relative variations. The LLM without context will hallucinate absolute values when the email says "10bp cut."
Deferred batch processing is the right approach here. Rate extraction is not real-time-critical. Accumulating emails throughout the day and processing them in a single evening batch run costs less, is simpler to operate, and makes the result available the next morning.
The same pattern applies in other contexts: extracting information from supplier emails, processing orders received by email, classifying incoming support tickets. The email preprocessing is always the same; only the prompt and the Pydantic schema change depending on the business domain.
If your company processes large volumes of email and needs to extract structured data from them, let's talk.
Client testimonial
“Following his internship, we continued working with Pierre as a freelancer while he pursued his studies in parallel. He is hardworking, efficient, precise and reliable. Thank you again Pierre for all the great work, see you very soon :)”