Pierre Kasparian, AI & Data freelancer
Automatic business classification - Médée

2024

A model that automatically classifies businesses by sector to spot high-potential market areas.

Python, ML, SVM, Random Forest, Mistral

Machine learning mission for Médée: design an automatic business classification model for two sectors (photography or food) for a multinational client. The model was then used to identify geographic areas lacking relevant points of sale, supporting market analysis.

  1. Data collection: business data enrichment via website metadata (text content, titles, descriptions), third-party APIs (Google Maps, user reviews), and image analysis via LLMs to generate contextual descriptions.
  2. Data preparation: cleaning, filtering out entries with insufficient information, vectorisation of text data using Mistral embeddings.
  3. Modelling: testing multiple supervised classifiers (SVM, Random Forest) on embeddings, achieving 90%+ accuracy with a low false-positive rate.

Detailed case study

Médée is a technology services company (SAS) that, alongside its own product (a health advice chatbot), operates as a tech contractor for external clients. In this case, they were acting as an intermediary for a multinational client (whose name I am withholding for confidentiality reasons). The goal: automatically classify businesses into two target sectors (photography and food/catering) to identify geographic areas lacking relevant points of sale.

The starting point was an Excel file provided by the client: business addresses with their ground-truth labels (in-target or out-of-target). This annotated dataset served as the training base. The mission was an MVP designed to validate feasibility before a second, larger engagement.

The starting point: an Excel file and a simple question

The client had a precise need: given a list of businesses with their addresses, determine which ones belong to the photography or food sector.

The end use case was geographic. By automatically classifying all businesses in an area, you can spot underrepresented sectors: a shopping street with no professional photographer, a neighbourhood with no specialised caterer. What the multinational cared about were those blank spots on the map.

The ground-truth Excel covered a few hundred manually labelled businesses. Not enough to train a robust model directly: the data needed enriching.

Data collection: scraping and multi-source enrichment

Three enrichment sources were used.

Business websites. For each business with a website, we extracted the textual content (titles, meta-descriptions, main text). This is often the most discriminating source: a food business uses very different words on its website than a photography studio.
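
As an illustration of that extraction step, here is a minimal sketch using only the Python standard library. The class and function names (PageTextExtractor, extract_page_text) are hypothetical, and the production pipeline may well have relied on a dedicated scraping library instead.

```python
from html.parser import HTMLParser


class PageTextExtractor(HTMLParser):
    """Collects the <title>, the meta description and the visible text of a page."""

    # Content of these tags is never shown to the visitor, so it is ignored.
    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = (attrs.get("content") or "").strip()
        elif tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        chunk = " ".join(data.split())  # collapse runs of whitespace
        if not chunk:
            return
        if self._in_title:
            self.title = (self.title + " " + chunk).strip()
        elif self._skip_depth == 0:
            self.text_parts.append(chunk)


def extract_page_text(html):
    """Return the title, meta description and main text of an HTML document."""
    parser = PageTextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "description": parser.description,
        "text": " ".join(parser.text_parts),
    }
```

The three fields can then be joined into a single string per business before vectorisation.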

Pages Jaunes (French business directory). Directory listings give access to the declared category, customer reviews, and sometimes an activity description. The problem: Pages Jaunes is protected against scraping. The solution was to use rotating residential proxies to bypass detection, simulating requests from real domestic connections.

LLM-based enrichment from images. Some businesses had no website or usable Pages Jaunes listing, but did have an image (logo, storefront photo). We used a vision model to generate a textual description of the image, which then fed into the vectorisation pipeline.

Data preparation: cleaning before vectorising

Not all businesses were usable. Two types of cases were filtered out:

  • No website, no Pages Jaunes listing, no image. Without any textual or visual signal, it is impossible to build a meaningful embedding. These businesses were removed from the training set.
  • Too little data. Some entries had only a generic business name and nothing else. Keeping them would have introduced noise with no predictive value.
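
The two filters above can be sketched as a single check on the consolidated text. The field names (website_text, directory_text, image_description) and the 40-character threshold are assumptions made for the example, not the values used on the project.

```python
# Hypothetical minimum amount of text needed to keep a business.
MIN_SIGNAL_CHARS = 40

# Hypothetical field names for the three enrichment sources.
TEXT_FIELDS = ("website_text", "directory_text", "image_description")


def consolidated_text(business):
    """Join every available textual signal for one business."""
    parts = [(business.get(field) or "").strip() for field in TEXT_FIELDS]
    return " ".join(part for part in parts if part)


def has_enough_signal(business, min_chars=MIN_SIGNAL_CHARS):
    """True when the business carries enough text to build a useful embedding."""
    return len(consolidated_text(business)) >= min_chars


def filter_usable(businesses, min_chars=MIN_SIGNAL_CHARS):
    """Drop businesses with no source at all or with too little text."""
    return [b for b in businesses if has_enough_signal(b, min_chars)]
```

A business with none of the three sources yields an empty string and is dropped by the same rule as one with only a few characters of text.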

After cleaning, the consolidated text for each business (website + Pages Jaunes + image description) was vectorised with Mistral embeddings. Each business became a dense vector of several hundred dimensions, capturing the semantic meaning of all its textual signals.
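
The vectorisation loop itself is simple. The sketch below batches the consolidated texts and delegates the actual embedding call to an embed_fn argument, which is a stand-in: in practice it would wrap a request to the Mistral embeddings endpoint, deliberately not shown here to avoid guessing at a client signature. The function name and the batch size are hypothetical.

```python
def embed_in_batches(texts, embed_fn, batch_size=32):
    """Embed a list of texts by calling embed_fn on successive batches.

    embed_fn is an assumption of this sketch: any callable that takes a list
    of strings and returns one vector (a list of floats) per string.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        batch_vectors = embed_fn(batch)
        # Guard against a provider silently dropping or merging inputs.
        if len(batch_vectors) != len(batch):
            raise ValueError("embed_fn returned a different number of vectors than inputs")
        vectors.extend(batch_vectors)
    return vectors
```

Keeping the provider call behind a plain callable also makes it easy to test the pipeline offline with a fake embedder.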

Modelling: why XGBoost on embeddings

Several classifiers were tested on these embeddings: SVM, Random Forest, XGBoost. XGBoost produced the best results, and this was not by chance.

The main reason: missing data. Even after cleaning, some businesses did not have all sources filled in. A website but no Pages Jaunes. An image but no website. Tree-based models cope with this kind of incomplete input more gracefully. XGBoost in particular handles missing values natively: at each split it learns a default branch for rows where the feature is absent. An SVM on concatenated embeddings, by contrast, needs every input filled in or imputed, which makes it sensitive to the structure of the missing sources.
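
To make the missing-source case concrete, here is a minimal sketch of how a feature row could be assembled. It assumes one embedding per source, concatenated in a fixed order, with NaN placeholders for absent sources; the source names and the helper are hypothetical. XGBoost treats NaN as missing by default, so such rows can be passed to it without imputation.

```python
import math

# Hypothetical source order; each source contributes `dim` features.
SOURCES = ("website", "directory", "image")


def build_feature_row(source_vectors, dim):
    """Concatenate per-source embeddings, filling absent sources with NaN."""
    row = []
    for source in SOURCES:
        vector = source_vectors.get(source)
        if vector is None:
            # Missing source: NaN placeholders, which XGBoost reads as missing values.
            row.extend([math.nan] * dim)
        else:
            if len(vector) != dim:
                raise ValueError(f"{source} embedding has {len(vector)} dimensions, expected {dim}")
            row.extend(vector)
    return row
```

Every row then has the same length, whichever sources were available for the business.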

XGBoost adds gradient boosting on top: each new tree is fit to the errors left by the trees before it, which improves robustness on edge cases (ambiguous businesses whose name or website gives no clear signal).

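from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

# X_embeddings: array of shape (n_businesses, n_dimensions), one embedding per row.
# y: binary labels from the client Excel (1 = in-target, 0 = out-of-target).
model = XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.05,
    subsample=0.8,         # each tree sees 80% of the rows
    colsample_bytree=0.8,  # each tree sees 80% of the features
    eval_metric="logloss",
)

# Weighted F1 on 5 folds; each fold is scored on businesses unseen during training.
scores = cross_val_score(model, X_embeddings, y, cv=5, scoring="f1_weighted")
print(f"Weighted F1: {scores.mean():.3f} +/- {scores.std():.3f}")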

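5-fold cross-validation measured the weighted F1-score on unseen subsets of the client Excel. It does not prevent overfitting by itself, but it gives a more reliable performance estimate than a single split on a relatively small labelled dataset.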

Results

91% accuracy on the validation set. For an MVP on enriched, partially missing data, this was sufficient to justify the second engagement.

Low false-positive rate. The most costly error for the end client was classifying an out-of-target business as in-target (investing in an area that does not warrant it). Precision on the in-target class was therefore the primary optimisation target.
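
For reference, the metrics discussed here can be computed directly from predictions. This is a minimal sketch with a hypothetical helper, assuming labels are encoded as 1 for in-target and 0 for out-of-target.

```python
def binary_report(y_true, y_pred):
    """Accuracy, in-target precision and false-positive rate for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Share of predicted in-target businesses that really are in-target.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Share of out-of-target businesses wrongly flagged as in-target.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }
```

Precision and false-positive rate answer different questions: the first looks at the businesses the model flagged, the second at the businesses that should never have been flagged.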

Identifiable market gaps. By applying the model to new addresses, it becomes possible to map areas by target density. That is the expected output for the second mission.
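
A minimal sketch of that mapping step, assuming each business carries an area key (a postcode, for instance) and the model outputs 1 for in-target. The field name, the helpers and the threshold are hypothetical, since this part belongs to the second mission.

```python
from collections import Counter


def target_density_by_area(businesses, predictions):
    """Share of in-target businesses per area, from model predictions (1 = in-target)."""
    totals = Counter()
    targets = Counter()
    for business, prediction in zip(businesses, predictions):
        area = business["area"]  # hypothetical field, e.g. a postcode
        totals[area] += 1
        if prediction == 1:
            targets[area] += 1
    return {area: targets[area] / totals[area] for area in totals}


def find_gaps(density, max_density=0.0):
    """Areas whose in-target density is at or below the threshold, sorted by name."""
    return sorted(area for area, value in density.items() if value <= max_density)
```

With the default threshold, find_gaps returns the areas where no business at all was classified as in-target.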

What this project shows about web-sourced classification data

Two observations applicable to other classification projects on web data.

Textual signal beats categorical signal. Declared categories on Pages Jaunes are often too broad or poorly filled in. Raw website text is discriminating: a cooking school talks about "courses", "recipes", "apron". A photo studio talks about "session", "portrait", "retouching". Embeddings capture these semantic differences without manual rules.

Residential proxies are not a detail. Pages Jaunes detects and blocks requests from datacentres. Without rotating residential proxies, scraping stops after a few dozen requests. On a classification project at scale, this is an infrastructure constraint to anticipate during scoping.

Pitfalls to avoid

Keeping businesses with no signal. It is tempting to retain all entries from the client Excel to maximise training volume. In practice, an entry without text or image produces only a null or noisy vector: it degrades performance rather than improving it.

One model for everything. The "photography" and "food" sectors have very different signals. Depending on the requirements, a one-vs-rest model per sector can outperform a generic multiclass model.

Ignoring class imbalance. If the client Excel has 80% out-of-target and 20% in-target businesses, a naive model that always predicts "out-of-target" achieves 80% overall accuracy while learning nothing. Use weighted F1-score and XGBoost's scale_pos_weight parameter to compensate.
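
The arithmetic behind this pitfall fits in a few lines. The sketch below uses the illustrative 80/20 split from the paragraph above, not the actual distribution of the client file, and the helper names are hypothetical. The negative-to-positive ratio is the usual starting value for scale_pos_weight.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a naive model that always predicts the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def suggested_scale_pos_weight(labels):
    """Ratio of negatives to positives, a common starting point for scale_pos_weight."""
    positives = sum(1 for label in labels if label == 1)
    negatives = sum(1 for label in labels if label == 0)
    if positives == 0:
        raise ValueError("no positive example in the labels")
    return negatives / positives


# Illustrative split: 80 out-of-target (0) and 20 in-target (1) businesses.
labels = [0] * 80 + [1] * 20
baseline = majority_baseline_accuracy(labels)
weight = suggested_scale_pos_weight(labels)
```

Any model worth keeping has to beat the baseline, and the weight can be passed as XGBClassifier(scale_pos_weight=weight) before being tuned on validation data.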

Conclusion: TL;DR

Automatic business classification by sector, from scraped and LLM-enriched data, to map high-potential market areas.

Key points:

  • Client Excel + Pages Jaunes scraping (residential proxies) + websites + LLM image descriptions
  • Mistral embeddings to vectorise heterogeneous textual signals
  • XGBoost to handle per-source missing data and maximise accuracy
  • 91% accuracy, foundation for the second engagement

If you have a need for automatic qualification or classification from web data, let's talk.