Pierre Kasparian · AI & Data freelancer
Python · PDF · LiteParse · RAG · GDPR

LiteParse v2.0: Local PDF Extraction Without LLM or Cloud

May 28, 2026 · 7 min read · Guides

Pierre Kasparian

AI Engineer — UTT 4th year · LLM, RAG & GDPR compliance specialist · 15+ client projects

Extracting text from enterprise PDFs without sending data to an external cloud service is exactly what LiteParse v2.0 delivers. A fully open-source tool, rewritten in Rust by the LlamaIndex team, it processes documents locally, deterministically, and without a network connection.

LiteParse v2.0 parses PDFs and Office documents without any LLM and without calling a remote API. Text is extracted while preserving the original layout, with spatial bounding boxes, and with no external dependency. For organizations handling sensitive data (contracts, HR files, financial documents), this is a foundational component for a GDPR-compliant document pipeline.

Why "No Cloud" Matters for GDPR

Sending a PDF to a remote extraction API (Google Document AI, AWS Textract, Adobe Extract) hands its contents to a third-party processor. When the document contains personal data, Article 28 of the GDPR requires a Data Processing Agreement (DPA) with that processor. If the processor is a US company, the CLOUD Act (2018) allows US authorities to compel access to that data, even when it is hosted in Europe.

With LiteParse, extraction runs entirely on your own infrastructure: no network call, no DPA to manage for the extraction step, and no document leaving your environment at this stage of the pipeline.

What LiteParse v2.0 Brings

LiteParse v2.0 is a complete rewrite of the original Node.js project, now using Rust as the core engine. Key changes:

  • Rust engine: up to 100x faster than v1.0 on small documents, 3x on large ones
  • Multi-language: native packages for Python, JavaScript/TypeScript, Rust, and WASM (browser and edge runtimes)
  • Built-in OCR via Tesseract-rs for scanned PDFs
  • No LLM required: pure structural extraction, deterministic and reproducible
  • Layout preservation: text is reconstructed according to its spatial position in the document
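
To make the layout-preservation point concrete, here is a minimal sketch of how text can be rebuilt from spatial positions. It is not LiteParse's actual algorithm, which this article does not describe. It is a generic approach that assumes each word comes with an (x, y, width, height) box and that the y axis grows downward. The names `Word` and `reconstruct_lines` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical word record: text plus an (x, y, width, height) box."""
    text: str
    x: float
    y: float
    width: float
    height: float


def reconstruct_lines(words: list[Word], y_tolerance: float = 3.0) -> list[str]:
    """Group words into lines by vertical position, then order each line left to right."""
    lines: list[list[Word]] = []
    # Visit words top to bottom (assumes the y axis grows downward).
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        # Join the current line if the word's top edge is close to the line's first word.
        if lines and abs(word.y - lines[-1][0].y) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines]


# Words deliberately given out of reading order, with slightly uneven baselines.
sample = [
    Word("Total:", 50, 200.5, 40, 10),
    Word("Invoice", 50, 100, 50, 10),
    Word("1,250.00", 300, 199.8, 60, 10),
    Word("#2026-041", 110, 101, 60, 10),
]
print(reconstruct_lines(sample))  # ['Invoice #2026-041', 'Total: 1,250.00']
```

The tolerance absorbs small baseline differences between words on the same line. Multi-column pages would need an extra column-detection step that this sketch leaves out.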

Published benchmark: 0.777 seconds for a 457-page, 100 MB document.
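
To put the benchmark figure in perspective, the short calculation below converts it into throughput. It uses only the three numbers quoted above. The hardware and the share of scanned pages are not stated here, so treat the result as an order of magnitude rather than a guarantee.

```python
# Figures quoted in the published benchmark.
pages = 457
size_mb = 100
seconds = 0.777

pages_per_second = pages / seconds
mb_per_second = size_mb / seconds
ms_per_page = seconds * 1000 / pages

print(f"{pages_per_second:.0f} pages/s, {mb_per_second:.0f} MB/s, {ms_per_page:.2f} ms/page")
# 588 pages/s, 129 MB/s, 1.70 ms/page
```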

Installation

Python

pip install liteparse

JavaScript / TypeScript

npm install @llamaindex/liteparse

Rust

cargo install liteparse

Browser and Edge Runtimes (WASM)

npm install @llamaindex/liteparse-wasm

The WASM package enables PDF parsing directly in the browser with no server required. OCR is handled through callbacks in WASM mode (Tesseract is not bundled by default in this runtime).

Basic Usage in Python

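from liteparse import parse_pdf

# Parse a local file; the document never leaves the machine
result = parse_pdf("contract.pdf")

for page in result.pages:
    print(f"--- Page {page.number} ---")
    for block in page.blocks:
        # Each block carries its text and its position on the page
        print(block.text)
        print(f"  Bbox: {block.bbox}")  # (x, y, width, height)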

Extraction preserves reading order and the spatial coordinates of each block, which simplifies chunking for a RAG pipeline.
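
As an illustration of that chunking step, here is a minimal sketch that packs consecutive blocks into size-limited chunks while keeping page numbers and bounding boxes as metadata for later citation. It does not call LiteParse. `Block`, `Chunk`, and `chunk_blocks` are hypothetical names, and `Block` only mimics the shape of the data shown in the example above.

```python
from dataclasses import dataclass

Bbox = tuple[float, float, float, float]  # (x, y, width, height)


@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for an extracted block."""
    page: int
    text: str
    bbox: Bbox


@dataclass
class Chunk:
    text: str
    pages: list[int]
    bboxes: list[tuple[int, Bbox]]  # (page, bbox) pairs, in reading order


def _to_chunk(blocks: list[Block]) -> Chunk:
    return Chunk(
        text="\n\n".join(b.text for b in blocks),
        pages=sorted({b.page for b in blocks}),
        bboxes=[(b.page, b.bbox) for b in blocks],
    )


def chunk_blocks(blocks: list[Block], max_chars: int = 800) -> list[Chunk]:
    """Pack consecutive blocks into chunks of at most max_chars characters.

    Blocks are never split, so a single block longer than max_chars
    ends up alone in an oversized chunk.
    """
    chunks: list[Chunk] = []
    current: list[Block] = []
    size = 0
    for block in blocks:
        added = len(block.text) + (2 if current else 0)  # 2 for the "\n\n" separator
        if current and size + added > max_chars:
            chunks.append(_to_chunk(current))
            current, size = [], 0
            added = len(block.text)
        current.append(block)
        size += added
    if current:
        chunks.append(_to_chunk(current))
    return chunks


demo = [
    Block(1, "A" * 300, (50, 100, 400, 60)),
    Block(1, "B" * 300, (50, 180, 400, 60)),
    Block(2, "C" * 300, (50, 100, 400, 60)),
]
for chunk in chunk_blocks(demo, max_chars=700):
    print(len(chunk.text), chunk.pages)
```

Keeping the (page, bbox) pairs alongside each chunk lets a RAG answer point back to the exact region of the source PDF.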

LiteParse or PyMuPDF: Which One to Choose?

Both tools parse PDFs locally. Here are the key differences:

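| Criterion        | LiteParse v2.0         | PyMuPDF (fitz)     |
|------------------|------------------------|--------------------|
| Language support | Python, JS, Rust, WASM | Python only        |
| Engine           | PDFium (custom fork)   | MuPDF (C)          |
| Built-in OCR     | Yes (Tesseract-rs)     | No (external only) |
| WASM / browser   | Yes                    | No                 |
| Ecosystem        | LlamaIndex             | Universal          |
| Maturity         | v2.0 (2026)            | Very mature        |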

Simple rule: for a pure Python pipeline with fine-grained parsing control, PyMuPDF remains a solid reference. If you need multi-language support, WASM deployment, or native LlamaIndex integration, LiteParse is the natural choice.

Enterprise Use Cases

LiteParse is particularly well-suited for:

  • RAG pipelines on internal documents: contracts, HR files, technical documentation, without exposing PDFs externally
  • Invoice and receipt processing: fast, local, GDPR-compliant extraction
  • Web or edge applications: client-side parsing via WASM, no server needed
  • Compliance workflows: document auditing where data cannot leave the infrastructure
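
For the invoice and receipt use case, bounding boxes make it possible to pull text from a known zone of the page. The sketch below assumes blocks with (x, y, width, height) boxes and a top-left origin. `Block` and `blocks_in_region` are hypothetical names, and the coordinates are made-up values on an A4 page measured in points.

```python
from dataclasses import dataclass

Box = tuple[float, float, float, float]  # (x, y, width, height)


@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for an extracted block."""
    text: str
    bbox: Box


def blocks_in_region(blocks: list[Block], region: Box) -> list[Block]:
    """Return the blocks whose center point lies inside the region."""
    rx, ry, rw, rh = region
    selected = []
    for block in blocks:
        x, y, w, h = block.bbox
        cx, cy = x + w / 2, y + h / 2
        if rx <= cx <= rx + rw and ry <= cy <= ry + rh:
            selected.append(block)
    return selected


page_blocks = [
    Block("ACME Corp", (50, 40, 120, 14)),
    Block("Total due: 1,250.00 EUR", (380, 760, 160, 14)),
    Block("Page 1/1", (270, 810, 50, 10)),
]
# Bottom-right zone of a 595 x 842 pt page, where totals often appear.
totals_zone = (300, 700, 295, 100)
print([b.text for b in blocks_in_region(page_blocks, totals_zone)])
# ['Total due: 1,250.00 EUR']
```

Testing the center point rather than full containment tolerates blocks that slightly overlap the zone boundary. Templates that vary from one supplier to another would need one zone per layout.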

Conclusion

LiteParse v2.0 is a serious tool for local document extraction. Its Rust rewrite, multi-language support, built-in OCR, and WASM compatibility make it a versatile choice for teams that want to process PDFs quickly, accurately, and without sending data to the cloud.

For European companies managing sensitive data (client records, contracts, HR files), this is precisely the kind of tool to prioritize in a GDPR-compliant AI architecture.

If you want to build a document extraction pipeline adapted to your legal constraints, get in touch.

About the author

Pierre Kasparian

4th-year engineering student at UTT (University of Technology of Troyes) and AI integration freelancer. He deploys LLMs, RAG pipelines, and AI agents for French and European companies, with strong expertise in GDPR compliance and European hosting. 15+ client projects, including Pretto and LiveSession.