Comprehensive Benchmark of Python Text Extraction Libraries: 2025 Insights
Discover the performance differences among leading Python text extraction tools: an in-depth analysis based on real-world documents.
Introduction
In document processing, choosing the right text extraction library can significantly affect your project’s efficiency and reliability. Over the past year, I evaluated four prominent Python libraries (Kreuzberg, Docling, MarkItDown, and Unstructured) by testing them across a diverse set of 94 real-world documents totaling approximately 210MB. The results can help guide your choice, whether for production deployment, research, or enterprise solutions.
Overview of the Benchmarked Libraries
- Kreuzberg: An open-source library designed for fast, reliable text extraction, supporting both synchronous and asynchronous processing, with OCR capabilities.
- Docling: A sophisticated, machine learning-based tool from IBM, offering deep document understanding but with a hefty installation size.
- MarkItDown: A lightweight solution optimized for converting simple PDFs and Office documents into Markdown, ideal for preparing input for large language models.
- Unstructured: An enterprise-grade library focused on robustness across a variety of document formats, at the cost of a larger installation footprint.
Testing Methodology
The evaluation aimed for fairness and real-world applicability:
- The corpus comprised 94 documents spanning PDFs, Word files, HTML, images, and spreadsheets.
- File sizes ranged from tiny (<100KB) to massive (>50MB).
- Documents covered six languages: English, Hebrew, German, Chinese, Japanese, and Korean.
- All tests ran in CPU-only environments to ensure consistency.
- Metrics included processing speed, memory consumption, installation size, and success/failure rates.
- Each document was processed through automated, reproducible workflows, with multiple iterations for accuracy.
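The measurement loop described above can be sketched roughly as follows. This is a minimal illustration, not the harness used for this benchmark: `benchmark_file`, `summarize`, `RunResult`, and the `extract` callable are hypothetical names, and `extract` stands in for whatever wrapper you write around a given library. Memory is tracked with the standard library's `tracemalloc`, which only sees Python-level allocations and adds timing overhead, so treat the numbers as relative rather than absolute.

```python
import time
import tracemalloc
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List, Optional


@dataclass
class RunResult:
    path: str
    seconds: float          # best wall-clock time across iterations
    peak_bytes: int         # highest Python-level peak seen across iterations
    ok: bool
    error: Optional[str] = None


def benchmark_file(
    extract: Callable[[Path], str],  # hypothetical wrapper around a library
    path: Path,
    iterations: int = 3,
) -> RunResult:
    """Run `extract` on one file several times and keep the fastest run."""
    best = float("inf")
    peak = 0
    for _ in range(iterations):
        tracemalloc.start()
        start = time.perf_counter()
        try:
            extract(path)
        except Exception as exc:
            # Record the failure and stop retrying this file.
            tracemalloc.stop()
            elapsed = time.perf_counter() - start
            return RunResult(str(path), elapsed, peak, False, repr(exc))
        elapsed = time.perf_counter() - start
        _, run_peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        best = min(best, elapsed)
        peak = max(peak, run_peak)
    return RunResult(str(path), best, peak, True)


def summarize(results: List[RunResult]) -> dict:
    """Aggregate per-file results into success rate, throughput, and peak memory."""
    ok = [r for r in results if r.ok]
    total = sum(r.seconds for r in ok)
    return {
        "success_rate": len(ok) / len(results) if results else 0.0,
        "files_per_second": len(ok) / total if total > 0 else 0.0,
        "peak_mb": max((r.peak_bytes for r in ok), default=0) / 1e6,
    }
```

To use it, pass one callable per library (for example, a function that opens the file with that library and returns the extracted text), run `benchmark_file` over every document, and feed the list of results to `summarize`. Installation size is not covered here, since it is measured outside the process.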
Key Findings
Speed and Performance
- Kreuzberg leads with an impressive rate of over 35 files per second, reliably handling diverse document types and sizes.
- Unstructured offers moderate speed, excelling in complex layouts with high consistency.
- MarkItDown performs admirably on simple, straightforward documents but falters with intricate or large files.
- Docling frequently spends over an hour per file, making it less suitable for high-volume workflows.
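To put those rates in perspective, here is a back-of-the-envelope conversion from throughput to batch time. It assumes, purely for illustration, that each quoted rate holds uniformly across a 94-document corpus, which the measurements above do not claim; `batch_seconds` is a hypothetical helper.

```python
def batch_seconds(n_files: int, files_per_second: float) -> float:
    """Estimated wall-clock time to process n_files at a steady rate."""
    if files_per_second <= 0:
        raise ValueError("files_per_second must be positive")
    return n_files / files_per_second


# 94 files at 35 files per second (the lower bound quoted for Kreuzberg)
fast = batch_seconds(94, 35)
print(round(fast, 2))          # 2.69 seconds

# 94 files at one file per hour (the rough figure quoted for Docling)
slow = batch_seconds(94, 1 / 3600)
print(round(slow / 3600, 1))   # 94.0 hours
```

Under that simplifying assumption, the gap is a few seconds versus several days for the same batch, which is why per-file latency matters so much in high-volume workflows.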
Installation and Footprint
- **Kreuzberg

