Comprehensive Benchmarking of Python Text Extraction Libraries in 2025: An In-Depth Analysis
In the rapidly evolving landscape of document processing, selecting the right Python library for text extraction can significantly impact project performance and reliability. As part of ongoing efforts to improve Kreuzberg, a Python library dedicated to efficient text extraction, I embarked on an extensive benchmarking project to evaluate how various popular libraries perform under real-world conditions. This article presents the methodology, key results, and practical recommendations based on benchmarking four prominent tools: Kreuzberg, Docling, MarkItDown, and Unstructured.
Purpose and Approach
The primary goal was to provide an honest, data-driven comparison of Python text extraction libraries across diverse document types, sizes, and languages. Unlike cherry-picked or idealized tests, this benchmark emphasizes real-world scenarios, analyzing 94 documents (PDFs, Word files, HTML pages, images, and spreadsheets) ranging from tiny files under 100 KB to massive academic papers up to 59 MB. The testing environment was standardized: all executions ran on CPU-only systems without GPU acceleration, ensuring fairness across tools.
Open-source transparency was a core principle. The entire benchmarking pipeline, data, and analysis scripts are publicly available, enabling reproducibility and ongoing updates.
Libraries Evaluated
- Kreuzberg: Our in-house solution optimized for speed and lightweight deployment.
- Docling: A machine learning-powered framework from IBM, known for advanced document understanding.
- MarkItDown: A Microsoft open-source Markdown converter, often used for simpler document conversions.
- Unstructured: A commercial-grade enterprise document-processing library, widely adopted for its reliability.
Evaluation Metrics
Metrics covered include processing speed (files per second), memory footprint, success rates, installation size, and handling of multilingual content. Performance was assessed across various document sizes and complexities, and the results shed light on each library's strengths and limitations.
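The three runtime metrics (throughput, memory footprint, success rate) can all be captured by a single harness that wraps any library's extraction entry point. The following is a minimal sketch of such a harness, not the project's actual benchmarking code; the `extract` callable stands in for whichever library is under test.

```python
import time
import tracemalloc
from typing import Callable, Iterable

def benchmark(extract: Callable[[str], str], files: Iterable[str]) -> dict:
    """Measure files/sec, peak memory, and success rate for an extractor.

    Illustrative sketch: `extract` is any function that takes a file path
    and returns extracted text, raising on failure.
    """
    files = list(files)
    tracemalloc.start()
    start = time.perf_counter()
    successes = 0
    for path in files:
        try:
            extract(path)
            successes += 1
        except Exception:
            pass  # a failed file lowers the success rate, not the run
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "files_per_sec": len(files) / elapsed if elapsed > 0 else float("inf"),
        "peak_memory_mb": peak / (1024 * 1024),
        "success_rate": successes / len(files) if files else 0.0,
    }
```

Counting exceptions rather than letting them abort the run is what makes the success-rate metric meaningful on a messy real-world corpus.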
Key Findings and Performance Highlights
Speed and Efficiency
| Rank | Library      | Approximate Processing Speed | Notes                                       |
|------|--------------|------------------------------|---------------------------------------------|
| 1    | Kreuzberg    | Over 35 files/sec            | Consistently fast across all document types |
| 2    | Unstructured | Moderate speed, reliable     | Excellent reliability, decent speed         |
| 3    | MarkItDown   | Good for simple documents    | Struggles with complex or large files       |