Comprehensive Benchmarking of Python Text Extraction Libraries (2025 Results) — What You Need to Know
In the rapidly evolving landscape of document processing, choosing the right text extraction library can be a daunting task. To aid developers and data professionals, I conducted an in-depth, objective performance analysis of four leading Python libraries, testing their capabilities across a diverse set of real-world documents. Here’s what I found.
Why Conduct This Benchmark?
As the creator of Kreuzberg, a lightweight and efficient text extraction library, I wanted to evaluate how it stacks up against other popular options in the Python ecosystem. This comprehensive benchmark is designed to provide transparent, real-world insights based on automated testing of 94 documents that include PDFs, Word files, HTML pages, images, and spreadsheets, spanning sizes from tiny files to massive academic papers.
Note: All tests are fully reproducible; the code, data, and methodology are openly available to ensure transparency.
Libraries Under Review
- Kreuzberg: A minimal, high-speed library tailored for production environments (71MB, 20 dependencies).
- Docling: IBM’s machine learning-driven solution optimized for complex document understanding (1,032MB, 88 dependencies).
- MarkItDown: Microsoft’s Markdown-focused parser suitable for straightforward documents (251MB, 25 dependencies).
- Unstructured: A versatile, enterprise-grade document processing framework (146MB, 54 dependencies).
Benchmarking Approach
I evaluated each library on a comprehensive suite of metrics:
- Processing Speed: Files processed per second.
- Resource Consumption: Memory and CPU utilization.
- Reliability: Success rate across varied document types and sizes.
- Installation Footprint: Package size and dependencies.
- Multi-language Support: Handling documents in languages like English, German, Hebrew, Chinese, Japanese, and Korean.
All tests were performed on a CPU-only environment to ensure fairness, with each document processed multiple times for statistical accuracy.
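A measurement loop of this shape can be sketched with the standard library alone. This is a minimal illustration, not the actual benchmark harness: `extract_text` is a hypothetical stand-in for whichever library is under test, and real memory profiling of native extensions would need a tool like `psutil` rather than `tracemalloc`.

```python
import time
import tracemalloc
from pathlib import Path


def extract_text(path: Path) -> str:
    """Hypothetical extractor stand-in; swap in the library under test."""
    return path.read_text(errors="ignore")


def benchmark(paths: list[Path], runs: int = 3) -> dict:
    """Process each document `runs` times, tracking time, failures, and peak memory."""
    durations: list[float] = []
    failures = 0
    tracemalloc.start()
    for path in paths:
        for _ in range(runs):
            start = time.perf_counter()
            try:
                extract_text(path)
            except Exception:
                failures += 1
            durations.append(time.perf_counter() - start)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    total = sum(durations)
    return {
        "files_per_sec": len(durations) / total if total else 0.0,
        "success_rate": 1 - failures / len(durations),
        "peak_mem_mb": peak / 1_000_000,
    }
```

Running each document multiple times, as above, smooths out one-off timing noise before computing the throughput and success-rate figures reported below.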
Key Findings
Performance Highlights
| Rank | Library      | Speed             | Reliability       | Notes                            |
|------|--------------|-------------------|-------------------|----------------------------------|
| 1    | Kreuzberg    | Over 35 files/sec | 99%+ success rate | Fastest, most consistent overall |
| 2    | Unstructured | Moderate speed    | 88%+ success rate | Reliable with complex documents  |

