2025 Text Extraction Libraries Benchmark: Which One Comes Out on Top?
When it comes to extracting text from a variety of document formats using Python, selecting the right library can be a daunting task. To assist developers and data scientists in making informed decisions, I’ve conducted a thorough and impartial benchmarking study of four prominent Python text extraction frameworks. The results, based on extensive testing with real-world documents, may challenge your expectations.
Discover the Live Benchmark Results
Visit the interactive dashboard for detailed performance metrics and comparisons.
Setting the Stage
As the creator of Kreuzberg, a lightweight, high-performance text extraction library, I was motivated to evaluate how similar tools stack up in practical scenarios. This benchmarking effort aims to provide honest, reproducible data by testing each library against a diverse collection of 94 real documents, encompassing formats like PDFs, Word files, HTML pages, images, and spreadsheets. The dataset ranges in size from tiny files under 100KB to massive academic papers exceeding 50MB, and covers six languages: English, Hebrew, German, Chinese, Japanese, and Korean.
Note: While I am the author of Kreuzberg, these tests are fully automated, open-source, and free from bias, designed solely to deliver transparent performance insights.
The Contenders
- Kreuzberg: My own library, optimized for speed and minimal dependencies.
- Docling: IBM’s powerful machine-learning-based document understanding tool.
- MarkItDown: Microsoft’s simple-to-use Markdown converter, often employed for lightweight processing.
- Unstructured: An enterprise-oriented library focusing on high reliability across complex documents.
How Did They Perform?
Speed and Efficiency
- Kreuzberg leads with impressive processing rates, handling over 35 files per second while maintaining reliability across document types.
- Unstructured offers solid performance: it is slower, but more consistent.
- MarkItDown excels in straightforward cases, where it is quick and lightweight, but falters with complex or large files.
- Docling struggles with speed, sometimes taking over an hour to process a single document, making it less suitable for time-sensitive applications.
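For readers who want to reproduce a files-per-second figure on their own documents, here is a minimal sketch of a timing harness. It is not the benchmark's actual code: `measure_throughput` and the `extract` callable are hypothetical names, and `extract` stands in for whichever library call you want to time. Failures are counted rather than raised, so speed and reliability are reported together.

```python
import time
from pathlib import Path
from typing import Callable, Iterable


def measure_throughput(extract: Callable[[Path], str], paths: Iterable[Path]) -> dict:
    """Time a text-extraction callable over a set of files.

    `extract` is a placeholder for any function that takes a path and
    returns extracted text; it is an assumption, not a specific library API.
    """
    succeeded = 0
    failed = 0
    start = time.perf_counter()
    for path in paths:
        try:
            extract(path)
            succeeded += 1
        except Exception:
            # Count the failure and keep going, so one bad file
            # does not abort the whole run.
            failed += 1
    elapsed = time.perf_counter() - start
    total = succeeded + failed
    return {
        "files": total,
        "succeeded": succeeded,
        "failed": failed,
        "seconds": elapsed,
        "files_per_second": total / elapsed if elapsed > 0 else 0.0,
        "success_rate": succeeded / total if total else 0.0,
    }
```

To use it, wrap the library under test in a small function that accepts a `Path` and returns text, then pass that function along with your list of documents. Wall-clock timing like this includes disk I/O and any model loading on the first call, so a warm-up run and several repetitions give steadier numbers.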
Installation Footprint
- Kreuzberg stands out with just 71MB and only 20 dependencies, ideal for deployment in resource-constrained environments.
- **Unstructured