Comprehensive Performance Evaluation of Python Text Extraction Libraries in 2025
In the rapidly evolving field of document processing, selecting the right text extraction tool can significantly impact your project’s efficiency and reliability. Recently, I conducted an in-depth, unbiased benchmarking of four prominent Python libraries to help practitioners navigate this landscape with confidence. This comparison covers a broad spectrum of real-world documents and offers valuable insights into their strengths and limitations.
Understanding the Benchmark Framework
This evaluation was motivated by my work on Kreuzberg, a high-performance Python library designed for document processing. I aimed to objectively assess how Kreuzberg stacks up against other leading options, without bias or cherry-picking. The benchmark covers 94 diverse documents, ranging from tiny text files to large academic PDFs, that span different formats, languages, and complexities.
All tests are fully automated, reproducible, and based on open-source scripts. Here’s an overview of the key aspects:
- Documents tested: PDFs, Word documents, HTML pages, images, spreadsheets
- Size categories: from small (<100KB) to enormous (>50MB)
- Languages: English, Hebrew, German, Chinese, Japanese, Korean
- Metrics: processing speed, memory consumption, success rate, installation size
- Processing environment: CPU-only, no GPU acceleration
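To make the metrics above concrete, here is a minimal sketch of how a single extraction call could be measured. It is not the benchmark's actual code: `Measurement`, `measure`, and the `extract` callable are hypothetical names chosen for illustration, and `tracemalloc` only tracks allocations made through Python's allocator, so it would understate memory used by native extensions.

```python
import time
import tracemalloc
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Measurement:
    seconds: float          # wall-clock time for the call
    peak_bytes: int         # peak Python-level allocation during the call
    success: bool           # False if the extractor raised
    error: Optional[str] = None


def measure(extract: Callable[[str], str], path: str) -> Measurement:
    """Run one extraction (hypothetical callable) and record time, memory, outcome."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        extract(path)
        success, error = True, None
    except Exception as exc:  # a failed file counts against the success rate
        success, error = False, repr(exc)
    seconds = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return Measurement(seconds, peak, success, error)
```

Any library can be plugged in by wrapping its extraction call in a function that takes a file path, which keeps the timing and error handling identical across the tools being compared.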
The Libraries Evaluated
- Kreuzberg: My own lightweight, fast, and versatile library, with minimal dependencies.
- Docling: IBM’s machine learning-powered solution, known for its advanced comprehension.
- MarkItDown: Microsoft’s Markdown-focused converter, optimized for basic documents.
- Unstructured: A comprehensive enterprise-grade toolkit supporting complex layouts.
Key Findings at a Glance
Speed and Performance
- Kreuzberg leads with an exceptional processing rate of over 35 files per second, handling all document types comfortably.
- Unstructured offers a good balance with moderate speed and high reliability.
- MarkItDown performs well on straightforward documents but encounters issues with more complex files.
- Docling is significantly slower, sometimes taking over an hour per file, limiting practical usability.
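For readers who want to see how a figure such as "files per second" can be derived, the sketch below aggregates per-file results into throughput and success rate. The definitions are assumptions, not necessarily the ones used in this benchmark: throughput is taken as files attempted divided by total wall-clock seconds, and `summarize` is a hypothetical helper name.

```python
from typing import Dict, Iterable, Tuple


def summarize(results: Iterable[Tuple[float, bool]]) -> Dict[str, float]:
    """Aggregate (seconds, succeeded) pairs, one per file, into summary metrics."""
    results = list(results)
    total_files = len(results)
    total_seconds = sum(seconds for seconds, _ in results)
    succeeded = sum(1 for _, ok in results if ok)
    return {
        # Assumed definition: files attempted per second of processing time.
        "files_per_second": total_files / total_seconds if total_seconds > 0 else 0.0,
        # Fraction of files extracted without an error.
        "success_rate": succeeded / total_files if total_files else 0.0,
    }


# Illustrative input only, not measured data.
summary = summarize([(0.25, True), (0.25, True), (0.5, False)])
print(summary)
```

Reporting both numbers together matters, because a tool that fails quickly on hard files can look fast if throughput is quoted on its own.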
Installation Footprint
- Kreuzberg: Minimal at roughly 71MB with 20 dependencies.
- Unstructured: Larger at 146MB and 54 dependencies.
- MarkItDown: About 251MB, including onnx