Comprehensive Benchmarking of Python Text Extraction Libraries: 2025 Insights
Understanding the Performance Landscape of Text Extraction Tools in Python
In the rapidly evolving world of document processing, selecting the right text extraction library can significantly impact your project’s efficiency and reliability. I recently benchmarked four prominent Python libraries (Kreuzberg, Docling, MarkItDown, and Unstructured) to give developers and data scientists clear, data-driven guidance.
An Honest Evaluation: Methodology and Scope
The benchmarking effort focused on analyzing the capabilities of these libraries across 94 real-world documents, totaling approximately 210MB. The dataset included diverse formats such as PDFs, Word files, HTML pages, images, and spreadsheets, making the results relevant for a wide array of applications. Documents ranged from small files under 100KB to massive academic papers exceeding 50MB, covering multiple languages including English, Hebrew, Chinese, Japanese, Korean, and German.
All tests ran in CPU-only environments to ensure a fair comparison, with automated scripts running multiple iterations to gather statistically meaningful data. The benchmarks are fully open-source, reproducible, and designed to highlight practical performance metrics: processing speed, memory consumption, success rate, and installation overhead.
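To make these metrics concrete, here is a minimal sketch of how such a harness might collect them. It is not the benchmark suite's actual code: `benchmark`, `summarize`, `RunResult`, and `toy_extract` are hypothetical names, and the extractor is a stand-in for whichever library call you want to measure. It also assumes that a run counts as successful when it returns non-empty text without raising.

```python
import time
import tracemalloc
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Iterable


@dataclass
class RunResult:
    name: str
    seconds: float
    peak_bytes: int
    ok: bool


def benchmark(
    extract: Callable[[bytes], str],
    docs: Iterable[tuple[str, bytes]],
    iterations: int = 3,
) -> list[RunResult]:
    """Run `extract` on every document `iterations` times and record metrics."""
    results = []
    for name, payload in docs:
        for _ in range(iterations):
            # tracemalloc only sees allocations made through Python's
            # allocator; native memory used by C extensions is not counted.
            tracemalloc.start()
            start = time.perf_counter()
            try:
                text = extract(payload)
                ok = bool(text.strip())  # assumption: empty output is a failure
            except Exception:
                ok = False
            seconds = time.perf_counter() - start
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            results.append(RunResult(name, seconds, peak, ok))
    return results


def summarize(results: list[RunResult]) -> dict:
    """Aggregate per-run records into the headline metrics."""
    if not results:
        return {"runs": 0}
    total_time = sum(r.seconds for r in results)
    return {
        "runs": len(results),
        "success_rate": sum(r.ok for r in results) / len(results),
        "docs_per_second": len(results) / total_time if total_time > 0 else float("inf"),
        "mean_peak_mb": mean(r.peak_bytes for r in results) / 1_000_000,
    }


def toy_extract(payload: bytes) -> str:
    """Hypothetical extractor: decodes UTF-8 and raises on invalid input."""
    return payload.decode("utf-8")
```

In practice you would replace `toy_extract` with a thin wrapper around the library under test and pass in the real file contents. Measuring memory at the process level (rather than with `tracemalloc`) would give a fuller picture for libraries that rely heavily on native code.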
Key Findings from the Benchmarking
Speed and Efficiency
- Kreuzberg emerges as the clear leader in processing speed, handling over 35 documents per second across various formats and sizes. Its optimized architecture allows for rapid, reliable extraction suitable for production deployments and edge environments like serverless platforms.
- Unstructured demonstrates a balanced profile with good accuracy and reliability but operates at a moderate pace, making it suitable for complex enterprise workflows.
- MarkItDown, while efficient for straightforward documents, shows limitations with complex or sizable files, especially those over 10MB.
- Docling, leveraging advanced machine learning models, can provide insightful extraction but suffers from significant slowdowns, sometimes taking over an hour per file and frequently hitting timeouts on medium-sized documents.
Installation Footprint and Resource Usage
- Kreuzberg’s lightweight design is notable, with a total size of approximately 71MB and only 20 dependencies, making deployment straightforward.
- In contrast, Unstructured requires around 146MB and 54 dependencies, while MarkItDown comes in at about 251MB with 25 dependencies, partly due to bundled dependencies such as ONNX.
- Docling’s installation footprint is substantial, exceeding 1GB