Comprehensive Benchmark Report: Evaluating Python Text Extraction Libraries in 2025
In the rapidly evolving landscape of document processing, choosing the right Python library for text extraction can be daunting. To shed light on this critical decision, I recently conducted an extensive and transparent benchmarking exercise of four prominent libraries, covering real-world documents of diverse sizes and formats. Here's a detailed overview of what I found, designed to help developers, data scientists, and enterprise users make informed choices.
Benchmarking Overview: Setting the Stage
The need for reliable, fast, and resource-efficient text extraction tools cannot be overstated. For this study, I tested four leading Python libraries across 94 authentic documents, including PDFs, Word files, HTML pages, images, and spreadsheets, with sizes ranging from tiny files under 100KB to massive academic papers over 50MB. The documents spanned six languages: English, Hebrew, German, Chinese, Japanese, and Korean. To ensure fair comparisons, all tests were performed on CPU-only environments, with consistent settings and multiple runs for statistical validity.
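To make the methodology concrete, here is a minimal sketch of the kind of timing harness such a study relies on: each library's extraction callable is run over the corpus several times, and the mean and spread of the wall-clock timings are recorded. This is illustrative only; the `benchmark` function and the stub extractor are hypothetical, not the actual benchmark code.

```python
import statistics
import time

def benchmark(extract, paths, runs=3):
    """Time an extraction callable over a corpus, repeating the full
    pass several times for statistical validity."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        for path in paths:
            extract(path)  # any callable: a library-specific extraction
        timings.append(time.perf_counter() - start)
    mean_s = statistics.mean(timings)
    return {
        "mean_s": mean_s,
        "stdev_s": statistics.stdev(timings) if runs > 1 else 0.0,
        "docs_per_sec": len(paths) / mean_s,
    }

# Example with a stub extractor standing in for a real library call:
stats = benchmark(lambda p: p.upper(), [f"doc{i}.pdf" for i in range(10)])
```

Pinning the runs to a CPU-only environment and repeating them, as described above, is what makes the per-library numbers comparable.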
The libraries evaluated include:
– Kreuzberg – my own lightweight, high-speed extractor
– Docling – IBM's machine-learning-powered solution
– MarkItDown – Microsoft's Markdown-focused tool
– Unstructured – an enterprise-grade document processing framework
Results, encompassing speed, memory consumption, success rates, and installation package size, are openly accessible via the interactive dashboard.
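Peak memory figures of the kind reported in the dashboard can be sampled with the standard library alone. The helper below is a hedged sketch (not the benchmark's actual instrumentation) and relies on the `resource` module, which is available on Unix-like systems only.

```python
import resource
import sys

def peak_rss_mb() -> float:
    """Peak resident set size of the current process, in megabytes.
    ru_maxrss is reported in kilobytes on Linux but in bytes on macOS."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return rss / divisor
```

Reading this figure after a batch of extractions gives a rough per-library memory footprint; running each library in its own subprocess keeps the numbers from contaminating one another.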
Key Findings from the Benchmark
1. Performance and Speed
– Kreuzberg leads the pack with processing speeds exceeding 35 documents per second, demonstrating exceptional efficiency and versatility.
– Unstructured offers consistent reliability, processing most document types accurately at a moderate pace.
– MarkItDown performs well with simpler files but encounters performance drops when handling complex formats.
– Docling, despite its advanced ML capabilities, is plagued by significant delays, often taking upwards of an hour per document, making it less practical for large-scale use cases.
2. Installation Footprint
– Kreuzberg stands out with a compact size (~71MB) and minimal dependencies, making deployment straightforward.
– Unstructured has a larger footprint (~146MB), justified by its extensive features.
– MarkItDown requires about 251MB, partly due to embedded