I benchmarked 4 Python text extraction libraries so you don’t have to (2025 results)

Comprehensive Evaluation of Python Text Extraction Libraries in 2025: An In-Depth Benchmarking Report

Choosing a Python library for text extraction is difficult, because the available tools differ widely in speed, reliability, and footprint. To help developers and data scientists decide, this report compares four prominent Python text extraction tools, based on testing conducted in 2025. It examines their performance, reliability, and suitability across a diverse set of real-world documents.

Overview of the Benchmark

Our evaluation examines four leading libraries:

Kreuzberg – an open-source, lightweight solution designed for high-speed processing
Docling – IBM’s machine learning-powered extraction framework
MarkItDown – Microsoft’s Markdown-focused converter for simplified documents
Unstructured – a versatile platform aimed at enterprise-level document handling

The test set comprised 94 documents, including PDFs, Word files, HTML pages, images, and spreadsheets. Files ranged from small (<100KB) to very large (>50MB) and covered multiple languages, including English, Hebrew, German, Chinese, Japanese, and Korean. To keep the comparison fair, all runs used CPU-only systems, and processing time, memory consumption, and success rate were tracked throughout.
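To make the three tracked metrics concrete, here is a minimal sketch of how a single extraction run could be measured. It is not the harness used for these results. The names `RunResult` and `measure` are hypothetical, and `extract` stands in for whichever library call is under test. Note that `tracemalloc` only sees memory allocated through Python, so a real benchmark would probably need process-level measurement to capture native allocations.

```python
import time
import tracemalloc
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    """Outcome of one extraction attempt (hypothetical record type)."""
    path: str
    success: bool
    seconds: float
    peak_bytes: int
    error: Optional[str] = None


def measure(extract: Callable[[str], str], path: str) -> RunResult:
    """Run one extraction and record wall time, peak Python memory, and outcome.

    `extract` is an assumed callable that takes a file path and returns text.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        extract(path)
        success, error = True, None
    except Exception as exc:  # any failure counts against the success rate
        success, error = False, repr(exc)
    seconds = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # returns (current, peak) in bytes
    tracemalloc.stop()
    return RunResult(path, success, seconds, peak, error)
```

Wrapping each library's entry point in a small function with the same `path -> text` shape would let one loop feed every document through every library and collect comparable records.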

Key Findings from the 2025 Results

Performance and Speed

Our tests revealed striking disparities among the libraries:

Kreuzberg emerged as the fastest, processing over 35 documents per second across the board, excelling especially with large or complex files.
Unstructured provided robust reliability but at moderate speeds, making it suitable for scenarios where consistency is paramount.
MarkItDown handled straightforward, simple documents efficiently but struggled with complex or sizable files, particularly those exceeding 10MB.
Docling, despite its advanced ML capabilities, often required more than 60 minutes per document, with frequent timeouts on medium-sized files.
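Figures like "documents per second" and "frequent timeouts" depend on how the raw runs are aggregated, and the exact method is not spelled out here. The sketch below shows one plausible way to do it, under the assumption that throughput is computed over successful runs only and that timeouts are counted separately from other errors. The `Summary` type, the `summarize` function, and the status labels are all hypothetical.

```python
from typing import Iterable, NamedTuple, Tuple


class Summary(NamedTuple):
    docs_per_second: float
    success_rate: float
    timeout_rate: float


def summarize(runs: Iterable[Tuple[str, float]]) -> Summary:
    """Aggregate (status, seconds) pairs, with status in {"ok", "error", "timeout"}.

    Assumption: throughput is successful documents divided by the time
    spent on those successful documents.
    """
    runs = list(runs)
    if not runs:
        raise ValueError("no runs to summarize")
    ok_seconds = [seconds for status, seconds in runs if status == "ok"]
    timeouts = sum(1 for status, _ in runs if status == "timeout")
    total_ok_time = sum(ok_seconds)
    throughput = len(ok_seconds) / total_ok_time if total_ok_time > 0 else 0.0
    return Summary(throughput, len(ok_seconds) / len(runs), timeouts / len(runs))


# Usage: two fast successes, one timeout, one error
summary = summarize([("ok", 0.25), ("ok", 0.25), ("timeout", 60.0), ("error", 0.5)])
print(summary)
```

Whether failed and timed-out runs should count toward the time denominator is a design choice. Excluding them, as here, rewards speed on documents a library can handle, so the success rate needs to be read alongside the throughput number.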

Installation Footprint and Resource Usage

The size of the library installation directly impacts deployment options:

Kreuzberg boasts a compact size of approximately 71MB with only 20 dependencies, making it ideal for resource-constrained environments.
Unstructured is larger, around 146MB, with 54 dependencies, suitable for enterprise contexts.
MarkItDown has a larger footprint (~251MB) and includes heavy dependencies like ONNX.
**Docling