Comprehensive Benchmarking of Python Text Extraction Libraries in 2025: An In-Depth Analysis
In the rapidly evolving landscape of document processing, selecting the right Python library for text extraction can significantly impact project performance and reliability. As part of ongoing efforts to improve Kreuzberg, a Python library dedicated to efficient text extraction, I embarked on an extensive benchmarking project to evaluate how various popular libraries perform under real-world conditions. This article presents the methodology, key results, and practical recommendations based on benchmarking four prominent tools: Kreuzberg, Docling, MarkItDown, and Unstructured.
Purpose and Approach
The primary goal was to provide an honest, data-driven comparison of Python text extraction libraries across diverse document types, sizes, and languages. Unlike cherry-picked or idealized tests, this benchmark emphasizes real-world scenarios, analyzing 94 documents (PDFs, Word files, HTML pages, images, and spreadsheets) ranging from tiny files under 100 KB to massive academic papers up to 59 MB. The testing environment was standardized: all executions ran on CPU-only systems without GPU acceleration, ensuring fairness across tools.
Open-source transparency was a core principle. The entire benchmarking pipeline, data, and analysis scripts are publicly available, enabling reproducibility and ongoing updates.
Libraries Evaluated
- Kreuzberg: Our in-house solution optimized for speed and lightweight deployment.
- Docling: A machine learning-powered framework from IBM, known for advanced document understanding.
- MarkItDown: A Microsoft open-source Markdown converter, often used for simpler document conversions.
- Unstructured: A commercial-grade enterprise document-processing library, widely adopted for its reliability.
Evaluation Metrics
Metrics covered include processing speed (files per second), memory footprint, success rates, installation size, and handling of multilingual content. Performance was assessed across various document sizes and complexities, and the results shed light on each library's strengths and limitations.
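The three runtime metrics (throughput, memory footprint, success rate) can all be captured by a single harness that wraps any library's extraction entry point. The following is a minimal sketch of such a harness, not the project's actual benchmarking code; the `extract` callable stands in for whichever library is under test.

```python
import time
import tracemalloc
from typing import Callable, Iterable

def benchmark(extract: Callable[[str], str], files: Iterable[str]) -> dict:
    """Measure files/sec, peak memory, and success rate for an extractor.

    Illustrative sketch: `extract` is any function that takes a file path
    and returns extracted text, raising on failure.
    """
    files = list(files)
    tracemalloc.start()
    start = time.perf_counter()
    successes = 0
    for path in files:
        try:
            extract(path)
            successes += 1
        except Exception:
            pass  # a failed file lowers the success rate, not the run
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "files_per_sec": len(files) / elapsed if elapsed > 0 else float("inf"),
        "peak_memory_mb": peak / (1024 * 1024),
        "success_rate": successes / len(files) if files else 0.0,
    }
```

Counting exceptions rather than letting them abort the run is what makes the success-rate metric meaningful on a messy real-world corpus.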
Key Findings and Performance Highlights
Speed and Efficiency
| Rank | Library      | Approximate Processing Speed | Notes                                       |
|------|--------------|------------------------------|---------------------------------------------|
| 1    | Kreuzberg    | Over 35 files/sec            | Consistently fast across all document types |
| 2    | Unstructured | Moderate speed, reliable     | Excellent reliability, decent speed         |
| 3    | MarkItDown   | Good for simple documents    | Struggles with complex or large files       |