I benchmarked 4 Python text extraction libraries so you don’t have to (2025 results)

2025 Text Extraction Libraries Benchmark: Which One Comes Out on Top?

When it comes to extracting text from a variety of document formats using Python, selecting the right library can be a daunting task. To assist developers and data scientists in making informed decisions, I’ve conducted a thorough and impartial benchmarking study of four prominent Python text extraction frameworks. The results, based on extensive testing with real-world documents, may challenge your expectations.


Discover the Live Benchmark Results

Visit the interactive dashboard for detailed performance metrics and comparisons.


Setting the Stage

As the creator of Kreuzberg, a lightweight, high-performance text extraction library, I was motivated to evaluate how similar tools stack up in practical scenarios. This benchmarking effort aims to provide honest, reproducible data by testing each library against a diverse collection of 94 real documents, encompassing formats like PDFs, Word files, HTML pages, images, and spreadsheets. The dataset covers a range of sizes, from tiny files under 100KB to massive academic papers exceeding 50MB, across six languages: English, Hebrew, German, Chinese, Japanese, and Korean.

Note: While I am the author of Kreuzberg, these tests are fully automated, open-source, and free from bias, designed solely to deliver transparent performance insights.


The Contenders

  • Kreuzberg: My own library, optimized for speed and minimal dependencies.
  • Docling: IBM’s powerful machine-learning-based document understanding tool.
  • MarkItDown: Microsoft’s simple-to-use Markdown converter, often employed for lightweight processing.
  • Unstructured: An enterprise-oriented library focusing on high reliability across complex documents.

How Did They Perform?

Speed and Efficiency

  • Kreuzberg leads with impressive processing rates, handling over 35 files per second while maintaining reliability across document types.
  • Unstructured offers solid performance: it is slower, but more consistent.
  • MarkItDown excels in straightforward cases, where it is quick and lightweight, but falters with complex or large files.
  • Docling struggles with speed, sometimes taking over an hour to process single documents, making it less suitable for time-sensitive applications.
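
For readers who want to sanity-check a files-per-second figure like the ones above on their own documents, here is a minimal timing sketch. It is not the benchmark harness behind these results. The names `measure_throughput` and `read_plain_text` are hypothetical, and `read_plain_text` is only a stand-in for whichever library call you want to time.

```python
import time
from pathlib import Path
from typing import Callable, Iterable


def measure_throughput(extract: Callable[[Path], str], paths: Iterable[Path]) -> dict:
    """Time `extract` over `paths` and report files/second and success rate."""
    paths = list(paths)
    successes = 0
    start = time.perf_counter()
    for path in paths:
        try:
            extract(path)
            successes += 1
        except Exception:
            # A failed extraction still counts toward the elapsed time.
            pass
    elapsed = time.perf_counter() - start
    return {
        "files": len(paths),
        "successes": successes,
        "seconds": elapsed,
        "files_per_second": len(paths) / elapsed if elapsed > 0 else float("inf"),
        "success_rate": successes / len(paths) if paths else 0.0,
    }


def read_plain_text(path: Path) -> str:
    # Hypothetical stand-in extractor: swap in the library call under test.
    return path.read_text(encoding="utf-8")
```

Passing the extractor in as a callable keeps the timing loop identical across libraries, so only the extraction call itself differs between runs. A single pass like this is noisy; repeating it and discarding the first run as warm-up would give steadier numbers.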

Installation Footprint

  • Kreuzberg stands out with just 71MB and only 20 dependencies, ideal for deployment in resource-constrained environments.
  • **Unstructured
