Meet olmOCR - Open Source OCR for Accurate Document Conversion

olmOCR, developed by the Allen Institute for Artificial Intelligence (AI2), is designed to tackle the challenge of converting PDFs and other documents into plain text while preserving their natural reading order. This tool is particularly useful for handling complex layouts, including tables, equations, and handwriting, making it a valuable asset for researchers, developers, and professionals dealing with large document datasets.

How It Works

olmOCR combines traditional optical character recognition (OCR) with advanced AI techniques. It uses a method called "document-anchoring" to extract text and layout information from PDFs, then processes rasterized images with a vision language model (VLM) to generate accurate, linearized text. This hybrid approach enhances accuracy and reduces errors, or "hallucinations," in the output.

Getting Started

To use olmOCR, you'll need a system with a recent NVIDIA GPU (like RTX 4090, L40S, A100, H100) and at least 20 GB of GPU RAM, plus 30 GB of free disk space. Installation involves setting up a Python environment with specific dependencies, such as poppler-utils and additional fonts, and can be done via Conda. Basic usage includes running commands to process single or multiple PDFs, with results viewable in JSON format.

Unexpected Detail: Cost-Effectiveness

An unexpected benefit is olmOCR's cost-effectiveness at scale: it can convert a million PDF pages for about $190 USD, roughly 1/32 the cost of using GPT-4o APIs, making it an economical choice for extensive document conversion tasks.


Survey Note: Comprehensive Analysis of olmOCR

In the era of artificial intelligence, the demand for high-quality textual data is paramount, especially for training large language models (LLMs) that power many AI applications. While the internet offers a vast repository of text, a significant portion of valuable information is locked within PDF files, which are notoriously difficult to parse due to their complex layouts and lack of logical text structure. This is where olmOCR, an open-source tool developed by the Allen Institute for Artificial Intelligence (AI2), comes into play. This survey note provides a detailed examination of olmOCR, covering its features, functionality, installation, comparison with other tools, limitations, and future prospects, ensuring a thorough understanding for researchers, developers, and document management professionals.

Introduction and Background

olmOCR is designed for high-throughput conversion of PDFs and other documents into plain text, with a focus on preserving the natural reading order. It supports complex elements such as tables, equations, and handwriting, making it versatile for a wide range of document types. The tool is particularly suited for preparing datasets for LLMs, addressing the increasing need for clean, structured text from unstructured documents. Developed by AI2, olmOCR is open-source and licensed under Apache 2.0, encouraging community contributions and customization.


The official olmOCR website provides an overview, while the GitHub repository offers detailed documentation and code. A technical report published on arXiv delves into its methodology, highlighting its use in unlocking trillions of tokens from PDFs for language model training.

Features and Capabilities

olmOCR stands out with its robust feature set, tailored for both accuracy and efficiency:

  • Preservation of Natural Reading Order: Ensures text is extracted in the order intended for reading, maintaining the document's original structure, which is crucial for coherent text processing.
  • Support for Complex Elements: Handles tables, equations, and handwriting, making it suitable for academic papers, technical documentation, and handwritten notes. The technical report notes its ability to preserve structured content like sections, lists, and equations.
  • High-Throughput Conversion: Optimized for large-scale batch processing, it can handle millions of documents, as evidenced by its ability to convert a million PDF pages for $190 USD, compared to $6,240 for GPT-4o batch processing (see Table 3 below).
  • Unique Prompting Technique: Utilizes a vision language model (VLM), fine-tuned from Qwen2-VL-7B-Instruct, with a "document-anchoring" method. This involves extracting text blocks and images with position information, then prompting the VLM with rasterized images and metadata, improving accuracy and reducing hallucinations.
  • Open-Source and Customizable: Available under Apache 2.0, users can modify and extend the tool, with model weights, training code, and datasets released for community use in the project's Hugging Face collection.

The dataset used for training, olmOCR-mix-0225, includes roughly 266,000 PDF pages from over 100,000 documents, comprising web-crawled PDFs and public domain books from the Internet Archive, with a breakdown shown in Table 1 below.

Table 1: Training Set Composition

Source                   Unique docs   Total pages
Web crawled PDFs              99,903       249,332
Internet Archive books         5,601        16,803
Total                        105,504       266,135
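As a quick sanity check, the totals in Table 1 are simply the sums of the two source rows:

```python
# Figures from Table 1: (unique docs, total pages) per source.
web_docs, web_pages = 99_903, 249_332
ia_docs, ia_pages = 5_601, 16_803

# Both columns add up to the reported totals.
print(web_docs + ia_docs, web_pages + ia_pages)  # 105504 266135
```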

The web PDFs are estimated to include various document types, as shown in Table 2, highlighting its diverse training data.

Table 2: Web PDFs Breakdown by Document Type (Estimated)

Document Type   Fraction
Academic             60%
Brochure             12%
Legal                11%
Table                 6%
Diagram               5%
Slideshow             2%
Other                 4%

How It Works

olmOCR's operation combines traditional OCR with AI-driven enhancements, offering a hybrid approach for superior results:

  • Document-Anchoring: Extracts text and layout information using PyPDF, identifying text blocks, images, and their coordinates. This step ensures the VLM receives contextual metadata alongside visual input.
  • Rasterization: Converts PDF pages into images, with the longest dimension set to 1024 pixels, optimizing for VLM processing.
  • VLM Prompting: The fine-tuned VLM, based on Qwen2-VL-7B-Instruct, is prompted with the rasterized image and extracted metadata. This allows the model to interpret both visual and structural content, generating linearized text in natural reading order.
  • Batch Processing: Optimized with SGLang and vLLM, it achieves high throughput, with inference costs detailed in Table 3, showing its efficiency compared to commercial APIs.
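The rasterization step lends itself to a one-line calculation. The helper below is an illustrative sketch (not olmOCR's actual code) of scaling a page so that its longest dimension comes out at 1024 pixels:

```python
def raster_size(page_width_pts, page_height_pts, longest_dim=1024):
    """Scale a PDF page's dimensions so its longest side is `longest_dim` px.

    Page dimensions are given in PDF points; the aspect ratio is preserved.
    """
    scale = longest_dim / max(page_width_pts, page_height_pts)
    return (round(page_width_pts * scale), round(page_height_pts * scale))

# A US Letter page (612 x 792 pt) rasterizes to 791 x 1024 px.
print(raster_size(612, 792))  # (791, 1024)
```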

Table 3: Inference Cost Comparison

Model          Hardware    Tokens/sec   Pages/USD   Cost per million pages
GPT-4o API     -           -                   80   $12,480
GPT-4o Batch   -           -                  160   $6,240
marker API     -           -                  800   $1,250
MinerU         L40S        238              1,678   $596
olmOCR         A100 80GB   1,487            3,700   $270
olmOCR         L40S        906              5,200   $190
olmOCR         H100 80GB   3,050            5,200   $190

This cost-effectiveness, especially on H100 80GB hardware, makes olmOCR a compelling choice for large-scale document processing.
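A quick sanity check on Table 3's cost figures: pages per dollar is approximately one million divided by the cost per million pages (the table's throughput-derived figures may differ slightly), and the cost ratio against GPT-4o batch processing reproduces the ~1/32 figure cited earlier:

```python
# Costs per million pages, in USD, taken from Table 3 above.
costs = {"GPT-4o API": 12_480, "GPT-4o Batch": 6_240, "olmOCR (H100 80GB)": 190}

for name, cost_per_million in costs.items():
    pages_per_usd = 1_000_000 / cost_per_million
    print(f"{name}: {pages_per_usd:.0f} pages/USD")

ratio = costs["GPT-4o Batch"] / costs["olmOCR (H100 80GB)"]
print(f"{ratio:.1f}x cheaper than GPT-4o batch")  # 32.8x
```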

Installation and Usage

Getting started with olmOCR requires specific hardware and software setup, detailed as follows:

  • System Requirements: A recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100) with at least 20 GB GPU RAM and 30 GB free disk space. This requirement may limit accessibility for users without high-end hardware.
  • Installation Steps:
    • Install system dependencies: sudo apt-get update and sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools.
    • Create a Conda environment: conda create -n olmocr python=3.11 and activate it with conda activate olmocr.
    • Clone the repository: git clone https://github.com/allenai/olmocr and install with pip install -e ..
    • For GPU inference, install additional packages: pip install sgl-kernel==0.0.3.post1 --force-reinstall --no-deps and pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/.
  • Usage Examples:
    • Process a single PDF: python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf.
    • Process multiple PDFs: python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf.
    • View results: cat localworkspace/results/output_*.jsonl or use the viewer with python -m olmocr.viewer.dolmaviewer localworkspace/results/output_*.jsonl.
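The output_*.jsonl files contain one JSON document per line. The snippet below sketches how you might pull the extracted text out of them; the exact schema may vary by version, so only a `text` field is assumed here, and a synthetic line stands in for real output:

```python
import json

# A synthetic stand-in for one line of localworkspace/results/output_*.jsonl;
# real records carry additional fields such as ids and metadata.
sample_line = '{"id": "doc-1", "text": "Page one text...", "metadata": {}}'

def extract_texts(lines):
    """Yield the plain text of each JSONL record."""
    for line in lines:
        record = json.loads(line)
        yield record["text"]

for text in extract_texts([sample_line]):
    print(text)  # Page one text...
```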

An online demo is also available, allowing users to test the tool without local setup.

Comparison with Other OCR Tools

olmOCR distinguishes itself from other OCR tools through several key aspects:

  • Accuracy: Achieves a 0.875 alignment score with GPT-4o at temperature 0.1 and outperforms tools like Marker and MinerU in the technical report's pairwise ELO evaluations; alignment scores are shown in Table 4 below.

Table 4: Alignment Scores

Model                     Temperature τ   Alignment
GPT-4o (self-alignment)   0.1             0.954
GPT-4o mini               0.1             0.833
olmOCR                    0.8             0.859
olmOCR                    0.1             0.875

  • Speed and Cost: Its optimized pipeline reduces costs to $190 per million pages on H100 80GB, compared to $12,480 for GPT-4o API, as seen in Table 3.
  • Customizability: Being open-source, it allows users to fine-tune for specific needs, unlike many commercial tools.
  • Complex Layout Handling: The document-anchoring and VLM approach excels at tables, equations, and handwriting, addressing limitations in traditional OCR tools like GOT-OCR 2.0.

Downstream evaluation results, shown in Table 6, indicate olmOCR improves OLMo-2-7B by +1.3 percentage points on average across benchmarks like MMLU, ARC, and DROP, compared to Grobid + rules.

Table 6: Downstream Evaluation Results

PeS2o version                            Average   MMLU   ARC    DROP   HSwag   NQ     WinoG
Grobid + rules (Soldaini and Lo, 2023)      53.9   61.1   75.0   42.3    57.4   29.4   58.3
olmOCR                                      55.2   61.1   76.4   43.7    62.6   29.1   58.0
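The +1.3-point claim can be checked directly from Table 6: averaging the six benchmark columns is consistent with the reported row averages of 53.9 and 55.2.

```python
# Benchmark scores from Table 6, in the order MMLU, ARC, DROP, HSwag, NQ, WinoG.
scores = {
    "Grobid + rules": [61.1, 75.0, 42.3, 57.4, 29.4, 58.3],
    "olmOCR":         [61.1, 76.4, 43.7, 62.6, 29.1, 58.0],
}

avg_grobid = sum(scores["Grobid + rules"]) / 6
avg_olmocr = sum(scores["olmOCR"]) / 6

print(f"Grobid + rules: {avg_grobid:.2f}")            # 53.92
print(f"olmOCR:         {avg_olmocr:.2f}")            # 55.15
print(f"delta:          {avg_olmocr - avg_grobid:.2f} pp")  # 1.23
```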

Limitations

Despite its strengths, olmOCR has notable limitations:

  • Language Support: Currently fine-tuned on English documents, its effectiveness with other languages may vary, as noted in various sources, limiting its global applicability.
  • GPU Requirements: The need for a powerful NVIDIA GPU with at least 20 GB RAM may exclude users without access to such hardware, potentially restricting its use in resource-constrained environments.
  • Learning Curve: Setup and usage involve technical expertise, particularly for GPU computing and Python programming, which may pose challenges for non-technical users.

Future Developments

The developers are actively working on enhancing olmOCR, with potential future updates including:

  • Expanded Language Support: Plans to improve performance with non-English documents, broadening its usability.
  • Enhanced Performance: Optimizations for faster processing and lower hardware requirements.
  • Additional Features: Potential support for more document types, such as improved handling of footnotes, bibliographies, and diagrams, addressing current gaps like the lack of diagram interpretation noted in Simon Willison's blog.

Conclusion

olmOCR represents a significant advancement in PDF-to-text conversion, combining traditional OCR with AI-powered VLMs to achieve high accuracy and efficiency. Its cost-effectiveness, especially for large-scale processing, and open-source nature make it a valuable tool for researchers and developers. While it has limitations, such as language support and hardware requirements, ongoing developments promise to address these issues. To explore further, visit the official olmOCR website or the GitHub repository.
