Meet olmOCR - Open Source OCR for Accurate Document Conversion
olmOCR, developed by the Allen Institute for Artificial Intelligence (AI2), is designed to tackle the challenge of converting PDFs and other documents into plain text while preserving their natural reading order. This tool is particularly useful for handling complex layouts, including tables, equations, and handwriting, making it a valuable asset for researchers, developers, and professionals dealing with large document datasets.
How It Works
olmOCR combines traditional optical character recognition (OCR) with advanced AI techniques. It uses a method called "document-anchoring" to extract text and layout information from PDFs, then processes rasterized images with a vision language model (VLM) to generate accurate, linearized text. This hybrid approach enhances accuracy and reduces errors, or "hallucinations," in the output.
Getting Started
To use olmOCR, you'll need a system with a recent NVIDIA GPU (like RTX 4090, L40S, A100, H100) and at least 20 GB of GPU RAM, plus 30 GB of free disk space. Installation involves setting up a Python environment with specific dependencies, such as poppler-utils and additional fonts, and can be done via Conda. Basic usage includes running commands to process single or multiple PDFs, with results viewable in JSON format.
Unexpected Detail: Cost-Effectiveness
An unexpected benefit is olmOCR's cost-effectiveness for large-scale processing: it converts a million PDF pages for about $190 USD, roughly 1/32 the cost of GPT-4o batch processing, making it an economical choice for extensive document conversion tasks.
Survey Note: Comprehensive Analysis of olmOCR
In the era of artificial intelligence, the demand for high-quality textual data is paramount, especially for training large language models (LLMs) that power many AI applications. While the internet offers a vast repository of text, a significant portion of valuable information is locked within PDF files, which are notoriously difficult to parse due to their complex layouts and lack of logical text structure. This is where olmOCR, an open-source tool developed by the Allen Institute for Artificial Intelligence (AI2), comes into play. This survey note provides a detailed examination of olmOCR, covering its features, functionality, installation, comparison with other tools, limitations, and future prospects, ensuring a thorough understanding for researchers, developers, and document management professionals.
Introduction and Background
olmOCR is designed for high-throughput conversion of PDFs and other documents into plain text, with a focus on preserving the natural reading order. It supports complex elements such as tables, equations, and handwriting, making it versatile for a wide range of document types. The tool is particularly suited for preparing datasets for LLMs, addressing the increasing need for clean, structured text from unstructured documents. Developed by AI2, olmOCR is open-source and licensed under Apache 2.0, encouraging community contributions and customization.
The official olmOCR website provides an overview, while the GitHub repository offers detailed documentation and code. A technical report published on arXiv details the methodology, highlighting olmOCR's role in unlocking trillions of tokens from PDFs for language model training.
Features and Capabilities
olmOCR stands out with its robust feature set, tailored for both accuracy and efficiency:
- Preservation of Natural Reading Order: Ensures text is extracted in the order intended for reading, maintaining the document's original structure, which is crucial for coherent text processing.
- Support for Complex Elements: Handles tables, equations, and handwriting, making it suitable for academic papers, technical documentation, and handwritten notes. The technical report notes its ability to preserve structured content like sections, lists, and equations.
- High-Throughput Conversion: Optimized for large-scale batch processing, it can handle millions of documents, as evidenced by its ability to convert a million PDF pages for $190 USD, compared to $6,240 for GPT-4o batch processing (see Table 3 below).
- Unique Prompting Technique: Utilizes a vision language model (VLM), fine-tuned from Qwen2-VL-7B-Instruct, with a "document-anchoring" method. This involves extracting text blocks and images with position information, then prompting the VLM with rasterized images and metadata, improving accuracy and reducing hallucinations.
- Open-Source and Customizable: Available under Apache 2.0, users can modify and extend the tool, with model weights, training code, and datasets released for community use in a Hugging Face collection.
The dataset used for training, olmOCR-mix-0225, includes about 266,000 PDF pages from over 105,000 documents, comprising web-crawled PDFs and public domain books from the Internet Archive, with a breakdown shown in Table 1 below.
Table 1: Training Set Composition
| Source | Unique docs | Total pages |
|---|---|---|
| Web crawled PDFs | 99,903 | 249,332 |
| Internet Archive books | 5,601 | 16,803 |
| Total | 105,504 | 266,135 |
The web PDFs are estimated to include various document types, as shown in Table 2, highlighting its diverse training data.
Table 2: Web PDFs Breakdown by Document Type (Estimated)
| Document Type | Fraction |
|---|---|
| Academic | 60% |
| Brochure | 12% |
| Legal | 11% |
| Table | 6% |
| Diagram | 5% |
| Slideshow | 2% |
| Other | 4% |
How It Works
olmOCR's operation combines traditional OCR with AI-driven enhancements, offering a hybrid approach for superior results (a minimal code sketch follows this list):
- Document-Anchoring: Extracts text and layout information using PyPDF, identifying text blocks, images, and their coordinates. This step ensures the VLM receives contextual metadata alongside visual input.
- Rasterization: Converts PDF pages into images, with the longest dimension set to 1024 pixels, optimizing for VLM processing.
- VLM Prompting: The fine-tuned VLM, based on Qwen2-VL-7B-Instruct, is prompted with the rasterized image and extracted metadata. This allows the model to interpret both visual and structural content, generating linearized text in natural reading order.
- Batch Processing: Optimized with SGLang and vLLM for high throughput; inference costs are detailed in Table 3, showing its efficiency compared to commercial APIs.
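To make those stages concrete, here is a minimal per-page sketch under stated assumptions: it uses pypdf for the text-layer anchor, pdf2image (backed by poppler) for rasterization, and an OpenAI-compatible VLM endpoint such as one served locally by SGLang. This is not olmOCR's actual implementation; the real prompt template and anchor format (which includes block and image coordinates) live in the olmOCR repository, and the BASE_URL, model name, prompt wording, and 4,000-character anchor cap below are placeholders.

```python
# Minimal sketch of the document-anchoring idea (not olmOCR's actual code).
import base64
import io

from openai import OpenAI
from pdf2image import convert_from_path  # rasterization via poppler
from pypdf import PdfReader              # text-layer extraction

BASE_URL = "http://localhost:30000/v1"  # hypothetical local SGLang endpoint


def build_anchor(pdf_path: str, page_index: int) -> str:
    """Pull raw text from the PDF's text layer to anchor the VLM prompt.
    olmOCR's real anchors also carry text-block and image coordinates."""
    page = PdfReader(pdf_path).pages[page_index]
    return (page.extract_text() or "")[:4000]  # cap the anchor length


def rasterize_base64(pdf_path: str, page_index: int) -> str:
    """Render one page and shrink it so the longest side is 1024 px."""
    image = convert_from_path(
        pdf_path, first_page=page_index + 1, last_page=page_index + 1
    )[0]
    image.thumbnail((1024, 1024))  # shrinks in place, preserving aspect ratio
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode()


def linearize_page(pdf_path: str, page_index: int = 0) -> str:
    """Send the page image plus its anchor to the VLM; return plain text."""
    client = OpenAI(base_url=BASE_URL, api_key="unused")
    response = client.chat.completions.create(
        model="olmocr",  # model name depends on how the server was launched
        temperature=0.1,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe this page in natural reading order. "
                         "Raw text layer for reference:\n"
                         + build_anchor(pdf_path, page_index)},
                {"type": "image_url",
                 "image_url": {"url": "data:image/png;base64,"
                               + rasterize_base64(pdf_path, page_index)}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Passing the text-layer anchor alongside the image is what lets the model ground its transcription in the PDF's own content rather than relying on vision alone, which is the mechanism the report credits for reduced hallucinations.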
Table 3: Inference Cost Comparison
| Model | Hardware | Tokens/sec | Pages/USD | Cost per million pages |
|---|---|---|---|---|
| GPT-4o API | - | 80 | - | $12,480 |
| GPT-4o Batch | - | 160 | - | $6,240 |
| marker API | - | 800 | - | $1,250 |
| MinerU | L40S | 238 | 1,678 | $596 |
| olmOCR | A100 80GB | 1,487 | 3,700 | $270 |
| olmOCR | L40S | 906 | 5,200 | $190 |
| olmOCR | H100 80GB | 3,050 | 5,200 | $190 |
At 5,200 pages per USD, a million pages works out to roughly $192, consistent with the $190 figure above. This cost-effectiveness, on L40S as well as H100 80GB hardware, makes olmOCR a compelling choice for large-scale document processing.
Installation and Usage
Getting started with olmOCR requires specific hardware and software setup, detailed as follows:
- System Requirements: A recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100) with at least 20 GB GPU RAM and 30 GB free disk space. This requirement may limit accessibility for users without high-end hardware; a quick hardware sanity-check sketch appears at the end of this section.
- Installation Steps:
  - Install system dependencies: `sudo apt-get update`, then `sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools`.
  - Create a Conda environment with `conda create -n olmocr python=3.11` and activate it with `conda activate olmocr`.
  - Clone the repository with `git clone https://github.com/allenai/olmocr`; then, from inside the cloned directory, install with `pip install -e .`.
  - For GPU inference, install additional packages: `pip install sgl-kernel==0.0.3.post1 --force-reinstall --no-deps`, then `pip install "sglang[all]==0.4.2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/`.
- Usage Examples:
  - Process a single PDF: `python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf`
  - Process multiple PDFs: `python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf`
  - View results with `cat localworkspace/results/output_*.jsonl`, or use the viewer with `python -m olmocr.viewer.dolmaviewer localworkspace/results/output_*.jsonl` (a sketch for reading these files programmatically follows below).
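The results are JSON Lines files in AI2's Dolma style, one document per line. As a minimal sketch of consuming them, note that the field names below ("id", "text") are assumptions drawn from the Dolma convention; verify them against a line of your own output before relying on them.

```python
# Sketch: scan olmOCR's JSONL results. The "id" and "text" field names
# follow the Dolma convention and are assumptions -- confirm the schema
# against a line of your own output (e.g. with `head -n 1`).
import glob
import json

for path in glob.glob("localworkspace/results/output_*.jsonl"):
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            print(record.get("id"), "->", len(record.get("text", "")), "chars")
```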
An online demo is also available, allowing users to test the tool without local setup.
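Before committing to a long local run, it can help to sanity-check a machine against the requirements listed above. The snippet below is an illustrative check, not part of olmOCR itself, and assumes a CUDA-enabled PyTorch install.

```python
# Illustrative pre-flight check against olmOCR's stated requirements:
# a recent NVIDIA GPU with >= 20 GB VRAM and >= 30 GB of free disk.
import shutil

import torch  # assumes a CUDA-enabled PyTorch build


def preflight(workdir: str = ".") -> None:
    assert torch.cuda.is_available(), "No CUDA-capable GPU detected"
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    disk_gb = shutil.disk_usage(workdir).free / 1e9
    print(f"GPU: {torch.cuda.get_device_name(0)} ({vram_gb:.0f} GB VRAM)")
    print(f"Free disk at {workdir!r}: {disk_gb:.0f} GB")
    assert vram_gb >= 20, "olmOCR recommends at least 20 GB of GPU RAM"
    assert disk_gb >= 30, "olmOCR recommends at least 30 GB of free disk"


preflight()
```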
Comparison with Other OCR Tools
olmOCR distinguishes itself from other OCR tools through several key aspects:
- Accuracy: Achieves a 0.875 alignment score with GPT-4o at temperature 0.1 (see Table 4 below) and outperforms tools like Marker and MinerU in the technical report's pairwise ELO evaluations.
Table 4: Alignment Scores
| Model | Temperature τ | Alignment |
|---|---|---|
| GPT-4o (self-alignment) | 0.1 | 0.954 |
| GPT-4o mini | 0.1 | 0.833 |
| olmOCR | 0.8 | 0.859 |
| olmOCR | 0.1 | 0.875 |
- Speed and Cost: Its optimized pipeline reduces costs to $190 per million pages on H100 80GB, compared to $12,480 for GPT-4o API, as seen in Table 3.
- Customizability: Being open-source, it allows users to fine-tune for specific needs, unlike many commercial tools.
- Complex Layout Handling: The document-anchoring and VLM approach excels at tables, equations, and handwriting, addressing limitations in traditional OCR tools like GOT-OCR 2.0.
Downstream evaluation results, shown in Table 6, indicate olmOCR improves OLMo-2-7B by +1.3 percentage points on average across benchmarks like MMLU, ARC, and DROP, compared to Grobid + rules.
Table 6: Downstream Evaluation Results
| peS2o version | Average | MMLU | ARC | DROP | HSwag | NQ | WinoG |
|---|---|---|---|---|---|---|---|
| Grobid + rules (Soldaini and Lo, 2023) | 53.9 | 61.1 | 75.0 | 42.3 | 57.4 | 29.4 | 58.3 |
| olmOCR | 55.2 | 61.1 | 76.4 | 43.7 | 62.6 | 29.1 | 58.0 |
Limitations
Despite its strengths, olmOCR has notable limitations:
- Language Support: olmOCR is currently fine-tuned on English documents, so its effectiveness with other languages may vary, limiting its global applicability.
- GPU Requirements: The need for a powerful NVIDIA GPU with at least 20 GB RAM may exclude users without access to such hardware, potentially restricting its use in resource-constrained environments.
- Learning Curve: Setup and usage involve technical expertise, particularly for GPU computing and Python programming, which may pose challenges for non-technical users.
Future Developments
The developers are actively working on enhancing olmOCR, with potential future updates including:
- Expanded Language Support: Plans to improve performance with non-English documents, broadening its usability.
- Enhanced Performance: Optimizations for faster processing and lower hardware requirements.
- Additional Features: Potential support for more document types, such as improved handling of footnotes, bibliographies, and diagrams, addressing current gaps like the lack of diagram interpretation noted in Simon Willison's blog.
Conclusion
olmOCR represents a significant advancement in PDF-to-text conversion, combining traditional OCR with AI-powered VLMs to achieve high accuracy and efficiency. Its cost-effectiveness for large-scale processing and its open-source nature make it a valuable tool for researchers and developers. While it has limitations, such as English-only fine-tuning and high hardware requirements, ongoing development promises to address them. To explore further, visit the official olmOCR website or the GitHub repository at https://github.com/allenai/olmocr.