Overview
Odia is a low-resource Indic language with limited support for robust Optical Character Recognition (OCR). While OCR technology has been studied globally for decades, and various approaches exist for printed and handwritten text recognition across languages, Odia remains significantly underserved in both research and practical tools.
The Odia script is one of the oldest scripts in India, with roots tracing back over 1,500 years. It evolved from the Kalinga script, which itself is a descendant of the Brahmi script. Over the centuries, it developed its distinctive rounded shapes, which were ideal for writing on palm leaves, a traditional medium used in Odisha. Straight lines could easily tear the delicate leaves, so scribes preferred curved letters, which gave the Odia script its signature circular form.

Figure 1: Odia Script Across Different Mediums (Palm leaf, Metal plate, Paper)
Most Odia literature, newspapers, and historical manuscripts survive only as palm-leaf documents, scanned images, or other physical formats, which makes them difficult to digitize, search, and process with current tools. This lack of robust OCR support limits accessibility and inhibits downstream language technology development.
Key Challenges
- Complex script structure: Odia ligatures and diacritics are difficult for conventional OCR systems and handcrafted feature models to recognize accurately.
- Data scarcity: There is a severe lack of annotated datasets for training modern machine learning and deep learning OCR models.
- Limited tooling: Existing systems rely largely on classical pattern recognition techniques, and open-source options with high accuracy on Odia are lacking.
- Handwritten text gap: Handwritten Odia is largely unsupported, and available techniques recognize it with poor accuracy.
Without a robust, open, high-accuracy OCR system, Odia text remains inaccessible for digital archiving, computational analysis, and modern language technology applications such as NLP, search indexing, and multimodal AI systems.
Digitizing Odia Script: The Odia Lipi Project
Odia literature, manuscripts, newspapers, and historical documents hold centuries of knowledge, but much of it remains inaccessible in digital form.
Odia Lipi is an open, community-driven initiative by OdiaGenAI in collaboration with AHRC IIT Bhubaneswar to build a robust and high-accuracy OCR system along with data for the Odia language.
The project focuses on digitizing printed and handwritten Odia text from diverse sources, including palm-leaf manuscripts, historical documents, newspapers, and scanned images, by combining modern OCR techniques with multimodal large language models (LLMs).
We have benchmarked existing LLMs for Odia text extraction and now plan to fine-tune state-of-the-art multimodal LLMs, such as Qwen and Gemini, to improve OCR performance. Their ability to jointly interpret images and text should help handle complex ligatures, improve accuracy, and enable more advanced language understanding for Odia.
Beyond text recognition, Odia Lipi aims to enable searchable, Unicode-compliant digital text and support downstream language technologies, including NLP, multimodal AI, and digital archiving, helping preserve and empower the Odia language and heritage in the digital era.
Framework
The Odia Lipi project follows an end-to-end, modular framework designed to build a robust, scalable, and open OCR system for the Odia language. The framework integrates classical OCR pipelines, deep learning models, and multimodal Large Language Models (LLMs) to handle both printed and handwritten Odia text from diverse and challenging sources.

Figure 2: Overall Framework of Odia Lipi Project
Odia OCR Annotation Platform
The Odia OCR Annotation Platform is an easy‑to‑use tool designed to support the digitization and annotation of Odia text from images. It lets users automatically extract Odia text from images or scanned documents, making visual text machine‑readable and editable.
The platform also includes an integrated spell checker to improve transcription accuracy and a synthetic OCR data generation feature that helps create additional annotated training data for OCR models. This combination of extraction, correction, and data augmentation makes it ideal for building high‑quality OCR datasets for Odia language research and applications.
Odia Spell Checker
The Odia Spell Checker, integrated with our OCR annotation platform, uses Levenshtein Distance and B-Tree indexing for fast and accurate word correction. It leverages the OdiaNLP dictionary to enhance OCR outputs for both printed and handwritten texts.
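The idea behind dictionary-based correction can be sketched in a few lines of Python. This is a minimal illustration, not the platform's implementation: the word list `SAMPLE_WORDS`, the function names, and the `max_distance` threshold are hypothetical, and a sorted list with binary search stands in for the B-Tree index and the OdiaNLP dictionary described above.

```python
from bisect import bisect_left


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, counted in Unicode code points."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def suggest(word: str, sorted_words: list[str], max_distance: int = 2) -> str:
    """Return the closest dictionary word, or the input if nothing is close."""
    # Fast path: exact lookup by binary search (stand-in for an ordered index).
    index = bisect_left(sorted_words, word)
    if index < len(sorted_words) and sorted_words[index] == word:
        return word

    best_word, best_distance = word, max_distance + 1
    for candidate in sorted_words:
        # A length gap this large cannot beat the best distance found so far.
        if abs(len(candidate) - len(word)) >= best_distance:
            continue
        distance = levenshtein(word, candidate)
        if distance < best_distance:
            best_word, best_distance = candidate, distance
    return best_word


# Hypothetical three-word dictionary, for illustration only.
SAMPLE_WORDS = sorted(["ଭାଷା", "ଭାରତ", "ପୁସ୍ତକ"])

if __name__ == "__main__":
    # An OCR output that dropped the final vowel sign.
    print(suggest("ଭାଷ", SAMPLE_WORDS))
```

The linear scan over candidates is the simplest fallback; a real dictionary would need a smarter candidate search to stay fast at scale.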
Synthetic Data Generation
Our synthetic data generation pipeline, integrated with the OCR annotation platform, creates Odia text resembling real-world documents, simulating variations in format, structure, and content. This enhances OCR model robustness and performance across diverse text types.

Figure 3: Synthetic Data Generated Through the Odia OCR Annotation Platform
Dataset
The dataset is a small, curated test set of Odia text paired with scanned images, created to support the development and evaluation of OCR models for the Odia script.
Each data entry includes an image of Odia text alongside its validated Unicode text annotation, making it ideal for training, validating, and benchmarking OCR systems.
The dataset is publicly available on Hugging Face for research and non-commercial purposes.
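One common way to score an OCR system against image and text pairs like these is the character error rate (CER): the edit distance between the predicted and reference text, divided by the reference length. The sketch below is an assumption about how such a benchmark could be computed, not the project's published evaluation code. The function names are hypothetical, and both strings are normalized to NFC first so that equivalent Unicode encodings of the same Odia character are not counted as errors.

```python
import unicodedata


def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance over Unicode code points."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the prediction
                current[j - 1] + 1,      # extra character in the prediction
                previous[j - 1] + cost,  # substituted (or matching) character
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, prediction: str) -> float:
    """CER = edit distance / reference length, after NFC normalization."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", prediction)
    if not ref:
        # Convention chosen for this sketch: an empty reference scores 0.0
        # only when the prediction is empty as well.
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical pair: the prediction dropped the final vowel sign.
    print(character_error_rate("ଭାଷା", "ଭାଷ"))
```

Counting code points is the simplest choice; counting grapheme clusters instead would weight a missing vowel sign or conjunct differently, so any reported CER should state which unit it uses.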

Figure 4: Sample Dataset
Conclusion
The Odia Lipi project is a collaborative initiative dedicated to preserving and modernizing Odia language resources. By combining high-quality OCR datasets, OCR models, and multimodal LLMs, it aims to enable accurate digitization of both printed and handwritten Odia texts, making them editable, searchable, and machine-readable.
We invite researchers, developers, and language enthusiasts to join us in building the OCR dataset, OCR models, and applications for Odia, contributing to a digital presence that brings Odia literature, manuscripts, and historical texts to the world.
