OdiaLipi-Qwen2.5-VLM

Bridging Vision and Language for Odia OCR


Overview

OdiaLipi-Qwen2.5-VLM is a vision-language OCR model designed to accurately recognize printed and handwritten Odia text from images and documents. Built on top of the Qwen2.5 Vision-Language architecture, the model bridges visual understanding and linguistic knowledge to address long-standing challenges in Odia OCR, such as complex character shapes, ligatures, and layout variations.

Unlike traditional OCR systems that rely on script-specific heuristics, OdiaLipi-Qwen2.5-VLM leverages multimodal learning to jointly reason over image content and language structure. This enables more robust recognition across diverse document types, including scanned books, historical manuscripts, handwritten notes, and real-world images.

The model has been fine-tuned on merged Odia OCR datasets and evaluated using standard OCR metrics such as Character Error Rate (CER) and Word Error Rate (WER). Early results demonstrate significant improvements over conventional OCR baselines, particularly in scenarios involving degraded scans, handwritten text, and complex layouts.

By making Odia OCR more accurate and accessible, OdiaLipi-Qwen2.5-VLM aims to support large-scale digitization efforts, downstream NLP research, and the preservation of Odia’s rich textual heritage.

Dataset

For OdiaLipi-Qwen2.5-VLM, we used the Odia OCR Merged dataset, a comprehensive multi-source Odia OCR collection curated by the OdiaGenAIOCR community. This dataset merges three major sources to provide a robust foundation of 192,000+ samples for both training and evaluation:

  • OdiaGenAIOCR/Odia-lipi-ocr-data: 64 word-level samples
  • tell2jyoti/odia-handwritten-ocr: 182,152 character-level samples (32x32px)
  • darknight054/indic-mozhi-ocr (Odia subset): 10,000+ printed word images

Dataset composition:

  • 🖨️ Printed Odia text images — scans from books, articles, and documents with varied fonts and layouts
  • ✍️ Handwritten Odia samples — real handwritten pages and notes
  • 🧩 Mixed document styles — clear prints, noisy scans, uneven lighting, and diverse backgrounds

All images are paired with ground-truth Odia Unicode transcriptions, enabling supervised OCR training. The dataset includes all 47 Odia characters (vowels, consonants, and special marks), ensuring comprehensive script coverage.
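Because every sample carries a Unicode transcription, script coverage can be checked directly against the Odia Unicode block (U+0B00–U+0B7F). The helper below is an illustrative sketch, not part of the released dataset tooling; the punctuation set it accepts is an assumption:

```python
# Sketch: validate that a ground-truth transcription stays within the
# Odia Unicode block (U+0B00-U+0B7F), plus whitespace and basic punctuation.
# Illustrative only -- not part of the released dataset tooling.

ODIA_BLOCK = range(0x0B00, 0x0B80)  # Oriya/Odia Unicode block

def is_odia_char(ch: str) -> bool:
    """True if ch is a code point in the Odia Unicode block."""
    return ord(ch) in ODIA_BLOCK

def is_valid_transcription(text: str) -> bool:
    """Accept Odia characters plus whitespace and common punctuation."""
    return all(is_odia_char(ch) or ch.isspace() or ch in ".,;?!|" for ch in text)
```

A check like this is useful for catching mislabeled samples (e.g. Latin text leaking into an Odia ground-truth field) before training.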

Data splits:

  • Training: 58,720 examples
  • Validation: 7,340 examples
  • Test: 7,340 examples

This diverse and well-annotated dataset makes OdiaLipi-Qwen2.5-VLM robust for printed and handwritten text recognition, document-level OCR, and downstream applications like text digitization and language modeling.

Model

odia-ocr-qwen-finetuned is a vision-language model fine-tuned from Qwen2.5-VL-3B-Instruct for high-accuracy Odia OCR. Trained on 58,720 validated text-image pairs from the Odia OCR Merged dataset, it bridges visual understanding and Odia language generation to accurately recognize both printed and handwritten Odia text.

Architecture & Features

  • Base Model: Qwen2.5-VL-3B-Instruct (3.78B parameters)
  • Type: Vision-Language Model (VLM) – multimodal
  • Precision: bfloat16 (mixed precision)
  • Vision Encoder: Extracts spatial and textural features from document images
  • Language Decoder: Generates Odia Unicode text, handling ligatures, conjunct consonants, and special marks
  • Cross-Modal Attention: Aligns visual input with Odia tokens for precise transcription
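An inference pass through this architecture can be sketched with the Transformers Qwen2.5-VL classes. This is a minimal sketch, not the project's official inference script: the checkpoint path, prompt wording, and generation settings are assumptions, and loading the model requires a recent `transformers` release with Qwen2.5-VL support plus a GPU for reasonable latency:

```python
def build_ocr_messages(image_path: str,
                       prompt: str = "Extract the Odia text from this image.") -> list:
    """Build a Qwen2.5-VL chat message pairing an image with an OCR instruction."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ],
    }]

def run_ocr(model_dir: str, image_path: str) -> str:
    """Load a fine-tuned Qwen2.5-VL checkpoint and transcribe one image.

    model_dir is a placeholder for the released checkpoint directory;
    the exact Hugging Face repo name may differ.
    """
    import torch
    from PIL import Image
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_dir, torch_dtype=torch.bfloat16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_dir)

    messages = build_ocr_messages(image_path)
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(
        text=[text], images=[Image.open(image_path)], return_tensors="pt"
    ).to(model.device)

    out = model.generate(**inputs, max_new_tokens=512)
    # Strip the prompt tokens before decoding the transcription.
    trimmed = out[:, inputs.input_ids.shape[1]:]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
```

The bfloat16 load matches the precision listed above; `device_map="auto"` lets Accelerate place the 3.78B-parameter model on available GPUs.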

Training Highlights

  • Training Samples: 57,565
  • Evaluation Samples: 1,155
  • Batch Size: 1 (gradient accumulation x2)
  • Learning Rate: 2e-4
  • Hardware: NVIDIA A100 (80GB)
  • Training Time: ~4 hours (3 epochs)
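The hyperparameters above imply an effective batch size of 2 (per-device batch size × gradient-accumulation steps). A sketch capturing them in one place; the key names are illustrative and this is not the actual training script:

```python
# Illustrative config mirroring the reported hyperparameters.
# Key names are hypothetical -- the actual fine-tuning script is not shown here.
TRAIN_CONFIG = {
    "base_model": "Qwen/Qwen2.5-VL-3B-Instruct",
    "per_device_batch_size": 1,
    "gradient_accumulation_steps": 2,
    "learning_rate": 2e-4,
    "num_epochs": 3,
    "precision": "bfloat16",
}

def effective_batch_size(cfg: dict) -> int:
    """Number of samples contributing to each optimizer step."""
    return cfg["per_device_batch_size"] * cfg["gradient_accumulation_steps"]
```

Gradient accumulation keeps per-step memory low enough for a 3B-parameter multimodal model on a single A100 while preserving a usable effective batch size.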

Capabilities

  • Recognizes printed and handwritten Odia text from diverse documents
  • Handles complex fonts, ligatures, and mixed layouts
  • Optimized for document digitization, form processing, and archival tasks
  • Outputs Unicode Odia text, ready for downstream NLP or digital preservation pipelines

Performance

  • Character Error Rate (CER): 20–40% (document-dependent)
  • Exact Match Accuracy: 40–70% (post-processed)
  • Post-Processing Success: 100% on validation samples
  • Inference Time: ~30–45 seconds per image on A100 GPU
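CER is the Levenshtein edit distance between prediction and reference divided by the reference length in characters; WER is the same ratio computed over whitespace-separated words. A self-contained sketch of both metrics:

```python
def edit_distance(a, b) -> int:
    """Levenshtein distance via dynamic programming (works on strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    """Character Error Rate: edit operations per reference character."""
    return edit_distance(prediction, reference) / max(len(reference), 1)

def wer(prediction: str, reference: str) -> float:
    """Word Error Rate: edit operations per reference word."""
    ref_words = reference.split()
    return edit_distance(prediction.split(), ref_words) / max(len(ref_words), 1)
```

A CER of 0.30, for example, means roughly 3 character-level edits per 10 reference characters, consistent with the 20–40% range reported above.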

Availability

The OdiaLipi-Qwen2.5-VLM model is freely accessible for developers, researchers, and organizations looking to integrate high-accuracy Odia OCR into their workflows.

You can explore and test the model through our live demo (in progress): https://huggingface.co/spaces/OdiaGenAIOCR/odia-ocr-qwen-finetuned

The model and all associated resources are available for download on Hugging Face.

Conclusion

The OdiaLipi-Qwen2.5-VLM model demonstrates that fine-tuning on Odia-specific data delivers significant improvements in OCR performance. Achieving 58% exact-match accuracy on diverse Odia text images, it establishes a strong baseline for reliable Odia text recognition in real-world applications.