Odia-OCR-Qwen2.5-VLM

Bridging Vision and Language for Odia OCR


Overview

odia-ocr-qwen-finetuned is a vision-language model designed to accurately recognize printed and handwritten Odia text from images and documents. Built on top of the Qwen2.5 Vision-Language architecture, the model bridges visual understanding and linguistic knowledge to address long-standing challenges in Odia OCR, such as complex character shapes, ligatures, and layout variations.

Unlike traditional OCR systems that rely on script-specific heuristics, odia-ocr-qwen-finetuned leverages multimodal learning to jointly reason over image content and language structure. This enables more robust recognition across diverse document types, including scanned books, historical manuscripts, handwritten notes, and real-world images.

The model has been fine-tuned on merged Odia OCR datasets and evaluated using standard OCR metrics such as Character Error Rate (CER) and Word Error Rate (WER). Early results demonstrate significant improvements over conventional OCR baselines, particularly in scenarios involving degraded scans, handwritten text, and complex layouts.

By making Odia OCR more accurate and accessible, odia-ocr-qwen-finetuned aims to support large-scale digitization efforts, downstream NLP research, and the preservation of Odia’s rich textual heritage.

Dataset

To ensure high-quality supervision, we filtered, cleaned, and merged multiple Odia OCR datasets into a single unified benchmark: the Odia OCR Merged dataset. This multi-source collection was curated by OdiaGenAI with active contributions from the ScriptOCR community, creating a robust foundation for both training and evaluation.

The merged dataset integrates three major sources:

  • OdiaGenAIOCR/Odia-lipi-ocr-data — 64 word-level samples
  • tell2jyoti/odia-handwritten-ocr — 182,152 character-level handwritten samples (32×32 px)
  • darknight054/indic-mozhi-ocr (Odia subset) — 10,000+ printed word images

After filtering noisy samples and normalizing annotations, the final dataset represents a wide range of real-world Odia text conditions:

  • 🖨️ Printed Odia text — scanned books, articles, and documents with diverse fonts and layouts
  • ✍️ Handwritten Odia samples — real handwritten pages and notes
  • 🧩 Mixed document styles — clean prints, degraded scans, uneven lighting, and complex backgrounds

All images are paired with ground-truth Odia Unicode transcriptions, enabling supervised OCR training. The dataset covers all 47 Odia characters, including vowels, consonants, and special diacritics, ensuring complete script coverage.
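Because the ground truth is plain Unicode, script coverage can be sanity-checked programmatically. The sketch below is an illustrative helper (not part of the released tooling) that tests whether characters fall in the Oriya Unicode block (U+0B00–U+0B7F), which is one way noisy transcriptions can be flagged during filtering:

```python
def is_odia_char(ch: str) -> bool:
    """True if ch lies in the Oriya Unicode block (U+0B00-U+0B7F)."""
    return 0x0B00 <= ord(ch) <= 0x0B7F

def odia_coverage(text: str) -> float:
    """Fraction of non-space characters that are Odia -- useful for
    flagging noisy ground-truth transcriptions during filtering."""
    chars = [c for c in text if not c.isspace()]
    return sum(is_odia_char(c) for c in chars) / max(len(chars), 1)

print(odia_coverage("ଓଡ଼ିଆ"))  # 1.0 for a fully-Odia string
```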

Data Splits

  • Training: 58,720 samples
  • Validation: 7,340 samples
  • Test: 7,340 samples
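
As a quick consistency check, the reported split sizes correspond to an exact 80/10/10 partition of the 73,400 filtered samples:

```python
# Verify that the published split sizes form an 80/10/10 partition.
TRAIN, VAL, TEST = 58_720, 7_340, 7_340
total = TRAIN + VAL + TEST                      # 73,400 filtered samples
fractions = (TRAIN / total, VAL / total, TEST / total)
print(total, fractions)                         # 73400 (0.8, 0.1, 0.1)
```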

This carefully filtered and well-annotated merged dataset significantly improves the robustness of odia-ocr-qwen-finetuned, making it suitable for printed and handwritten text recognition, document-level OCR, and downstream applications such as text digitization and language modeling.

Model

odia-ocr-qwen-finetuned is a vision-language model fine-tuned from Qwen2.5-VL-3B-Instruct for high-accuracy Odia OCR. Trained on 58,720 validated text-image pairs from the Odia OCR Merged dataset, it bridges visual understanding and Odia language generation to accurately recognize both printed and handwritten Odia text.

Architecture & Features

  • Base Model: Qwen2.5-VL-3B-Instruct (3.78B parameters)
  • Type: Vision-Language Model (VLM) – multimodal
  • Precision: bfloat16 (mixed precision)
  • Vision Encoder: Extracts spatial and textural features from document images
  • Language Decoder: Generates Odia Unicode text, handling ligatures, conjunct consonants, and special marks
  • Cross-Modal Attention: Aligns visual input with Odia tokens for precise transcription
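
As a usage sketch, inference can follow the standard Qwen2.5-VL chat-template flow in Hugging Face transformers. The repository id and prompt wording below are assumptions rather than the official recipe, and a recent transformers release with Qwen2.5-VL support is required:

```python
def build_ocr_messages(image_path: str):
    """Chat-style message asking the model to transcribe an Odia image.
    The prompt wording here is an assumption, not the official recipe."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "Transcribe the Odia text in this image."},
        ],
    }]

def transcribe(image_path: str) -> str:
    # Imports deferred so the sketch can be read without the heavy deps.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    model_id = "OdiaGenAIOCR/odia-ocr-qwen-finetuned"  # assumed Hub id
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    prompt = processor.apply_chat_template(
        build_ocr_messages(image_path), tokenize=False, add_generation_prompt=True
    )
    inputs = processor(
        text=[prompt], images=[Image.open(image_path)], return_tensors="pt"
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    # Strip the prompt tokens before decoding the transcription.
    trimmed = out[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
```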

Training Highlights

  • Base model: Qwen2.5-VL-3B-Instruct
  • Fine-tuning method: LoRA (r=64, α=128)
  • Dataset: ~73,000 Odia word-level OCR image crops from the Odia OCR Merged dataset (58,720 in the training split)
  • Final checkpoint: Step 6,400 (early stopped due to convergence)
  • Train loss near final: ~4.83 (plateau)
  • Best eval loss: ~5.45
  • Training hardware: NVIDIA H100 80 GB
  • LR schedule: Cosine decay (start 3×10⁻⁴ → final ~2.7×10⁻⁵)
  • Effective batch size: 8 (per device 2 × grad accum 4)
  • Duration: ~12.7 hours
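
The hyperparameters above can be sanity-checked with a little arithmetic. In the sketch below, the projection width is a hypothetical example, and the cosine formula is the standard schedule shape rather than necessarily the trainer's exact implementation:

```python
import math

# LoRA adds two low-rank matrices (d -> r and r -> d) per adapted layer,
# and scales the low-rank update by alpha / r.
r, alpha = 64, 128
scaling = alpha / r                       # 2.0
d = 2048                                  # hypothetical projection width
lora_params_per_layer = r * (d + d)       # extra trainable params per layer

# Standard cosine decay starting from the reported peak LR.
def cosine_lr(step: int, total_steps: int, lr_max: float = 3e-4) -> float:
    return 0.5 * lr_max * (1 + math.cos(math.pi * step / total_steps))

# Effective batch size = per-device batch * gradient accumulation steps.
effective_batch = 2 * 4                   # 8, matching the reported config
```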

Note: Training was stopped early once the loss stopped improving, indicating effective adaptation to the word-level OCR data.

Capabilities

  • Recognizes printed and handwritten Odia text from diverse documents
  • Handles complex fonts, ligatures, and mixed layouts
  • Optimized for document digitization, form processing, and archival tasks
  • Outputs Unicode Odia text, ready for downstream NLP or digital preservation pipelines

Performance

  • Word Accuracy (printed, held-out): ~65–70% (exact match on well-segmented crops)
  • CER estimate on training/eval data: <0.15 (~15%)
  • Qualitative examples show exact matches, partial errors, and failures on degraded or long samples
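
CER is edit distance divided by reference length, computed here at the Unicode code-point level. A minimal reference implementation is sketched below (illustrative only; production pipelines often use a library such as jiwer):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance with a single rolling row."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                            # deletion
                        dp[j - 1] + 1,                        # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))    # substitution
            prev = cur
    return dp[n]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edits needed / reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(cer("kitten", "sitting"))  # 3 edits / 6 chars = 0.5
```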

Availability

The odia-ocr-qwen-finetuned model is freely accessible for developers, researchers, and organizations looking to integrate high-accuracy Odia OCR into their workflows.

You can explore and test the model through our live demo (in progress): https://huggingface.co/spaces/OdiaGenAIOCR/odia-ocr-qwen-finetuned

The model and all associated resources are available for download on Hugging Face.

Limitations

While the model achieves strong performance on word- and line-level printed Odia text, it currently underperforms on long paragraph-level OCR when paragraphs are provided as a single image. One of the primary reasons for this limitation is that the model was not trained on paragraph-scale or multi-line text samples, and therefore has limited exposure to long visual contexts and dense line layouts during fine-tuning.

As a result, the model may exhibit character omissions, line merging, or spacing inconsistencies when processing full paragraphs without prior segmentation. In practical usage, best results are obtained by segmenting paragraph images into lines or smaller regions and merging predictions post-inference.
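One simple way to perform that segmentation is a horizontal projection profile over a binarized page: rows containing no ink mark the boundaries between text lines. The numpy sketch below is illustrative; real scans usually also need deskewing and a minimum-gap threshold:

```python
import numpy as np

def segment_lines(binary: np.ndarray) -> list[tuple[int, int]]:
    """Split a binarized page (ink=1, background=0) into (top, bottom)
    row bands using the horizontal projection profile."""
    has_ink = binary.sum(axis=1) > 0
    bands, start = [], None
    for y, ink in enumerate(has_ink):
        if ink and start is None:
            start = y                      # a new line band begins
        elif not ink and start is not None:
            bands.append((start, y))       # blank row closes the band
            start = None
    if start is not None:
        bands.append((start, len(has_ink)))
    return bands

# Synthetic two-line page: ink in rows 1-2 and 5-7.
page = np.zeros((10, 12), dtype=np.uint8)
page[1:3, 2:9] = 1
page[5:8, 1:10] = 1
print(segment_lines(page))  # [(1, 3), (5, 8)]
```

Each band can then be cropped and sent through the model separately, with the per-line predictions concatenated afterwards.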

Conclusion and Future Work

The odia-ocr-qwen-finetuned model demonstrates that fine-tuning on Odia-specific data delivers significant improvements in OCR performance. Achieving up to ~70% word accuracy on diverse Odia text images, it sets a strong baseline for reliability and correctness in real-world applications.


In upcoming releases, we plan to expand the training data to include paragraph-level and full-page Odia text, along with layout-aware preprocessing and decoding strategies. These enhancements are expected to substantially improve robustness and accuracy on long-form Odia documents.

Contributors

OdiaGenAI OCR Team