OdiaLipi-Qwen2.5-VLM

Bridging Vision and Language for Odia OCR


Overview

OdiaLipi-Qwen2.5-VLM is a vision-language OCR model designed to accurately recognize printed and handwritten Odia text from images and documents. Built on top of the Qwen2.5 Vision-Language architecture, the model bridges visual understanding and linguistic knowledge to address long-standing challenges in Odia OCR, such as complex character shapes, ligatures, and layout variations.

Unlike traditional OCR systems that rely on script-specific heuristics, OdiaLipi-Qwen2.5-VLM leverages multimodal learning to jointly reason over image content and language structure. This enables more robust recognition across diverse document types, including scanned books, historical manuscripts, handwritten notes, and real-world images.

The model has been fine-tuned on merged Odia OCR datasets and evaluated using standard OCR metrics such as Character Error Rate (CER) and Word Error Rate (WER). Early results demonstrate significant improvements over conventional OCR baselines, particularly in scenarios involving degraded scans, handwritten text, and complex layouts.

By making Odia OCR more accurate and accessible, OdiaLipi-Qwen2.5-VLM aims to support large-scale digitization efforts, downstream NLP research, and the preservation of Odia’s rich textual heritage.

Dataset

For OdiaLipi-Qwen2.5-VLM, we used the Odia OCR Merged dataset, a comprehensive multi-source Odia OCR collection curated by the OdiaGenAIOCR community. This dataset merges three major sources to provide a robust foundation of 192,000+ samples for both training and evaluation:

  • OdiaGenAIOCR/Odia-lipi-ocr-data: 64 word-level samples
  • tell2jyoti/odia-handwritten-ocr: 182,152 character-level samples (32x32px)
  • darknight054/indic-mozhi-ocr (Odia subset): 10,000+ printed word images

Dataset composition:

  • 🖨️ Printed Odia text images — scans from books, articles, and documents with varied fonts and layouts
  • ✍️ Handwritten Odia samples — real handwritten pages and notes
  • 🧩 Mixed document styles — clear prints, noisy scans, uneven lighting, and diverse backgrounds

All images are paired with ground-truth Odia Unicode transcriptions, enabling supervised OCR training. The dataset includes all 47 Odia characters (vowels, consonants, and special marks), ensuring comprehensive script coverage.
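Because every sample carries a Unicode transcription, script coverage can be checked directly against the Odia Unicode block (U+0B00–U+0B7F). The helper below is an illustrative sketch, not part of the released dataset tooling; the punctuation set it accepts is an assumption:

```python
# Sketch: validate that a ground-truth transcription stays within the
# Odia Unicode block (U+0B00-U+0B7F), plus whitespace and basic punctuation.
# Illustrative only -- not part of the released dataset tooling.

ODIA_BLOCK = range(0x0B00, 0x0B80)  # Oriya/Odia Unicode block

def is_odia_char(ch: str) -> bool:
    """True if ch is a code point in the Odia Unicode block."""
    return ord(ch) in ODIA_BLOCK

def is_valid_transcription(text: str) -> bool:
    """Accept Odia characters plus whitespace and common punctuation."""
    return all(is_odia_char(ch) or ch.isspace() or ch in ".,;?!|" for ch in text)
```

A check like this is useful for catching mislabeled samples (e.g. Latin text leaking into an Odia ground-truth field) before training.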

Data splits:

  • Training: 58,720 examples
  • Validation: 7,340 examples
  • Test: 7,340 examples

This diverse and well-annotated dataset makes OdiaLipi-Qwen2.5-VLM robust for printed and handwritten text recognition, document-level OCR, and downstream applications like text digitization and language modeling.

Model

odia-ocr-qwen-finetuned is a vision-language model fine-tuned from Qwen2.5-VL-3B-Instruct for high-accuracy Odia OCR. Trained on 58,720 validated text-image pairs from the Odia OCR Merged dataset, it bridges visual understanding and Odia language generation to accurately recognize both printed and handwritten Odia text.

Architecture & Features

  • Base Model: Qwen2.5-VL-3B-Instruct (3.78B parameters)
  • Type: Vision-Language Model (VLM) – multimodal
  • Precision: bfloat16 (mixed precision)
  • Vision Encoder: Extracts spatial and textural features from document images
  • Language Decoder: Generates Odia Unicode text, handling ligatures, conjunct consonants, and special marks
  • Cross-Modal Attention: Aligns visual input with Odia tokens for precise transcription
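An inference pass through this architecture can be sketched with the Transformers Qwen2.5-VL classes. This is a minimal sketch, not the project's official inference script: the checkpoint path, prompt wording, and generation settings are assumptions, and loading the model requires a recent `transformers` release with Qwen2.5-VL support plus a GPU for reasonable latency:

```python
def build_ocr_messages(image_path: str,
                       prompt: str = "Extract the Odia text from this image.") -> list:
    """Build a Qwen2.5-VL chat message pairing an image with an OCR instruction."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ],
    }]

def run_ocr(model_dir: str, image_path: str) -> str:
    """Load a fine-tuned Qwen2.5-VL checkpoint and transcribe one image.

    model_dir is a placeholder for the released checkpoint directory;
    the exact Hugging Face repo name may differ.
    """
    import torch
    from PIL import Image
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_dir, torch_dtype=torch.bfloat16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_dir)

    messages = build_ocr_messages(image_path)
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(
        text=[text], images=[Image.open(image_path)], return_tensors="pt"
    ).to(model.device)

    out = model.generate(**inputs, max_new_tokens=512)
    # Strip the prompt tokens before decoding the transcription.
    trimmed = out[:, inputs.input_ids.shape[1]:]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
```

The bfloat16 load matches the precision listed above; `device_map="auto"` lets Accelerate place the 3.78B-parameter model on available GPUs.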

Training Highlights

  • Training Samples: 57,565
  • Evaluation Samples: 1,155
  • Batch Size: 1 (gradient accumulation x2)
  • Learning Rate: 2e-4
  • Hardware: NVIDIA A100 (80GB)
  • Training Time: ~4 hours (3 epochs)
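The hyperparameters above imply an effective batch size of 2 (per-device batch size × gradient-accumulation steps). A sketch capturing them in one place; the key names are illustrative and this is not the actual training script:

```python
# Illustrative config mirroring the reported hyperparameters.
# Key names are hypothetical -- the actual fine-tuning script is not shown here.
TRAIN_CONFIG = {
    "base_model": "Qwen/Qwen2.5-VL-3B-Instruct",
    "per_device_batch_size": 1,
    "gradient_accumulation_steps": 2,
    "learning_rate": 2e-4,
    "num_epochs": 3,
    "precision": "bfloat16",
}

def effective_batch_size(cfg: dict) -> int:
    """Number of samples contributing to each optimizer step."""
    return cfg["per_device_batch_size"] * cfg["gradient_accumulation_steps"]
```

Gradient accumulation keeps per-step memory low enough for a 3B-parameter multimodal model on a single A100 while preserving a usable effective batch size.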

Capabilities

  • Recognizes printed and handwritten Odia text from diverse documents
  • Handles complex fonts, ligatures, and mixed layouts
  • Optimized for document digitization, form processing, and archival tasks
  • Outputs Unicode Odia text, ready for downstream NLP or digital preservation pipelines

Performance

  • Character Error Rate (CER): 20–40% (document-dependent)
  • Exact Match Accuracy: 40–70% (post-processed)
  • Post-Processing Success: 100% on validation samples
  • Inference Time: ~30–45 seconds per image on A100 GPU
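CER is the Levenshtein edit distance between prediction and reference divided by the reference length in characters; WER is the same ratio computed over whitespace-separated words. A self-contained sketch of both metrics:

```python
def edit_distance(a, b) -> int:
    """Levenshtein distance via dynamic programming (works on strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    """Character Error Rate: edit operations per reference character."""
    return edit_distance(prediction, reference) / max(len(reference), 1)

def wer(prediction: str, reference: str) -> float:
    """Word Error Rate: edit operations per reference word."""
    ref_words = reference.split()
    return edit_distance(prediction.split(), ref_words) / max(len(ref_words), 1)
```

A CER of 0.30, for example, means roughly 3 character-level edits per 10 reference characters, consistent with the 20–40% range reported above.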

Availability

The OdiaLipi-Qwen2.5-VLM model is freely accessible for developers, researchers, and organizations looking to integrate high-accuracy Odia OCR into their workflows.

You can explore and test the model through our live demo (in progress): https://huggingface.co/spaces/OdiaGenAIOCR/odia-ocr-qwen-finetuned

The model and all associated resources are available for download on Hugging Face.

Conclusion

The OdiaLipi-Qwen2.5-VLM model demonstrates that fine-tuning on Odia-specific data delivers significant improvements in OCR performance. Achieving 58% exact-match accuracy on diverse Odia text images, it establishes a strong baseline for reliable Odia text recognition in real-world applications.