OdiaGenAI Releases Gemma Model with Extensive Instruction Set Fine-Tuning for the Hindi Language

OdiaGenAI Team


As part of its new series of Small Language Models (SLMs) for Indic languages, OdiaGenAI has released Hindi-Gemma-2B-instruct, a 2-billion-parameter model produced by supervised fine-tuning (SFT) on a Hindi instruction set of 187k samples.

Gemma-2B was chosen as the base model for two reasons:

  • The 2B size is suitable for CPU and on-device applications.
  • Its tokenizer is more efficient on Indic languages than those of other LLMs.


We used a Hindi instruction set of 187k samples that combines instruction sets for different NLP tasks. Covering several tasks in a single instruction set is intended to enhance the capability of the Hindi Gemma model. The dataset is a mix of three datasets:

  • Alpaca 67K
  • samanantar_100K_hindi
  • OdiaGenAI’s Hindi 20K QA Pairs dataset
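The mix described above can be sketched as a simple merge-and-shuffle step. This is only an illustration, not the team's actual pipeline: the source keys, the `instruction`/`output` field names, and the placeholder records are all assumptions made for the example; only the three dataset sizes come from the text.

```python
import random

# Dataset sizes as stated in the post; the dictionary keys are hypothetical labels.
SIZES = {
    "alpaca_hindi": 67_000,
    "samanantar_hindi": 100_000,
    "odiagenai_hindi_qa": 20_000,
}


def make_placeholder_records(source, count):
    """Build dummy records; real data would be loaded from the three datasets."""
    return [
        {"instruction": f"{source} instruction {i}", "output": f"{source} output {i}"}
        for i in range(count)
    ]


def merge_instruction_sets(named_sets, seed=0):
    """Concatenate several instruction sets, tag each record with its source,
    and shuffle so that the tasks are interleaved during training."""
    merged = []
    for source, records in named_sets.items():
        for record in records:
            merged.append({**record, "source": source})
    random.Random(seed).shuffle(merged)
    return merged


named_sets = {name: make_placeholder_records(name, n) for name, n in SIZES.items()}
merged = merge_instruction_sets(named_sets)
print(len(merged))  # total number of samples in the combined instruction set
```

Tagging each record with its source is a design choice for the sketch: it makes it easy to check the final composition or to rebalance the mix later.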


The Gemma tokenizer is based on byte-level Byte-Pair Encoding (BPE) and was found to be efficient for Indic languages compared with other tokenizers.
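One common way to compare tokenizers is fertility, the average number of tokens produced per word (lower is more efficient). The post does not say which metric was used, so the sketch below is an assumption about how such a comparison could be made. The two toy tokenizers are stand-ins written for this example; they are not the Gemma tokenizer or any real LLM tokenizer.

```python
def fertility(text, tokenize):
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def byte_tokenize(text):
    """Toy tokenizer: one token per UTF-8 byte (no merges at all)."""
    return list(text.encode("utf-8"))


def char_tokenize(text):
    """Toy tokenizer: one token per non-space Unicode code point."""
    return [ch for ch in text if not ch.isspace()]


sample = "भारत एक देश है"  # illustrative Hindi sentence, not from the training data

for name, tok in [("bytes", byte_tokenize), ("chars", char_tokenize)]:
    print(name, fertility(sample, tok))
```

To compare real tokenizers, any callable that maps a string to a list of tokens can be passed as `tokenize`, and the fertility should be averaged over a representative Hindi corpus rather than a single sentence.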


We trained the model on an NVIDIA RTX A4000 GPU with 16 GB of memory. Training ran for 5 epochs and took 17 hours and 41 minutes to complete. The training hyperparameters are shown in Table 1.
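For a rough sense of scale, the reported figures can be turned into a per-epoch time and an approximate throughput. This is back-of-the-envelope arithmetic only: it assumes every epoch took the same time and that all 187k samples were used in each epoch (no held-out split), neither of which is stated in the post.

```python
from datetime import timedelta

total_time = timedelta(hours=17, minutes=41)  # reported total training time
epochs = 5                                    # reported number of epochs
samples_per_epoch = 187_000                   # assumption: full instruction set each epoch

# Average wall-clock time per epoch, assuming epochs are equally long.
per_epoch = total_time / epochs

# Approximate number of training samples processed per second.
throughput = samples_per_epoch * epochs / total_time.total_seconds()

print("time per epoch:", per_epoch)
print("samples per second:", round(throughput, 2))
```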

Table 1: Training Hyperparameters


Below are a few inference samples for the model.

Table 2: Good quality outputs

Table 3: Outputs with hallucination and errors


Hindi-Gemma-2B-instruct is released under the cc-by-nc-4.0 license.


The model is available on Hugging Face at:


OdiaGenAI has released its first Gemma-series Indic LLM, a Hindi model (Hindi-Gemma-2B-instruct). Future work includes:

  • Adding more data for SFT
  • Epoch-wise analysis (epochs 1-10) for Hindi, using both human and automatic evaluation
  • Multiple-language capabilities



We express our gratitude to Dr. Prasad Reddy, Data Care LLC, USA and his team for providing the necessary infrastructure support.