OdiaGenAI Releases Gemma Model with Extensive Instruction Set Fine-Tuning for the Hindi Language

OdiaGenAI Team

Overview

In its new series of Small Language Models (SLMs) for Indic languages, OdiaGenAI has released Hindi-Gemma-2B-instruct, a 2-billion-parameter model supervised fine-tuned (SFT) on a 187k-sample Hindi instruction set.

Gemma-2B was chosen as the base model because:

  • Its 2B size suits CPU and on-device applications.
  • Its tokenizer is more efficient on Indic languages than those of other LLMs.

Dataset

  • We used a 187k-sample Hindi instruction set that combines instruction data for several NLP tasks; this breadth of instructions strengthens the Gemma Hindi model's capabilities. The dataset is a mix of three sources (a merge sketch follows the list):
    • Alpaca 67K
    • samanantar_100K_hindi
    • OdiaGenAI’s Hindi 20K QA-pairs dataset
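
A minimal sketch of how the three sources might be merged into one instruction set, assuming each has already been converted to Alpaca-style `instruction`/`input`/`output` records (the file names and field names below are hypothetical, not taken from this post):

```python
import json

# Hypothetical source files; each is assumed to hold a JSON list of
# Alpaca-style records with "instruction", "input", and "output" fields.
SOURCES = [
    "alpaca_hindi_67k.json",
    "samanantar_hindi_100k.json",
    "odiagenai_hindi_qa_20k.json",
]

merged = []
for path in SOURCES:
    with open(path, encoding="utf-8") as f:
        for record in json.load(f):
            merged.append({
                "instruction": record["instruction"],
                "input": record.get("input", ""),
                "output": record["output"],
            })

print(f"Total samples: {len(merged)}")  # expected ~187k
with open("hindi_instructions_187k.json", "w", encoding="utf-8") as f:
    json.dump(merged, f, ensure_ascii=False)
```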

Tokenizer

The Gemma tokenizer is based on byte-level Byte-Pair Encoding (BPE) and proved more efficient on Indic languages than other tokenizers.
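
One rough way to check this is to count the tokens each tokenizer produces for the same Hindi sentence; fewer tokens for the same text generally indicates a more efficient vocabulary for that language. The sketch below uses Hugging Face's `AutoTokenizer` (the comparison model and sample sentence are illustrative, and gated checkpoints require authentication):

```python
from transformers import AutoTokenizer

# Sample Hindi sentence (illustrative):
# "India is a vast country and many languages are spoken here."
hindi_text = "भारत एक विशाल देश है और यहाँ कई भाषाएँ बोली जाती हैं।"

# Compare token counts across tokenizers for the same text.
for model_id in ["google/gemma-2b", "meta-llama/Llama-2-7b-hf"]:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    print(model_id, "->", len(tokenizer.tokenize(hindi_text)), "tokens")
```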

Training

We used an NVIDIA RTX A4000 GPU with 16 GB memory. The model was trained for 5 epochs, which took 17 hours and 41 minutes. The training hyperparameters are shown in Table 1.

Table 1: Training Hyperparameters
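
The post does not state whether full fine-tuning or parameter-efficient tuning was used; the sketch below assumes QLoRA (a 4-bit base model plus LoRA adapters) with TRL's `SFTTrainer`, since that is a common way to fit a 2B model on a single 16 GB GPU. The dataset file, prompt template, and LoRA settings are assumptions:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

# Load the base model in 4-bit so the 2B model fits on a 16 GB GPU.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.bfloat16,
)

# Hypothetical merged instruction file and prompt template.
dataset = load_dataset("json", data_files="hindi_instructions_187k.json")["train"]
dataset = dataset.map(lambda ex: {
    "text": f"### Instruction:\n{ex['instruction']}\n\n### Response:\n{ex['output']}"
})

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,  # SFTTrainer reads the "text" column by default
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    args=SFTConfig(num_train_epochs=5, output_dir="hindi-gemma-2b-instruct"),
)
trainer.train()
```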

Inference

Below are a few inference samples for the model.

Table 2: Good quality outputs

Table 3: Outputs with hallucination and errors
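
For readers who want to try the model themselves, here is a minimal generation sketch with `transformers` (the repository ID and prompt template are assumptions, not confirmed by this post):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical Hugging Face repository ID for the released model.
model_id = "OdiaGenAI/hindi-gemma-2B-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Example Hindi instruction: "What is the capital of India?"
prompt = "### Instruction:\nभारत की राजधानी क्या है?\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```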

License

Hindi-Gemma-2B-instruct is released under the CC-BY-NC-4.0 license.

Availability

The model is available on Hugging Face at:

Conclusion

OdiaGenAI has released its first Gemma-series Indic LLM for Hindi (Hindi-Gemma-2B-instruct). Future work includes:

  • Adding more data for SFT
  • Epoch-wise analysis (epochs 1-10) of the Hindi model, with both human and automatic evaluation
  • Extending to multiple languages

Contributors

Acknowledgement

We express our gratitude to Dr. Prasad Reddy of Data Care LLC, USA, and his team for providing the necessary infrastructure support.