OdiaGenAI Releases Gemma Model with Extensive Instruction Set Fine-Tuning for the Hindi Language

OdiaGenAI Team

Overview

In its new series of Small Language Models (SLMs) for Indic languages, OdiaGenAI has released Hindi-Gemma-2B-instruct, a 2-billion-parameter model supervised fine-tuned (SFT) on a 187k-sample Hindi instruction set.

Gemma-2B was chosen as the base model because:

  • Its 2B size suits CPU and on-device applications.
  • Its tokenizer is more efficient on Indic languages than those of other LLMs.

Dataset

  • We used a 187k-sample Hindi instruction set that combines instruction data for several NLP tasks; this breadth of instructions strengthens the Gemma Hindi model's capabilities. The dataset is a mix of three sources (a merge sketch follows the list):
    • Alpaca 67K
    • samanantar_100K_hindi
    • OdiaGenAI’s Hindi 20K QA-pairs dataset
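
A minimal sketch of how the three sources might be merged into one instruction set, assuming each has already been converted to Alpaca-style `instruction`/`input`/`output` records (the file names and field names below are hypothetical, not taken from this post):

```python
import json

# Hypothetical source files; each is assumed to hold a JSON list of
# Alpaca-style records with "instruction", "input", and "output" fields.
SOURCES = [
    "alpaca_hindi_67k.json",
    "samanantar_hindi_100k.json",
    "odiagenai_hindi_qa_20k.json",
]

merged = []
for path in SOURCES:
    with open(path, encoding="utf-8") as f:
        for record in json.load(f):
            merged.append({
                "instruction": record["instruction"],
                "input": record.get("input", ""),
                "output": record["output"],
            })

print(f"Total samples: {len(merged)}")  # expected ~187k
with open("hindi_instructions_187k.json", "w", encoding="utf-8") as f:
    json.dump(merged, f, ensure_ascii=False)
```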

Tokenizer

The Gemma tokenizer is based on byte-level Byte-Pair Encoding (BPE) and proved more efficient on Indic languages than other tokenizers.
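
One rough way to check this is to count the tokens each tokenizer produces for the same Hindi sentence; fewer tokens for the same text generally indicates a more efficient vocabulary for that language. The sketch below uses Hugging Face's `AutoTokenizer` (the comparison model and sample sentence are illustrative, and gated checkpoints require authentication):

```python
from transformers import AutoTokenizer

# Sample Hindi sentence (illustrative):
# "India is a vast country and many languages are spoken here."
hindi_text = "भारत एक विशाल देश है और यहाँ कई भाषाएँ बोली जाती हैं।"

# Compare token counts across tokenizers for the same text.
for model_id in ["google/gemma-2b", "meta-llama/Llama-2-7b-hf"]:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    print(model_id, "->", len(tokenizer.tokenize(hindi_text)), "tokens")
```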

Training

We used an NVIDIA RTX A4000 GPU with 16 GB memory. The model was trained for 5 epochs, which took 17 hours and 41 minutes. The training hyperparameters are shown in Table 1.

Table 1: Training Hyperparameters
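
The post does not state whether full fine-tuning or parameter-efficient tuning was used; the sketch below assumes QLoRA (a 4-bit base model plus LoRA adapters) with TRL's `SFTTrainer`, since that is a common way to fit a 2B model on a single 16 GB GPU. The dataset file, prompt template, and LoRA settings are assumptions:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

# Load the base model in 4-bit so the 2B model fits on a 16 GB GPU.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    torch_dtype=torch.bfloat16,
)

# Hypothetical merged instruction file and prompt template.
dataset = load_dataset("json", data_files="hindi_instructions_187k.json")["train"]
dataset = dataset.map(lambda ex: {
    "text": f"### Instruction:\n{ex['instruction']}\n\n### Response:\n{ex['output']}"
})

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,  # SFTTrainer reads the "text" column by default
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    args=SFTConfig(num_train_epochs=5, output_dir="hindi-gemma-2b-instruct"),
)
trainer.train()
```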

Inference

Below are a few inference samples for the model.

Table 2: Good quality outputs

Table 3: Outputs with hallucination and errors
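
For readers who want to try the model themselves, here is a minimal generation sketch with `transformers` (the repository ID and prompt template are assumptions, not confirmed by this post):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical Hugging Face repository ID for the released model.
model_id = "OdiaGenAI/hindi-gemma-2B-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Example Hindi instruction: "What is the capital of India?"
prompt = "### Instruction:\nभारत की राजधानी क्या है?\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```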

License

Hindi-Gemma-2B-instruct is released under the CC-BY-NC-4.0 license.

Availability

The model is available on Hugging Face at:

Conclusion

OdiaGenAI has released its first Gemma-series Indic LLM for Hindi (Hindi-Gemma-2B-instruct). Future work includes:

  • Adding more data for SFT
  • Epoch-wise analysis (epochs 1-10) of the Hindi model, with both human and automatic evaluation
  • Extending to multiple languages

Contributors

Acknowledgement

We express our gratitude to Dr. Prasad Reddy of Data Care LLC, USA, and his team for providing the necessary infrastructure support.