OdiaGenAI Releases Llama3 Fine-tuned Model for the Odia Language

Author: OdiaGenAI Team

Overview

OdiaGenAI recently launched a fine-tuned Llama3-8B model (Llama3_8B_Odia_Unsloth) specifically tailored for the low-resource Odia language. The fine-tuned model is built with Meta Llama3. The decision to select Llama3 stemmed from its remarkable attributes:

Llama3 showcases cutting-edge performance across various industry benchmarks and introduces novel functionalities, notably enhanced reasoning capabilities. 
Its pretraining dataset spans high-quality non-English data encompassing over 30 languages (although not including any Indic languages).
Earlier Llama models already performed comparatively well on Indic languages, including Odia.

Dataset

The dataset comprises 171k instructions and includes: i) translated versions of well-known instruction sets, and ii) a curated domain-knowledge instruction set covering diverse subjects. These include culinary expertise in RECIPES, historical narratives about HISTORICAL PLACES and TEMPLES OF ODISHA, and scholarly content spanning ARITHMETIC, HEALTH, and GEOGRAPHY.
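To make the instruction-set structure concrete, the following is a minimal sketch of how a single record might be turned into a training prompt. It assumes an Alpaca-style template and the field names `instruction`, `input`, and `output`; neither the template nor the field names are confirmed by this post, and the helper `format_record` is hypothetical.

```python
# Alpaca-style templates. ASSUMPTION: the post does not state which prompt
# template was used; these are shown for illustration only.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)


def format_record(record: dict, eos_token: str = "") -> str:
    """Render one instruction record as a single training string.

    ASSUMPTION: records carry 'instruction', optional 'input', and 'output'.
    """
    instruction = record["instruction"].strip()
    context = (record.get("input") or "").strip()
    output = (record.get("output") or "").strip()
    # Use the shorter template when the record has no extra context.
    template = PROMPT_WITH_INPUT if context else PROMPT_NO_INPUT
    text = template.format(instruction=instruction, input=context, output=output)
    # Appending the end-of-sequence token teaches the model where to stop.
    return text + eos_token


example = {"instruction": "Name the capital of Odisha.", "input": "", "output": "Bhubaneswar"}
print(format_record(example, eos_token="</s>"))
```

During fine-tuning, `eos_token` would normally be taken from the tokenizer rather than hard-coded.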

Tokenizer

The tokenizer used is a Byte-Pair-Encoding (BPE) model based on tiktoken. We tested the tokenization efficiency of our model on a set of texts; the results are shown in Figure 1.

Figure 1: Token count of given text based on different tokenizers

As Figure 1 shows, the tokenizer of our current model (Llama-3-8B) is less efficient on Odia text than the tokenizers of models dedicated to Indic languages, such as Indic-Bert. Note that the pre-training data for Llama-3-8B did not include any Odia text. Even so, the output of our current model has been more coherent and of better quality than that of most of our other models.
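A comparison like the one in Figure 1 can be sketched as a token count per tokenizer for the same text. The snippet below is an illustration only, not the script used for the figure: the helper names are hypothetical, and two stand-in tokenizers (raw UTF-8 bytes and whitespace splitting) replace the real ones so that the example runs without downloads. The byte-level stand-in hints at why a byte-based BPE with few Odia merges is costly, since each Odia code point occupies three bytes in UTF-8.

```python
from typing import Callable, Dict, List


def compare_tokenizers(text: str, tokenizers: Dict[str, Callable[[str], List]]) -> Dict[str, int]:
    """Return the number of tokens each tokenizer produces for `text`."""
    return {name: len(tokenize(text)) for name, tokenize in tokenizers.items()}


def fertility(text: str, tokenize: Callable[[str], List]) -> float:
    """Average tokens per whitespace-separated word (lower is more efficient)."""
    words = text.split()
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)


# Stand-in tokenizers (ASSUMPTION: used here only so the sketch is self-contained).
def utf8_byte_tokenizer(text: str) -> List[int]:
    # Worst case for a byte-level vocabulary: one token per UTF-8 byte.
    return list(text.encode("utf-8"))


def whitespace_tokenizer(text: str) -> List[str]:
    # Best case: one token per word.
    return text.split()


sample = "ଓଡ଼ିଆ ଭାଷା"
stand_ins = {"utf8_bytes": utf8_byte_tokenizer, "whitespace": whitespace_tokenizer}
print(compare_tokenizers(sample, stand_ins))
print(fertility(sample, utf8_byte_tokenizer))
```

With Hugging Face `transformers` installed, the `tokenize` method of a tokenizer loaded via `AutoTokenizer.from_pretrained(...)` can be passed in place of the stand-ins.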

Training Stats

We used the Unsloth library for fine-tuning, which speeds up training and is a popular choice for fine-tuning Llama-3. The base checkpoint was "unsloth/llama-3-8b-bnb-4bit". Training ran on a single GPU (NVIDIA A100-SXM4-40GB) provided by Colab Pro+. The training hyperparameters are shown in Table 1.

Table 1: Training Hyperparameters

The training script is available on GitHub.
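For orientation, here is a minimal sketch of how the 4-bit base checkpoint can be loaded with Unsloth and prepared for parameter-efficient fine-tuning. It assumes LoRA adapters, as in Unsloth's public notebooks; the post does not confirm this. The LoRA values and `max_seq_length` are illustrative placeholders, not the settings from Table 1, and the function name is hypothetical. The Unsloth import is deferred into the function because it requires a CUDA GPU.

```python
# Base checkpoint named in this post.
BASE_MODEL = "unsloth/llama-3-8b-bnb-4bit"

# ASSUMPTION: illustrative LoRA settings, NOT the values from Table 1.
LORA_CONFIG = {
    "r": 16,
    "lora_alpha": 16,
    "lora_dropout": 0,
    "bias": "none",
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}


def load_for_finetuning(max_seq_length: int = 2048):
    """Load the 4-bit base model and attach LoRA adapters (needs a CUDA GPU)."""
    # Imported lazily so this module can be read and imported without a GPU.
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=BASE_MODEL,
        max_seq_length=max_seq_length,  # ASSUMPTION: placeholder value
        dtype=None,                     # let Unsloth pick the dtype
        load_in_4bit=True,
    )
    # Only the small adapter matrices are trained; the 4-bit weights stay frozen.
    model = FastLanguageModel.get_peft_model(model, **LORA_CONFIG)
    return model, tokenizer
```

Keeping the base weights in 4-bit and training only the adapters is what makes an 8B model practical on a single 40 GB GPU.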

Inference

The inference script is adapted from Unsloth. A sample inference is shown here.
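As a rough illustration of what Unsloth-based inference looks like, the sketch below loads the released model and generates a completion for a prompt. It is not the authors' script: the function names, `max_seq_length`, and `max_new_tokens` are assumptions, and the prompt is expected to follow whatever template was used in training. A CUDA GPU and the `unsloth` package are required to call these functions.

```python
# Model repository named in this post.
MODEL_ID = "OdiaGenAI-LLM/Llama3_8B_Odia_Unsloth"


def load_for_inference(max_seq_length: int = 2048):
    """Load the fine-tuned model in 4-bit and switch it to inference mode."""
    # Imported lazily because Unsloth needs a CUDA GPU at import time.
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=MODEL_ID,
        max_seq_length=max_seq_length,  # ASSUMPTION: placeholder value
        dtype=None,
        load_in_4bit=True,
    )
    FastLanguageModel.for_inference(model)  # enables Unsloth's faster generation path
    return model, tokenizer


def generate(model, tokenizer, prompt: str, max_new_tokens: int = 128) -> str:
    """Generate a completion and return only the newly produced text."""
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, use_cache=True)
    # Drop the prompt tokens so that only the model's answer is decoded.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Loading the model once and reusing it across calls to `generate` avoids repeating the expensive load step.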

Evaluation

Some sample outputs are shown in Table 2 and Table 3.

Table 2: Some Good Quality Output

Table 3: Outputs with Hallucination and Errors

License

The Llama3_8B_Odia_Unsloth model is released under the Llama3 license.

Availability

The dataset and model are available on Hugging Face at:

Dataset: https://huggingface.co/datasets/OdiaGenAI/all_combined_odia_171k

Model: https://huggingface.co/OdiaGenAI-LLM/Llama3_8B_Odia_Unsloth

Conclusion

OdiaGenAI has released its Llama3-8B fine-tuned LLM for the Odia language. The Unsloth library accelerated the fine-tuning process. The fine-tuned Llama3_8B_Odia_Unsloth model will help researchers explore fine-tuning Llama3 for Indic languages further.

Future work includes:

Adding more data (including domain and cultural data) for SFT
Training for more epochs and performing epoch-wise analysis
Extending the work to other Indic languages

Limitations

The manual assessment identified the following drawbacks:

Hallucination: The model occasionally generates erroneous information alongside correct responses.
Incorrect Information: Despite improvements, the model continues to struggle with queries related to Odisha and its cultural aspects. 

Contributors

Dr. Shantipriya Parida
Sambit Sekhar
Debasish Dhal
Shakshi Panwar