Overview
Generative artificial intelligence is changing many aspects of our lives and transforming the way we interact with technology. Large Language Models (LLMs) are significantly impacting the AI community, and the advent of ChatGPT and GPT-4 has prompted a rethinking of the possibilities of artificial general intelligence (AGI).
However, most LLMs are trained on English and other high-resource languages, which leaves LLMs and their related technologies and services unavailable for many low-resource languages. Although many LLMs are multilingual and support Odia, their performance on tasks such as content generation and question answering leaves room for improvement because of the limited amount of Odia data they have ingested.
In an attempt to build Generative AI and LLM-based technologies for the Odia language, OdiaGenAI released "Olive", an instruction-following LLaMA model for the low-resource Odia language, on 19th May 2023.
The dataset and code (training/inference) are freely available for research and non-commercial purposes. The datasets and models are available on the Odia Generative AI Hugging Face page.
Dataset
The dataset contains 171K Odia instruction sets. The instruction sets are prepared by:
i) translating popular instruction sets (listed below) from English to Odia using the IndicTrans machine translation library from AI4Bharat:
- Alpaca
- Dolly
- GPT Teacher
ii) preparing a translation instruction set from OdiEnCorp, an English-Odia parallel corpus
iii) adding a hard-coded instruction set
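To make step ii) concrete, here is a minimal sketch of how sentence pairs from a parallel corpus could be turned into instruction records. It assumes the Alpaca-style `instruction` / `input` / `output` record layout; the instruction wording, the choice to emit both translation directions, and the function name `build_translation_records` are all hypothetical and are not taken from the OdiaGenAI code.

```python
import json

# Hypothetical instruction wording; the prompts actually used for Olive
# are not given in this post.
EN_TO_OR = "Translate the following sentence from English to Odia."
OR_TO_EN = "Translate the following sentence from Odia to English."


def build_translation_records(pairs):
    """Turn (english, odia) sentence pairs into Alpaca-style records.

    Emitting both translation directions is an assumption of this sketch.
    """
    records = []
    for english, odia in pairs:
        english, odia = english.strip(), odia.strip()
        if not english or not odia:
            continue  # skip pairs where either side is empty
        records.append({"instruction": EN_TO_OR, "input": english, "output": odia})
        records.append({"instruction": OR_TO_EN, "input": odia, "output": english})
    return records


# Toy stand-in for a parallel corpus; the second pair is dropped
# because its English side is blank.
sample_pairs = [("Hello", "ନମସ୍କାର"), ("  ", "ଖାଲି")]
records = build_translation_records(sample_pairs)
print(json.dumps(records, ensure_ascii=False, indent=2))
```

Keeping the records in the same layout as the translated Alpaca, Dolly, and GPT Teacher sets would let all sources be concatenated into one training file.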
Training
The first experimental model was trained for 3 epochs on a GPU (A100, 40 GB VRAM) provided through Colab Pro+, following the Alpaca-LoRA training script. The training parameters are shown in Table 1.
Table 1: Training Hyperparameters
The training took more than 30 hours and cost 4200 INR. We had to restart the training partway through because of issues in saving checkpoints; the restart is visible in the train/eval curve.
Figure 1: Train/Eval Loss curve
The first OdiaGPT model, odiagenAI-model-v1, was released on 19th May 2023 through Hugging Face under a CC BY-NC-SA 4.0 license. The model uses Llama-7b as the base model and is fine-tuned on the Odia instruction set for 3 epochs. The Hugging Face model card provides the model description and instructions for running it. The code (translation, training, and inference) is available on GitHub.
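For readers unfamiliar with why an Alpaca-LoRA-style run can fit on a single 40 GB GPU, the sketch below counts trainable parameters for one weight matrix under LoRA, which freezes the original weights and trains a low-rank update instead. The hidden size of 4096 is that of LLaMA-7B; the rank of 8 is an illustrative assumption and is not the value from Table 1.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates the whole d_out x d_in matrix W.
    LoRA freezes W and trains B (d_out x rank) and A (rank x d_in),
    so the effective weight becomes W + B @ A.
    """
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora


HIDDEN = 4096  # hidden size of LLaMA-7B
RANK = 8       # illustrative rank only; NOT taken from Table 1

full, lora = lora_param_counts(HIDDEN, HIDDEN, RANK)
print(f"full fine-tuning: {full:,} parameters per projection matrix")
print(f"LoRA (rank {RANK}): {lora:,} parameters per projection matrix")
print(f"reduction factor: {full // lora}x")
```

Under these assumptions each adapted projection matrix needs 256 times fewer trainable parameters, which is what keeps optimizer state and gradients small.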
Inference
The inference script is adapted from Alpaca-LoRA, using Llama-7b as the base model with the odia-genAI-model-v1 weights. Sample inferences are shown in Figures 2-5.
Figure 2: Sample Inference. The question is “Who are you?” The answer is “I am Olive, known as a chat assistant and trained by the researchers of OdiaGenAI.” Note: The last sentence is repeated in the model output.
Figure 3: Sample Inference. The question is “Which are the main cities of India?” The answer is “Some of the main cities of India are New Delhi, Mumbai, Chennai, Bengaluru, and Kolkata.” Note: A few words in the output are incorrect, namely a remark about population growth in the city of Chennai and “Hindu” (a wrong city name).
Figure 4: Sample Inference. The question is “Who is the prime minister of India?” The answer is “Narendra Modi is the prime minister of India.”
Figure 5: Sample Inference. The question is “Write a poem in Odia.” The answer is “I apologize, I am an artificial intelligence model and I don't have the capability to write a poem but I write a poem that can be written in Odia.”
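Since the inference script is adapted from Alpaca-LoRA, questions like the ones above are typically wrapped in an instruction prompt before being passed to the model. The sketch below uses the standard Alpaca prompt template; whether Olive uses exactly this English template is an assumption, and `generate_prompt` and `extract_response` are illustrative helpers rather than the project's actual functions.

```python
def generate_prompt(instruction, input_text=""):
    """Build an Alpaca-style prompt (standard Alpaca template, assumed here)."""
    if input_text:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            "### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    )


def extract_response(generated_text):
    """Keep only the text after the last response marker."""
    return generated_text.split("### Response:")[-1].strip()


prompt = generate_prompt("Who are you?")
# Stand-in for model output: a causal LM returns the prompt plus its continuation.
fake_generation = prompt + "I am Olive."
print(extract_response(fake_generation))
```

Splitting on the response marker matters because the generated text includes the prompt itself, so the answer has to be separated from it before display.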
Analysis
- The model is able to follow Odia instructions and generate content in Odia.
- The model provides correct answers to general knowledge questions about India.
- The model still suffers from hallucinations.
- Due to the lack of Odisha-related context data, the model fails to answer questions relating to Odisha.
- The model is still unable to handle arithmetic problems and critical reasoning.
Future Plan
The plan includes:
i) Fine-tuning with more instruction sets containing knowledge about Odisha and its local context (literature, food, places, persons, festivals, history, politics, etc.), arithmetic, and critical reasoning.
ii) Continuing fine-tuning with more instruction sets built from validated Odia data, using larger open-source models that support Odia as the base LLM.
iii) Releasing a pre-trained Odia LLM following the BLOOM specification.
iv) Fine-tuning the LLM for specific domains and releasing Odia chatbots (education, public health, tourism, governance) for general use.
Acknowledgment
We thank the following institutions/organizations for their LLM resources and support.
Team
Feel free to contact us for any feedback/suggestions/contributions.