OdiaGenAI Released the First LLM for the Low Resource Odia Language

Author: OdiaGenAI Team


Overview

Large Language Models (LLMs) are significantly impacting the AI community, and the advent of ChatGPT and GPT-4 has prompted a rethinking of the possibilities of artificial general intelligence (AGI). However, most LLMs are trained on English and other high-resource languages, leaving many low-resource languages without LLMs and the technologies and services built on them.

Odia Generative AI (in short, OdiaGenAI) is an initiative by a group of researchers to study Generative AI and Large Language Models (LLMs) for the low-resource Odia language. It is supported by Odisha.ML, the global machine learning community of Odias.

The OdiaGenAI initiative aims to harness the power of AI to build Generative AI and LLM-based technologies and solutions for the overall development of Odisha and the Odia language, through collaboration among Odia technologists.

The motivations for the OdiaGenAI initiative are:

  • Though many LLMs support multiple languages, including Odia, their performance on various tasks (e.g., content generation, question answering) is limited by the small amount of ingested Odia data. Moreover, the LLMs that support Odia fail to understand context, as shown in Figs. 1 and 2.
  • High-performing LLMs charge subscription fees.
  • The privacy of the data input to these LLMs, and their bias, are in question.
Figure 1: Query about “what are the famous foods of Berhampur” (a town in Odisha). The model says it has no idea and generates text irrelevant to the question.


Figure 2: Asking to “prepare a 3-day travel plan for Odisha”. The answer is not as expected, and the generated text is generally unorganized and unspecific.

OdiaGenAI is working on building pre-trained, fine-tuned, and instructed LLMs for the low-resource Odia language. The models, code, and datasets are freely available for non-commercial and research purposes.

Data

The training dataset contains 91K Odia instruction sets: i) 41K instructions from the open-sourced Stanford Alpaca set translated into Odia, and ii) 50K Odia question-answer pairs from AI4Bharat-IndicQA. For translating the Stanford Alpaca instruction set from English to Odia, we used the IndicTrans machine translation library from AI4Bharat.
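The merged data can be sketched as Alpaca-format JSON records (instruction/input/output, following the Stanford Alpaca convention); the helper below is hypothetical and the field values are placeholders, not the actual dataset:

```python
# Each record follows the Stanford Alpaca instruction format: an
# instruction, an optional input, and the expected output, all in Odia
# after translation. The values here are placeholders.
record = {
    "instruction": "<Odia-translated instruction>",
    "input": "<optional Odia context>",
    "output": "<Odia answer>",
}

def merge_instruction_sets(alpaca_records, indicqa_pairs):
    """Merge the translated Alpaca records and IndicQA (question, answer)
    pairs into one list of Alpaca-format records (a hypothetical helper)."""
    merged = list(alpaca_records)
    for question, answer in indicqa_pairs:
        merged.append({"instruction": question, "input": "", "output": answer})
    return merged

# Example: 41K translated Alpaca records + 50K QA pairs -> 91K records
merged = merge_instruction_sets([record] * 41000, [("q", "a")] * 50000)
print(len(merged))  # 91000
```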

Training

The first experimental model was trained for 2 epochs on a GPU provided through Colab Pro+, following the Alpaca-LoRA training script. The training parameters are shown in Table 1.
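A run of this kind might be launched roughly as below; the flag names follow tloen/alpaca-lora's finetune.py, while the checkpoint name and data path are assumptions for illustration:

```shell
# Hedged sketch: fine-tune Llama-7b on the Odia instruction set for 2 epochs
# with the Alpaca-LoRA script (paths and checkpoint name are hypothetical).
python finetune.py \
    --base_model 'decapoda-research/llama-7b-hf' \
    --data_path 'odia_instruction_set_91k.json' \
    --output_dir './odiagenai-lora' \
    --num_epochs 2
```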


Table 1: Training Hyperparameters

Model

The first experimental model, odiagenAI-model-v0, was released on 12th April 2023 through Hugging Face under a CC BY-NC-SA 4.0 license. The model uses Llama-7b as the base model and is fine-tuned on the Odia-translated instruction set for 2 epochs. The Hugging Face model card gives the model description and running instructions. The code (translation, training, and inference) is available on GitHub.

Inference

The inference script is adapted from Alpaca-LoRA, using the base model Llama-7b with the odiagenAI-model-v0 weights, as shown in Fig. 3.


Figure 3: Inference Code
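An Alpaca-LoRA style inference run with the released LoRA weights would be invoked roughly as below; the flag names follow tloen/alpaca-lora's generate.py, and the weights identifier is an assumption:

```shell
# Hedged sketch: run inference with the base Llama-7b model plus the
# odiagenAI-model-v0 LoRA weights (identifier is hypothetical).
python generate.py \
    --base_model 'decapoda-research/llama-7b-hf' \
    --lora_weights 'OdiaGenAI/odiagenAI-model-v0'
```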

The inference prompt accepts input in Odia text and outputs Odia text. The output generated after 2 epochs is shown in Fig. 4.
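Before generation, the Odia instruction is wrapped in the standard Alpaca prompt template; the English scaffold below is the Alpaca convention used by the Alpaca-LoRA script, and `build_prompt` is an illustrative helper, not the project's actual code:

```python
def build_prompt(instruction: str, inp: str = "") -> str:
    """Wrap an (Odia) instruction in the standard Alpaca prompt template,
    as used by Alpaca-LoRA style inference (illustrative helper)."""
    if inp:
        return (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{inp}\n\n"
            "### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response "
        "that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    )

# The model then generates Odia text after the "### Response:" marker.
print(build_prompt("ସପ୍ତାହରେ କେତେ ଦିନ?"))
```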


Figure 4: Sample inference. The question is, “How many days are in a week?” The answer is not as expected; the model instead asks for a proper instruction.

Analysis

Although the output of the fine-tuned model does not yet meet expectations, building this experimental model allowed the data preparation pipeline and the training and inference scripts to be tested for further experiments.

Future Plan

The future plan includes:

i) Continuing fine-tuning with more instruction sets of validated Odia data, using LLaMA and BLOOM as base LLMs,

ii) Releasing a pre-trained Odia LLM following the BLOOM specification,

iii) Fine-tuning LLMs for specific domains and releasing Odia chatbots (education, public health, tourism, governance) for general usage.

Acknowledgment

We thank the following institutions/organizations for their LLM resources and support.

Team


 

Feel free to contact us for any feedback/suggestions/contributions.