OdiaGenAI Released an Instruction-Following LLaMA Model for Bengali

Author: OdiaGenAI Team

Overview

OdiaGenAI released a new BengaliGPT model as part of its initiative to build Generative AI and LLM-based technologies for Odia and Indic languages.

The BengaliGPT model is based on Llama-7b and fine-tuned with a 252k Bengali instruction set. The instruction set was translated from open-source resources, which gives the model the ability to understand Bengali instructions and generate Bengali responses.

Data

The dataset combines Bengali translations of the following open-source instruction sets:

  • Dolly
  • Alpaca
  • ChatDoctor
  • Roleplay
  • GSM

Each record in the dataset contains Bengali instruction, input, and output strings. The instruction set is available on Hugging Face for research and non-commercial purposes.
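To make the record layout concrete, here is a minimal sketch of how one instruction/input/output record could be turned into a training prompt. It assumes the widely used Alpaca prompt template, which seems plausible given that training follows the Alpaca-LoRA script, but the post does not confirm the exact template. The sample record and the helper name `build_prompt` are illustrative and not taken from the released dataset.

```python
# Alpaca-style prompt templates (assumption: not confirmed for this dataset).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

REQUIRED_FIELDS = ("instruction", "input", "output")


def build_prompt(record: dict) -> str:
    """Render one record as a prompt; the 'output' field is the training target."""
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    if record["input"].strip():
        return PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])


# Illustrative record (made up for this sketch, not copied from the dataset).
sample = {
    "instruction": "১০ এবং ২০ এর যোগফল কত?",  # "What is the sum of 10 and 20?"
    "input": "",
    "output": "৩০",
}
prompt = build_prompt(sample)
print(prompt)
```

Records with an empty input string fall back to the shorter template, so the model is not shown an empty "Input" section.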

Training

The first experimental model was trained on GPU for 5 epochs following the Alpaca-LoRA training script. The training parameters are shown in Table 1 and the loss curve in Figure 1.

Table 1: Training Hyperparameters

Figure 1: Train/Eval Loss curve
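As a rough sense of the scale of this run, the sketch below estimates the number of optimizer updates for 5 epochs over the instruction set. It reads "252k" as roughly 252,000 examples and does not subtract any held-out evaluation data. The batch size is a hypothetical value chosen for illustration, not a figure from Table 1.

```python
import math


def optimizer_steps(num_examples: int, epochs: int, batch_size: int) -> int:
    """Total optimizer updates: ceil(examples / batch size) per epoch, times epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


NUM_EXAMPLES = 252_000  # "252k" from the post, rounded; eval split not subtracted
EPOCHS = 5              # from the post
BATCH_SIZE = 128        # assumption: hypothetical effective batch size

total = optimizer_steps(NUM_EXAMPLES, EPOCHS, BATCH_SIZE)
print(total)  # 9845 under these assumptions
```

The actual step count depends on the real batch size, the size of the evaluation split, and how the trainer handles the final partial batch.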

Model

The BengaliGPT model, odiagenAI-bengali-lora-model-v1, was released on 10 June 2023 through Hugging Face under a CC BY-NC-SA 4.0 license. It uses Llama-7b as the base model and was fine-tuned on the translated Bengali instruction set for 5 epochs. The Hugging Face model card gives the model description and instructions for running it. The code (translation, training, and inference) is available on GitHub.

Inference

The inference script is adapted from Alpaca-LoRA. It loads Llama-7b as the base model and applies the odiagenAI-bengali-lora-model-v1 weights on top of it.
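The following is a minimal sketch of that loading pattern, not the project's actual inference script. It assumes the Hugging Face `transformers` and `peft` libraries (plus `accelerate` for `device_map="auto"`) and leaves both repository identifiers as caller-supplied parameters, since the post does not spell them out. The helper names and the "### Response:" marker, taken from the Alpaca prompt format, are also assumptions.

```python
def load_bengali_model(base_model_id: str, adapter_id: str):
    """Load a Llama base model and attach LoRA adapter weights (sketch)."""
    # Local imports so this module can be imported without the heavy dependencies.
    import torch
    from peft import PeftModel
    from transformers import LlamaForCausalLM, LlamaTokenizer

    tokenizer = LlamaTokenizer.from_pretrained(base_model_id)
    base = LlamaForCausalLM.from_pretrained(
        base_model_id,
        torch_dtype=torch.float16,
        device_map="auto",  # assumption: requires the accelerate package
    )
    model = PeftModel.from_pretrained(base, adapter_id)
    model.eval()
    return tokenizer, model


def extract_response(text: str, marker: str = "### Response:") -> str:
    """Keep only the text after the last response marker (assumed Alpaca format)."""
    return text.split(marker)[-1].strip()


def generate(tokenizer, model, prompt: str, max_new_tokens: int = 256) -> str:
    """Generate a completion for an already formatted prompt."""
    import torch

    device = next(model.parameters()).device
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    decoded = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return extract_response(decoded)
```

A caller would pass the base model and adapter identifiers from the Hugging Face model card to `load_bengali_model`, then call `generate` with a prompt in the same format used during training.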

The inference prompt accepts Bengali text as input and generates Bengali text as output. Text-to-speech is integrated, so the generated output is also converted to speech.

Figure 2: Sample Inference. The question is, “What is the sum of 10 plus 20?” The answer is “The sum of 10 plus 20 = 10 + 20 = 30 and the sum of 10 plus 20 can be expressed as a number with 30”

Figure 3: Sample Inference. The question is, “What are the benefits of eating an apple a day?” The answer is “Benefits of eating an apple a day Apples are a healthy and wholesome food to eat.”

Figure 4: Sample Inference. The question is, “What is the primary source of energy that causes evaporation of water from the surface of a body of water?” The input is { "text": [ "solar radiation", "conduction by plants", "heat from surrounding land mass", "convection currents in water" ], "label": [ "A", "B", "C", "D" ] }. The answer is “Solar radiation by plants is a primary source of energy that causes evaporation of water from the surface of water bodies.”

Figure 5: Sample Inference. The question is, “Write python code for Fibonacci Series.” The answer is “The following code can be used to write python code for the Fibonacci Series [python code].”

How to Use

OdiaGenAI's latest Bengali model can be run on Colab free of charge. The model is trained to generate text in Bengali, so it can be used to create Bengali content, generate creative writing, or build Bengali-language applications.

The model also comes with an integrated Text-To-Speech (TTS) feature, so the generated Bengali text can be converted into speech. This opens up possibilities for audio content creation, language learning, accessibility, and more.

Running the model on Colab takes four steps:

  • Step 1: Open the link to the Bengali Generative AI model on Colab: https://colab.research.google.com/drive/1HYHZJwsNWk9auZ_o39G3AIGMtGkqVG2o?usp=sharing
  • Step 2: In Colab, navigate to "Runtime" and select "Run all cells". This will initiate the model and load all the necessary dependencies.
  • Step 3: Once the cells have finished running, you will be provided with a Gradio URL. Gradio provides a web interface for interacting with the model.
  • Step 4: Click on the Gradio URL. It opens a web interface where you can enter your text in Bengali.

Figure 6: Gradio URL for Inference

You can choose to generate text or have it converted into speech using the integrated TTS feature. Play around with the model, generate Bengali text, and listen to the TTS output. Explore its capabilities and get creative with your ideas.

Analysis

Although the current model accepts Bengali input and generates answers in Bengali, it still fails to answer general-knowledge questions about India because it lacks domain knowledge. Our evaluation also shows that it fails at critical reasoning.

Future Plan

The plan includes i) fine-tuning with a more domain-specific Bengali instruction set, ii) pre-training a Bengali LLM, and iii) developing a chatbot that supports Bengali.

Acknowledgment

We thank the following institutions/organizations for their LLM resources and support.

OdiaGenAI Team