OdiaGenAI Released Llama2-Fine-tuned Model for Odia  


Author: OdiaGenAI Team



The multilingual landscape of NLP has witnessed significant advancements in recent years, with several models tailored to support a myriad of languages. However, a noticeable gap persists in digital representation when it comes to comprehensive support for Indic languages, including Odia. This gap fuels our dedication to the cause – empowering the Odia language through robust, dedicated language models.

Odia, an Indic language spoken predominantly in the Indian state of Odisha, boasts a rich cultural heritage and a growing digital presence. However, it has long been underserved in the digital sphere, and the need to empower it through enhanced language representation has never been more pressing. Only a few multilingual LLMs support Odia, owing to the limited amount and quality of available data, and even their performance leaves much to be desired, as shown in Figure 1.


Fig. 1. Queries to ChatGPT 3.5 about preparing the famous Odia sweet Rasgulla. The answer from ChatGPT 3.5 is irrelevant and noisy.


The instruction dataset encapsulates a wealth of linguistic diversity, domain-specific knowledge, and expansive coverage of the Odia language landscape. It is a versatile repository that encompasses two distinct types of data:

Translated Instruction Set: The data was obtained from various internet sources, primarily blogs and Wikipedia, and translated using the IndicTrans library under the supervision of native Odia speakers, who manually corrected any discrepancies. These instructions serve as valuable training data for Llama2, enabling it to understand and generate Odia text that remains faithful to the nuances and subtleties of the language.

Domain Knowledge: The instructions cover a wide array of subjects, ranging from the culinary delights in RECIPES, the historical significance of HISTORICAL PLACES, and TEMPLES OF ODISHA, to the intellectual pursuits in ARITHMETIC, HEALTH, and GEOGRAPHY. It also explores the artistic tapestry of Odisha through ART AND CULTURE, which celebrates renowned figures in FAMOUS ODIA POETS/WRITERS, and FAMOUS ODIA POLITICAL LEADERS. Furthermore, it encapsulates SPORTS and the GENERAL KNOWLEDGE OF ODISHA, providing an all-encompassing representation of the state. These instructions reflect Odisha’s rich heritage and are a practical and engaging resource for building a conversational AI that resonates with the region’s people. The statistics of the dataset are shown in Table 1. 


Table 1. Details of the data used in the instruction fine-tuning stage.
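Concretely, an instruction-tuning dataset of this kind is usually stored as records with instruction, input, and output fields, in the layout popularized by Alpaca-style datasets. The record below is a hypothetical sketch of that layout; the field names and example text are illustrative assumptions, not taken from the actual OdiaGenAI dataset (where all fields would be in Odia):

```python
# Hypothetical example of one instruction record (Alpaca-style layout).
# The field names and the example text are illustrative assumptions only;
# in the actual dataset the text would be in Odia.
record = {
    "instruction": "How is the famous Odia sweet Rasgulla prepared?",
    "input": "",  # optional extra context; empty for most records
    "output": "Rasgulla is prepared by curdling milk to obtain chhena, ...",
}

# A fine-tuning script would iterate over many such records.
assert set(record) == {"instruction", "input", "output"}
```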


The model was trained on a single NVIDIA A100 PCIe GPU with 40 GB of memory. The A100 offers exceptional computational power for the complex matrix operations required when fine-tuning deep neural networks such as Llama2, and its architecture enables rapid parallel processing, significantly accelerating training. The first training run completed in approximately 2.5 days. The training hyperparameters are shown in Table 2.


Table 2. Training Hyperparameters

Inference


The inference script is adapted from Alpaca-LoRA, using Llama2-7b as the base model together with the fine-tuned model weights.
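Alpaca-LoRA wraps each instruction in a fixed prompt template before passing it to the model. The sketch below follows the original English Alpaca template; the adapted Odia inference script may use a translated or otherwise modified variant:

```python
def build_prompt(instruction, context=""):
    """Wrap an instruction (and optional context) in an Alpaca-style prompt.

    This mirrors the template used by the original Alpaca-LoRA repository;
    the adapted Odia script may use a modified or translated version.
    """
    if context:
        return (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            "### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response "
        "that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    )
```

At inference time, the model's generated text is taken from whatever follows the final `### Response:` marker.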

Text Generation Setup

The decoding process of LLMs plays a critical role in determining the quality and diversity of the generated text. In our experiments, we use the following decoding hyperparameters:

  • Context size: We set the context size to 2048 tokens, determining the maximum number of tokens the model can take into account simultaneously during text generation.
  • Maximum sequence length: We impose a constraint on the generated sequence length, limiting it to 512 tokens to ensure that the outputs remain focused and closely related to the input prompt.
  • Temperature: We set the temperature to 0.2, regulating the level of randomness in the sampling process. Lower values make the model produce more focused and deterministic outputs, while higher values introduce greater diversity at the expense of coherence.
  • Top-k sampling: At each step, we adopt Top-k sampling with k=40, whereby the model selects the subsequent token from the 40 most probable options. This introduces an element of randomness and diversity in the generated text.
  • Top-p sampling: We also employ Top-p sampling with p=0.9. This further enhances diversity by considering a dynamic set of tokens that collectively account for 90% of the probability mass.
  • Repetition penalty: To discourage the model from generating repetitive text, we apply a repetition penalty with a factor of 1.3, penalizing the selection of already chosen tokens.
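Collected together, these settings map naturally onto the keyword arguments of a `model.generate(...)` call in the Hugging Face transformers API. This is a sketch of that mapping, not the exact code of our inference script; note that the context size is a property of the loaded model and tokenizer rather than a generation argument:

```python
# Decoding hyperparameters from the text, expressed as generate() kwargs.
# Context size (2048) is fixed by the model/tokenizer configuration,
# so it is recorded separately rather than passed to generate().
CONTEXT_SIZE = 2048

generation_kwargs = {
    "max_new_tokens": 512,      # cap on generated sequence length
    "temperature": 0.2,         # low randomness, more deterministic output
    "top_k": 40,                # sample from the 40 most probable tokens
    "top_p": 0.9,               # nucleus sampling over 90% probability mass
    "repetition_penalty": 1.3,  # discourage repeating already chosen tokens
    "do_sample": True,          # sampling must be on for the settings above
}
```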


Evaluation metrics such as ROUGE and BLEU were used to assess the quality and performance of the model. ROUGE measures the overlap between the words or n-grams (sequences of words) in the generated text and a reference text, and is widely used to evaluate machine-generated summaries, machine translation, and other text-generation tasks. BLEU assesses the adequacy of translations by comparing them to human-generated reference translations, and is a standard metric in machine translation evaluation. We used 280 samples to calculate the scores: the BLEU score was 0.6158, and the ROUGE score was 0.6583. The evaluation scores are shown in Table 3.


Table 3: Automatic Evaluation Scores
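The reported scores were computed with standard BLEU and ROUGE tooling. To make the idea behind these overlap metrics concrete, here is a deliberately simplified ROUGE-1 recall, which counts how many reference unigrams (with clipped counts) also appear in the candidate; it is an illustration only, not the implementation used for Table 3:

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """Simplified ROUGE-1 recall: fraction of reference unigrams that
    also appear in the candidate, with counts clipped to the candidate.

    Illustrative only -- the reported scores were produced by standard
    BLEU/ROUGE tooling, not this toy function.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    overlap = sum(min(cand[w], count) for w, count in ref.items())
    return overlap / sum(ref.values())

# 5 of the 6 reference unigrams appear in the candidate -> 5/6.
score = rouge1_recall("the cat sat on the mat", "the cat is on the mat")
```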

Human Evaluation

The human evaluation process is a crucial and multifaceted element in assessing the Odia generative model's performance, adhering to stringent ethical guidelines and user safety concerns. Trained evaluators, possessing expertise in linguistics and a profound understanding of the Odia language, play pivotal roles in this assessment.

Apart from the automatic evaluation, native Odia speakers performed a human assessment of the model, asking subjective questions to verify its generative performance and toxicity. Our analysis was based on three metrics: readability, perplexity, and correctness. A graphical analysis of the average scores is presented in Fig. 3.

Three critical metrics often take center stage in text analysis and natural language processing: readability, perplexity, and correctness. Readability, a measure of how easily a text can be comprehended by its intended audience, is pivotal in ensuring clarity and accessibility in written communication. Perplexity is a crucial gauge of language-model quality, quantifying predictive accuracy on observed text; lower perplexity values indicate more proficient models. Lastly, correctness evaluates the accuracy and fidelity of the information in the text, measuring alignment with factual accuracy and adherence to linguistic rules. Together, these metrics help practitioners in fields such as journalism, linguistics, and artificial intelligence enhance the quality and reliability of textual content, ultimately advancing the capabilities of language models and text-based applications.


Fig. 3: The score for human evaluation metrics
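Of the three metrics, perplexity is the only one with a standard closed form: it is the exponential of the average negative log-probability a model assigns to the tokens of a text. A minimal sketch of the computation follows; the per-token probabilities are made-up numbers for illustration, not actual model outputs:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability of the tokens).

    Lower is better: a model that assigns high probability to the
    observed tokens is less 'perplexed' by the text.
    """
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Made-up per-token probabilities for illustration only.
confident = perplexity([0.9, 0.8, 0.85, 0.9])   # high-probability tokens
uncertain = perplexity([0.2, 0.1, 0.3, 0.25])   # low-probability tokens
assert confident < uncertain
```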

In our study, we conducted a comprehensive comparison between our model and ChatGPT 3.5. The results of this comparative analysis are presented in Figure 4, which encapsulates the key findings of our research.



Fig 4: The comparative analysis between our model and ChatGPT 3.5.


Limitations

  • The model still suffers from hallucinations, particularly for long answers.
  • It still struggles with arithmetic problems and critical reasoning.
  • It sometimes answers incorrectly for questions related to Odisha, even though the model was explicitly trained on the domain-knowledge instruction set.
  • The model often produces additional, unnecessary text after giving the correct answer, and it does not perform well at generalization and summarization.
  • Under certain circumstances, the model produces insufficient or no output at all, especially for arithmetic problems or when a long response is expected.


The dataset and model are available for research and non-commercial purposes through Huggingface.


The future work includes:

  • investigating the model's limitations,
  • performing an in-depth comparison of the proposed model with the available multilingual LLMs that support Odia,
  • exploring a distilling step-by-step approach to build smaller models for comparative analysis,
  • releasing a pre-trained Odia LLM following the BLOOM specification.