Introducing OdiaGenAI's Qwen-Based Pre-trained LLM for Odia Language

Author: OdiaGenAI Team


Overview

OdiaGenAI has launched its first pre-trained Odia Large Language Model (LLM), Qwen_1.5_Odia_7B, as part of its effort to develop Generative AI and LLM-based technologies for Odia and other Indic languages. The model is built on Qwen_1.5-7B, chosen for its strong performance across various evaluation benchmarks; notably, the larger Qwen_1.5-72B outperforms Llama2-70B across all benchmarks, reflecting the Qwen family's language understanding, reasoning, and mathematical abilities.

Dataset

We used the CulturaX-Odia dataset, a cleaned Odia subset of the original CulturaX corpus. The training data comprises 49 million tokens and is derived from a combination of mC4 and four separate OSCAR corpora.
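
For readers who want to inspect the underlying corpus, the sketch below loads the public Odia configuration of CulturaX with the Hugging Face datasets library. The repository name "uonlp/CulturaX" and its "or" (Odia) configuration are assumptions about how the public corpus is hosted (access may require accepting the dataset terms on the Hub); the filtered CulturaX-Odia split used for this model is not published under that name.

```python
# Sketch: peek at the Odia portion of the public CulturaX corpus.
from datasets import load_dataset

# Streaming avoids downloading the whole split.
odia_raw = load_dataset("uonlp/CulturaX", "or", split="train", streaming=True)

sample = next(iter(odia_raw))
print(sample["text"][:200])
print("approx. whitespace tokens:", len(sample["text"].split()))
```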

Tokenizer

Qwen-7B applies byte-level BPE tokenization over UTF-8 text using the tiktoken package. It has two kinds of tokens: regular BPE tokens (of type bytes) and special/control tokens (of type str). The vocabulary cannot be extended simply by adding new words, because BPE requires the corresponding intermediate merges, so we did not perform vocabulary expansion.
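
The effect of byte-level BPE on Odia text can be seen by encoding a short phrase, as in the sketch below. It assumes the tokenizer published with Qwen/Qwen1.5-7B on the Hugging Face Hub; Odia character sequences that have no merges in the vocabulary fall back to multiple byte-level pieces.

```python
# Sketch: inspect how the Qwen1.5 byte-level BPE tokenizer splits Odia text.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B")

text = "ଓଡ଼ିଆ ଭାଷା"  # "Odia language"
ids = tok(text)["input_ids"]
pieces = tok.convert_ids_to_tokens(ids)

# Without Odia-specific merges, the token sequence is typically much
# longer than the number of characters in the input.
print(len(text), "characters ->", len(ids), "tokens")
print(pieces)
```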

We performed a comparative study of various tokenizers on Odia, English, and Odia-English mixed text, as shown in Figure 1.


Figure 1: Token counts of the given text under different tokenizers
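
A comparison like Figure 1 can be reproduced along the lines of the sketch below: tokenize the same English, Odia, and mixed samples with several publicly available tokenizers and compare the counts. The tokenizer names and sample sentences are illustrative choices, not the exact set used for the figure.

```python
# Sketch: count tokens produced by different tokenizers for the same text.
from transformers import AutoTokenizer

samples = {
    "English": "The weather is pleasant in Bhubaneswar today.",
    "Odia": "ଆଜି ଭୁବନେଶ୍ୱରରେ ପାଗ ଭଲ ଅଛି ।",
    "Mixed": "Today ଭୁବନେଶ୍ୱରରେ weather ଭଲ ଅଛି ।",
}

for name in ["Qwen/Qwen1.5-7B", "gpt2", "google/mt5-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    counts = {
        lang: len(tok(text, add_special_tokens=False)["input_ids"])
        for lang, text in samples.items()
    }
    print(name, counts)
```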

Training

We used high-performance computing servers powered by AMD Instinct accelerators and AMD EPYC processors. The model was trained for 10 epochs; the full training took 8 days and 3 hours. The training hyperparameters are shown in Table 1, and the learning curves in Figures 2 and 3.


Table 1: Training Hyperparameters


Figure 2: Training Loss Curve


Figure 3: Learning Rate Curve
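
The training itself is standard continued pre-training of a causal language model on the Odia corpus. Below is a minimal sketch of such a setup with Hugging Face transformers; the hyperparameter values are illustrative placeholders rather than the settings in Table 1, and the tiny in-line dataset stands in for CulturaX-Odia.

```python
# Minimal sketch of continued causal-LM pre-training with Hugging Face
# transformers. Hyperparameter values are placeholders, not the Table 1 settings.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "Qwen/Qwen1.5-7B"
tok = AutoTokenizer.from_pretrained(base)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Toy stand-in for the tokenized CulturaX-Odia corpus.
corpus = Dataset.from_dict({"text": ["ଓଡ଼ିଆ ଭାଷା ଏକ ଶାସ୍ତ୍ରୀୟ ଭାଷା ।"]})
tokenized = corpus.map(
    lambda batch: tok(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="qwen_1.5_odia_7b",
    num_train_epochs=10,               # as reported above
    per_device_train_batch_size=4,     # placeholder
    gradient_accumulation_steps=8,     # placeholder
    learning_rate=2e-5,                # placeholder
    lr_scheduler_type="cosine",        # placeholder
    bf16=True,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```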

Inference

Below are a few inference samples for the model.
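
The samples can be generated with a standard text-generation call, sketched below using the published checkpoint and the transformers generate API. The prompt and the max_new_tokens value are illustrative; the token limit is what truncates some of the outputs in Table 2.

```python
# Sketch: greedy generation with the released checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OdiaGenAI-LLM/qwen_1.5_odia_7b"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "ଓଡ଼ିଶାର ରାଜଧାନୀ"  # "The capital of Odisha"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```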


Table 2: Good quality outputs. The incomplete sentences are a result of the max token length that we set during inference.


Table 3: Outputs with hallucination and errors

Evaluation

The model evaluation on different benchmarks is in progress.

License

Qwen_1.5_Odia_7B is released under the Apache-2.0 license, which allows commercial use.

Availability

The model is available on Hugging Face at: https://huggingface.co/OdiaGenAI-LLM/qwen_1.5_odia_7b

Conclusion

The released model is a pre-trained base model, not an instruction-tuned one, so it needs to be fine-tuned on a specific task before use; it may not follow human instructions without one-/few-shot prompting or instruction fine-tuning. The model has no moderation mechanisms and may generate harmful or inappropriate responses. We therefore highly recommend first fine-tuning it on the task(s) you are interested in.
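
For quick experiments without fine-tuning, one-shot prompting can coax the base model into a task format, as in the sketch below. The question/answer prompt layout is only an example, not a prescribed template for this checkpoint.

```python
# Sketch: one-shot prompting of the base model (not instruction tuned).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OdiaGenAI-LLM/qwen_1.5_odia_7b"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# One solved example ("What is the capital of India?" -> "New Delhi"),
# then the actual question ("What is the capital of Odisha?").
prompt = (
    "ପ୍ରଶ୍ନ: ଭାରତର ରାଜଧାନୀ କ'ଣ?\n"
    "ଉତ୍ତର: ନୂଆଦିଲ୍ଲୀ\n"
    "ପ୍ରଶ୍ନ: ଓଡ଼ିଶାର ରାଜଧାନୀ କ'ଣ?\n"
    "ଉତ୍ତର:"
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
# Print only the newly generated continuation.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```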

The instruction-tuned (chat) model will be released soon.

Limitations

The manual evaluation observed the following limitations:

  • Hallucination - As shown in Table 3, there are instances where the model hallucinates after producing the correct response.
  • Tokenization - As the tokenization comparison in Figure 1 shows, our tokenizer is almost 10 times more efficient on English than on Odia. In effect, the model produces Odia roughly one character at a time while producing English roughly one word at a time, which puts considerable pressure on the model and makes it more prone to errors.
  • Insufficient domain knowledge - Since the model is a base model, its domain knowledge is fairly limited.
  • Abrupt stopping - The inference process sometimes stops in the middle of producing a correct response.

Contributors

  • Sambit Sekhar
  • Dr. Shantipriya Parida
  • Debasish Dhal
  • Guneet Singh Kohli

Acknowledgement

We are thankful to AMD for the necessary infrastructure support.