How did we train Vietcuna? 

From data collection to training procedure

Sep 9th, 2023

Vietcuna was launched by ViLM in August 2023 as an effort to democratize AI for the underrepresented. As the first-ever open-source Vietnamese Large Language Model, Vietcuna has reached 36,000+ users on our Web Playground and 12,000+ downloads on HuggingFace.

As part of our release, we are also announcing our full Vietcuna training pipeline as an example that others can reproduce for other languages and use cases.

Choosing a base model

With limited GPU resources, we restricted ourselves to models of 7B parameters or fewer. This left us with just a few choices: LLaMA, GPT-J, GPT-2, and BLOOM/BLOOMZ.

Since Vietnamese is the main focus of Vietcuna, we further investigated each candidate's tokenizer and the amount of Vietnamese text it had seen during pre-training. We quickly eliminated LLaMA and GPT-J: LLaMA was pre-trained almost entirely on English, and GPT-J saw very few Vietnamese tokens during pre-training. We ultimately went with BLOOMZ, as it offers more configurations (3B and 7B) than GPT-2. We chose BLOOMZ over BLOOM because it was fine-tuned on multi-task instructional data, making it a better base model.

It's hard to find a good tokenizer that would work for all languages
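One quick way to compare candidates is tokenizer "fertility", the average number of tokens produced per word: a tokenizer with little Vietnamese in its training data falls back to byte-level pieces and fragments words heavily. Below is a minimal, self-contained sketch of the metric; the two toy tokenizers are stand-ins for the real model tokenizers, not our actual evaluation code.

```python
# Sketch: compare tokenizer "fertility" (tokens per word) on Vietnamese text.
# Lower fertility suggests better coverage of the language.

def fertility(tokenize, text):
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    return sum(len(tokenize(w)) for w in words) / len(words)

# Stand-in for a tokenizer with good Vietnamese coverage: one token per word.
word_level = lambda w: [w]
# Stand-in for a byte-level fallback: one token per UTF-8 byte.
byte_level = lambda w: list(w.encode("utf-8"))

sample = "Xin chào các bạn"  # "Hello everyone"
print(fertility(word_level, sample))  # 1.0
print(fertility(byte_level, sample))  # 4.25: diacritics cost extra bytes
```

The same function can be pointed at real tokenizers (e.g. ones loaded via HuggingFace `AutoTokenizer`) to rank candidate base models on a Vietnamese corpus.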

Continued Pre-training

Most of an LLM's knowledge is learned during the pre-training phase, while the fine-tuning phase is better suited to specific use cases (such as instructional question answering). Therefore, we collected more publicly available Vietnamese data, crawling news articles from Vietnamese online news sites such as VnExpress, Zing News, and BaoMoi. We ended up with 12GB of raw Vietnamese text.
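A crawled corpus like this needs a cleaning pass before pre-training. The sketch below shows one minimal, hypothetical version (not our exact pipeline): Unicode-normalize the text, since Vietnamese diacritics have multiple encodings, and drop exact duplicates by content hash. The crawling itself is site-specific and omitted.

```python
import hashlib
import unicodedata

def clean_corpus(articles):
    """Normalize crawled articles and drop exact duplicates."""
    seen, kept = set(), []
    for text in articles:
        text = unicodedata.normalize("NFC", text.strip())
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if text and digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept

raw = ["Tin tức hôm nay.", "Tin tức hôm nay.", "  Bài báo khác.  "]
print(clean_corpus(raw))  # duplicate removed, whitespace trimmed
```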

Since our GPU resources were limited (GPU poor), we adopted a new low-rank adaptation pre-training technique called ReLoRA (Lialin et al., 2023). ReLoRA trains the model with a low-rank update for a fixed number of steps, merges that update into the full weights, then restarts with fresh low-rank matrices and repeats the process. This lets the model absorb new knowledge in Vietnamese through many low-rank updates instead of a full fine-tuning, greatly reducing the GPU requirements for pre-training an LLM.

Vietcuna Continued Pre-training Process
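The merge-and-restart loop at the heart of ReLoRA can be sketched in a few lines. This is a toy illustration of the idea, not the paper's implementation: the low-rank "training" is a stub, and toy dimensions stand in for real model sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                # hidden size and LoRA rank (toy values)
W = rng.normal(size=(d, d))  # the frozen full-rank weight

def train_low_rank(steps):
    """Stub for a short low-rank training run: returns learned A, B."""
    A = rng.normal(scale=0.01, size=(r, d))
    B = np.zeros((d, r))   # B starts at zero so the update begins as a no-op
    for _ in range(steps):
        B += 0.01 * rng.normal(size=(d, r))  # placeholder for real gradients
    return A, B

for restart in range(3):   # several merge-and-restart cycles
    A, B = train_low_rank(steps=100)
    W += B @ A             # merge the rank-r update into the full weights
    # A and B are re-initialized at the top of the next cycle

print(W.shape)  # (8, 8): the full weight has absorbed three rank-2 updates
```

Because each individual update is rank-r, only the small A and B matrices need optimizer state during each cycle, which is where the memory savings come from; the sum of many such updates can still be high-rank.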

Instructional & Conversational Fine-tuning

We followed popular data augmentation techniques including Evol-Instruct, ShareGPT, Orca, Chain-of-Thought, and Alpaca. We translated 30% of those datasets into Vietnamese via the Google Translate API, then used Google's text-bison-001 to further augment the synthetic data. The complete fine-tuning dataset comprises 200K instructional question-answer pairs and 400K conversation samples.
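The translation step can be sketched as follows. This is an illustrative sketch, not our production code: `translate_to_vi` is a stand-in for the Google Translate API call, and the sample data is synthetic.

```python
import random

def translate_to_vi(text):
    """Stand-in for the real Google Translate API call."""
    return f"[vi] {text}"

def build_vietnamese_subset(samples, fraction=0.3, seed=0):
    """Sample a fraction of the dataset and translate both fields."""
    rng = random.Random(seed)
    chosen = rng.sample(samples, k=int(len(samples) * fraction))
    return [
        {"instruction": translate_to_vi(s["instruction"]),
         "output": translate_to_vi(s["output"])}
        for s in chosen
    ]

data = [{"instruction": f"q{i}", "output": f"a{i}"} for i in range(10)]
subset = build_vietnamese_subset(data)
print(len(subset))  # 3: 30% of the 10 samples are translated
```

In practice the translated subset would then be passed to an LLM (text-bison-001 in our case) for further augmentation.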

Finally, we chose QLoRA (Dettmers et al., 2023) as our fine-tuning method, since it works well on low-resource hardware. However, we applied QLoRA to all linear layers, not just the attention layers, to help the LLM adapt to more complex tasks.

Vietcuna Finetuning Pipeline
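"All linear layers" in practice means enumerating every Linear module name (attention and MLP projections alike) and passing that list as the adapter's target modules, e.g. via `target_modules` in PEFT's `LoraConfig`. The sketch below shows the selection logic on a toy module table, using hypothetical BLOOM-style module names, so it runs without any ML dependencies.

```python
# Toy stand-in for a model: a mapping from module name to module type.
modules = {
    "transformer.h.0.self_attention.query_key_value": "Linear",
    "transformer.h.0.self_attention.dense": "Linear",
    "transformer.h.0.mlp.dense_h_to_4h": "Linear",
    "transformer.h.0.mlp.dense_4h_to_h": "Linear",
    "transformer.h.0.input_layernorm": "LayerNorm",
}

def all_linear(modules):
    """Return the sorted names of every Linear module."""
    return sorted(name for name, kind in modules.items() if kind == "Linear")

targets = all_linear(modules)
print(len(targets))  # 4: both attention and MLP projections, no layer norms
```

Targeting only the attention layers would have skipped the two MLP projections, which hold a large share of the trainable parameters.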

With the introduction of Vietcuna, we not only usher in a new era of Vietnamese technological advancement but also make strides in bridging the AI representation gap. Beyond being a mere language model, Vietcuna stands as a powerful tool empowering Vietnamese enterprises to navigate unexplored terrains of innovation. By fine-tuning this model to embrace our unique linguistic and cultural nuances, we envision Vietcuna as a driving force for crafting tech solutions that resonate with our local community and extend beyond our borders. Moreover, by open-sourcing Vietcuna and its data pipeline, we aspire to set a blueprint for other languages, catalyzing a quicker and more efficient path to developing their own bespoke LLMs. Together, let's democratize AI access and reshape the global tech narrative, language by language.