James Ding Nov 21, 2024 01:30
NVIDIA NeMo Curator aids in processing high-quality Vietnamese language data, enhancing language model training through efficient data curation techniques.
Open-source large language models (LLMs) are often proficient in English, but they face challenges with other languages, particularly those in Southeast Asia, due to a scarcity of training data. Addressing this issue, Viettel Solutions, a subsidiary of Viettel Corporation, has adopted NVIDIA’s NeMo Curator to enhance the processing of high-quality Vietnamese language data, as reported by NVIDIA.
Because Vietnamese text is far scarcer than English text in common training corpora, models trained on those corpora tend to perform worse in Vietnamese. NVIDIA’s NeMo Curator addresses this gap by enabling the creation of the high-quality datasets needed to train effective language models.
Viettel Solutions has used NeMo Curator to train its Llama 3 ViettelSolution 8B model, which now ranks among the top models on the VMLU leaderboard. The tool’s GPU-accelerated features, such as deduplication and filtering, increased model accuracy by 10%, cut training time by a factor of three, and reduced dataset size by 60%, according to Tuan Nguyen, Head of Data Analytics at Viettel Solutions.
The data curation process includes downloading datasets from various sources, reformatting Unicode, deduplicating, and applying quality filtering. The sources are the Vietnamese subsets of C4, OSCAR, and Wikipedia, which are combined into a single dataset for training. NeMo Curator applies both heuristic and classifier-based filtering to improve data quality, removing noise while preserving the diversity of the content.
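The first steps of such a pipeline (merging sources, reformatting Unicode, and deduplicating) can be illustrated with a minimal sketch. It uses only the Python standard library and is not the NeMo Curator API. The function names `normalize_text`, `exact_dedup`, and `curate` are hypothetical, and the sketch assumes NFC normalization and exact-match deduplication, which the article does not specify.

```python
import hashlib
import unicodedata


def normalize_text(text: str) -> str:
    """Normalize to NFC so visually identical Vietnamese strings compare equal."""
    return unicodedata.normalize("NFC", text).strip()


def exact_dedup(docs):
    """Keep the first occurrence of each document, compared by content hash."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def curate(sources):
    """Merge several sources, normalize Unicode, then drop empty and duplicate docs."""
    merged = [normalize_text(doc) for source in sources for doc in source]
    merged = [doc for doc in merged if doc]  # discard documents that are empty after stripping
    return exact_dedup(merged)
```

Normalizing before hashing matters for Vietnamese in particular: the same word can be encoded with precomposed characters or with base letters plus combining diacritics, and without normalization the two encodings would be treated as different documents.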
Heuristic filtering removes low-quality content using predefined rules, while classifier-based filtering uses a trained model to distinguish high-quality from low-quality data. Combining the two keeps the dataset both broad and clean, which is crucial for effective language model training.
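The difference between the two kinds of filter can be shown with a toy example. This is a sketch under stated assumptions, not NeMo Curator's implementation: the rules in `passes_heuristics` (a minimum word count and a maximum symbol ratio) and their thresholds are illustrative, and `TinyQualityClassifier` is a hypothetical naive Bayes model standing in for whatever trained classifier a real pipeline would use.

```python
import math
from collections import Counter


def passes_heuristics(text, min_words=5, max_symbol_ratio=0.3):
    """Rule-based check: enough words, and not dominated by symbol characters."""
    words = text.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


class TinyQualityClassifier:
    """Toy naive Bayes over word counts; label 1 = high quality, 0 = low quality."""

    def fit(self, texts, labels):
        self.counts = {0: Counter(), 1: Counter()}
        self.priors = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(text.lower().split())
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def _log_score(self, words, label):
        # Laplace-smoothed log-probability of the words under one class.
        total = sum(self.counts[label].values()) + len(self.vocab)
        score = math.log(self.priors[label] / sum(self.priors.values()))
        for word in words:
            score += math.log((self.counts[label][word] + 1) / total)
        return score

    def is_high_quality(self, text):
        words = text.lower().split()
        return self._log_score(words, 1) > self._log_score(words, 0)


def filter_docs(docs, clf):
    """Keep documents that pass the fixed rules and the learned classifier."""
    return [d for d in docs if passes_heuristics(d) and clf.is_high_quality(d)]
```

The rules catch documents that are obviously malformed, such as fragments or symbol spam. The classifier catches text that is well-formed on the surface but resembles known low-quality examples, which fixed rules would let through.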
The curation process significantly reduces dataset size by removing low-quality and redundant content, with classifier-based filtering alone accounting for a 45% reduction. What remains is a smaller, cleaner dataset suitable for pretraining language models.
NVIDIA’s NeMo Curator gives Viettel Solutions a robust tool for curating high-quality Vietnamese language data, which in turn improves the performance of its language models. By improving data quality and processing efficiency, it supports the company’s goal of leading in generative AI and developing AI-powered products for the Vietnamese market.