Evolving AI data from good to great – Samsung Newsroom Peru

This visit to Samsung Research in Vietnam is part of a series about the people and innovations enabling mobile AI to improve more lives.

Samsung is a pioneer in premium mobile AI experiences. To learn how Galaxy AI is maximizing the potential of its users, we visited Samsung Research centers around the world. Now supporting 16 languages, Galaxy AI lets more people expand their language capabilities, even offline, through on-device translation in features such as Simultaneous Translation, Interpreter, Note Assistant, and Navigation Assistant. Recently, we visited Jordan to learn the intricacies of developing an AI model for Arabic, a language with many dialects. This time, we are going to Vietnam to explore how data is prepared to train AI models.

What is the difference between ghost, grave and mother in Vietnamese? For a language spoken by 97 million people around the world, it is very little. The words translate as “ma”, “mả” and “má”, respectively, and can be distinguished only by tone. This illustrates how difficult it can be for AI models to learn a language, considering that they cannot perceive first-hand the context and emotions of conversations or the intentions of those speaking.

Samsung Research and Development Institute (SRV) in Vietnam used carefully vetted data to help its AI model properly recognize even the most subtle differences in language.

The quality of the data used directly affects the accuracy of automatic speech recognition (ASR), neural machine translation (NMT), and text-to-speech (TTS), the processes that help Galaxy AI features such as Simultaneous Translation, Interpreter, Writing Assistant and Navigation Assistant break language barriers.

A typhoon of challenges

“Vietnamese is a complex and diverse language with rich expressions, many of which are difficult to capture,” says Ngô Hồng Thái, NMT leader at SRV. Of the 16 languages Galaxy AI supports, Vietnamese was particularly difficult to develop.

“Personally, creating an AI model for the Vietnamese language was more challenging than our typhoons!” he adds before explaining the obstacles faced during the development process.

Vietnamese is a tonal language with six different tones. As evident in the “ma” example above, small nuances in vocalization can drastically alter the meaning of words. Therefore, a meticulous and detailed approach was necessary.

“When breaking down similar-sounding words, a word consists of several short segments or ‘sets of frames,’” says Bui Ngoc Tung, ASR lead at SRV. “The AI model differentiates between short audio frames of around 20 milliseconds to recognize which words correspond to a given set of consecutive frames. As such, it is essential to put a lot of effort into the early stages of the AI learning process.”
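To make the idea of short audio frames concrete, here is a minimal sketch of cutting a recording into consecutive 20-millisecond pieces. It is an illustration only, not a description of SRV's system: the function name `split_into_frames`, the 16 kHz sample rate and the choice of non-overlapping frames are all assumptions, and real speech recognizers often use overlapping windows.

```python
def split_into_frames(samples, sample_rate=16_000, frame_ms=20):
    """Split audio samples into consecutive, non-overlapping frames.

    The 16 kHz sample rate is an assumption for this example;
    the article only mentions frames of around 20 milliseconds.
    """
    frame_len = sample_rate * frame_ms // 1000  # samples per frame
    # Drop any trailing partial frame so every frame has the same length.
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# One second of silence stands in for a real recording.
one_second = [0.0] * 16_000
frames = split_into_frames(one_second)
print(len(frames), len(frames[0]))  # 50 320
```

Under these assumptions, one second of speech becomes 50 frames of 320 samples each, and a model has to decide which word a run of consecutive frames belongs to.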

Additionally, homophones and homonyms are common in Vietnamese. Typically, people can rely on context and nonverbal elements in conversations to differentiate between words that sound or are spelled the same but have different meanings. However, AI models need to be taught to accurately identify and differentiate between similar tones and words.

“This is not a simple task,” explains Thái. “Apart from the quantity, the data must be accurate to ensure that the models are able to recognize the linguistic nuances that exist in Vietnamese.”

Rigorous preparation

The data refinement process consists of three steps. First, the audio and text used to train the AI model must be reviewed and corrected. This data set then goes through random overall quality checks. Finally, the data set is normalized and cleaned before use in training.
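The article does not say how SRV implements these steps, so the following is only a hedged sketch of what random spot checks and text normalization could look like for Vietnamese transcripts. The helper names (`normalize_text`, `sample_for_review`, `clean_corpus`) are hypothetical. The example relies on one known fact: a letter with a tone mark such as “ả” can be stored either as a single precomposed character or as a base letter plus a combining mark, and Unicode NFC normalization maps both to the same form.

```python
import random
import re
import unicodedata


def normalize_text(line):
    """Put tone marks into one canonical Unicode form and tidy whitespace."""
    line = unicodedata.normalize("NFC", line)
    return re.sub(r"\s+", " ", line).strip()


def sample_for_review(records, k, seed=0):
    """Pick k records at random for a manual quality check."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def clean_corpus(lines):
    """Normalize every line, then drop empty lines and exact duplicates."""
    seen = set()
    cleaned = []
    for line in lines:
        text = normalize_text(line)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


# "mả" written with a combining hook above (U+0309) versus the
# precomposed letter (U+1EA3): visually identical, different code points.
decomposed = "ma\u0309"
precomposed = "m\u1ea3"
print(decomposed == precomposed)                                  # False
print(normalize_text(decomposed) == normalize_text(precomposed))  # True
```

Without a normalization step like this, two spellings that look identical on screen would count as different words in the training data.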

“We conducted a thorough series of tests to check the accuracy of our dataset,” says Nguyen Manh Duy, TTS leader at SRV, who oversees database creation. “We faced a number of unexpected problems, such as misspelled words in scripts and background noise or incorrect pronunciation during audio recordings. We spend a lot of time refining and improving our training data.”

In addition to the unique linguistic challenges of Vietnamese, there is a lack of universally accessible data compared with more widely spoken languages. “This is another reason why the data refinement stage is so important,” he adds. “As we had limited sources, every piece of information had to be completely reliable. There was no margin for error.”

Additionally, the AI model for Vietnamese must account for both tone and regional differences. To improve the accuracy of the AI model, the team collected large amounts of data covering the accents of northern, central and southern Vietnam, resulting in a huge amount of information to refine and verify.

Continuous improvement

SRV developers completed the project after months of hard work and Vietnamese became one of the first languages supported by Galaxy AI. Despite this success, the team works tirelessly to improve the language experience.

“We continue to improve the AI model by incorporating user feedback on the relevance of words and phrases in Galaxy AI,” says Tran Tuan Minh, AI language development project leader at SRV. “We have just taken our first steps towards a more open world and we have much more to explore together.”