Shulou(Shulou.com)11/24 Report--
From August 20 to 24, INTERSPEECH 2023, the world's largest scientific event in the field of speech, was held in Dublin, Ireland. Two academic papers from NetEase Yidun were officially accepted by INTERSPEECH, sharing the team's research results with the world's top academic community.
This is the first time the Yidun AI team has had papers accepted by a top international academic conference since ICASSP. With this, NetEase Yidun has now had papers accepted at both of the world's top speech conferences.
INTERSPEECH enjoys a high reputation and extensive academic influence worldwide. It is the flagship international conference founded by the International Speech Communication Association (ISCA) and the world's largest scientific event in the field of speech signal processing. It covers many areas, such as speech recognition, speech synthesis, speech enhancement, and natural language processing, and every year it attracts thousands of scholars, engineers, and entrepreneurs from around the world to exchange ideas and exhibit their work.
According to INTERSPEECH 2023, thousands of attendees came from dozens of countries, including China, the United States, Japan, the United Kingdom, France, Germany, and India, and the conference received more than 3,000 papers from the world's top laboratories, universities, and research teams. NetEase Yidun had two papers accepted at INTERSPEECH 2023, titled "Enhancing the Unified Streaming and Non-streaming Model with Contrastive Learning" and "Language-Routing Mixture of Experts for Multilingual and Code-Switching Speech Recognition".
The Yidun AI algorithm research team also attended the conference in person to meet researchers, answer questions, and demonstrate its latest speech technology, contributing to the global speech research community and providing a reference for academic exchange among researchers.
01.
"Hello Jarvis"? Voice AI can be achieved!
"Hello, Jarvis."
"at your service, sir."
In the movie "the Avengers," Iron Man and his AI smart butler Jarvis show us a future smart home scene full of technological ideas. At that time, we may marvel at the surreal concept depiction of the film, but today, with the continuous progress of artificial intelligence (AI,Artificial Intelligence) technology, the vision of sci-fi movies into reality is not out of reach.
In the movie, the communication and collaboration between Iron Man and his AI assistant Jarvis happen entirely through voice conversation. So if we want to bring such a future scene into reality as soon as possible, combining speech recognition with artificial intelligence is the key. Speech recognition is the technology of converting human speech signals into text or commands; it involves speech signal processing, natural language processing, and other fields. It lets us interact with devices such as computers and mobile phones by voice, improving the efficiency and convenience of input and operation. For example, we can use voice to search for information, send text messages, make phone calls, or control smart-home devices. An AI voice assistant such as Jarvis is an intelligent service built on speech recognition: it understands users' voice instructions and provides the corresponding services or information according to their needs.
Of course, applying a technological breakthrough to personal life scenarios is only part of its value; combining the technology with enterprise services is what maximizes that value.
02.
Contrastive Learning, Code-Switching, and Digital Content Risk Control
Taking NetEase Yidun's two papers as examples, this section explains how voice AI technology is applied to digital content risk control scenarios and adds value for customers.
In Yidun's intelligent voice detection business, there are both real-time (streaming) and offline (non-streaming) detection requirements. A unified streaming/non-streaming model means that a single model can meet the recognition needs of both scenarios, which reduces the cost of model development, training, and deployment. In actual use, the model's performance remains the AI team's focus, and in most scenarios the unified model exhibits two performance gaps, as shown in the figure below.
(1) In the unified model, non-streaming recognition performs better than streaming recognition.
(2) A purely offline model trained entirely in non-streaming mode performs better than the unified model decoding in non-streaming (offline) mode.
The Yidun AI team wants both gaps to be as small as possible: on the one hand, streaming recognition should come close to non-streaming recognition; on the other hand, the unified model should suffer no performance loss compared with a purely offline model. Further improving the performance of the unified model is a challenging problem. From the perspective of model representations, if the streaming representation can be brought closer to the non-streaming one, then what streaming recognition captures will be more similar to what non-streaming recognition captures, which means streaming recognition can approach non-streaming recognition in effect.
Based on this motivation, the Yidun AI algorithm team proposes using contrastive learning to narrow the inherent representation gap between the streaming and non-streaming modes, thereby improving the performance of the unified model, as shown in the figure below.
The team treats the streaming and non-streaming representations of each frame as a positive pair and randomly samples multiple negative samples from other frames of the non-streaming output, using contrastive learning to pull the positive pair together and push the negatives apart. Through this streaming/non-streaming contrastive learning, both modes are trained at the same time; a minimal sketch of such a frame-level contrastive loss is given below.
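The following is a minimal sketch, not the authors' implementation: it assumes the streaming and non-streaming representations are two (batch, time, dim) encoder outputs for the same utterance (for example, from a chunk-limited and a full-context forward pass), and the function name, shapes, and hyperparameters are illustrative only.

```python
# Sketch of a frame-level InfoNCE-style loss between streaming and
# non-streaming encoder outputs (illustrative, not the paper's exact code).
import torch
import torch.nn.functional as F

def streaming_contrastive_loss(stream_repr, full_repr, num_negatives=10, temperature=0.1):
    """For each frame, the positive is the non-streaming representation of the
    same frame; negatives are randomly sampled from other non-streaming frames."""
    B, T, D = stream_repr.shape
    stream = F.normalize(stream_repr, dim=-1)
    full = F.normalize(full_repr, dim=-1)

    # Positive similarity: same frame, streaming vs. non-streaming.
    pos_sim = (stream * full).sum(-1, keepdim=True)                 # (B, T, 1)

    # Random negative frame indices from the non-streaming output
    # (for simplicity, collisions with the positive frame are not excluded).
    neg_idx = torch.randint(0, T, (B, T, num_negatives), device=stream.device)
    negs = torch.gather(
        full.unsqueeze(1).expand(B, T, T, D), 2,
        neg_idx.unsqueeze(-1).expand(-1, -1, -1, D))                # (B, T, K, D)
    neg_sim = torch.einsum("btd,btkd->btk", stream, negs)           # (B, T, K)

    # Index 0 (the positive) is the target class for every frame.
    logits = torch.cat([pos_sim, neg_sim], dim=-1) / temperature
    targets = torch.zeros(B * T, dtype=torch.long, device=stream.device)
    return F.cross_entropy(logits.view(B * T, -1), targets)
```

In practice this loss would be added, with some weight, to the usual ASR training objective so that the two modes are optimized jointly.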
The effectiveness of the algorithm was verified on open-source datasets and in Yidun's business scenarios, and the results show that the unified model trained with contrastive learning improves significantly. On business data, this method delivered in the short term an improvement that would otherwise have required a quarter's worth of data accumulation.
In addition, in multilingual speech scenarios, both monolingual speech and code-switching speech containing two or more languages are widespread, so a multilingual speech recognition system needs to support both at the same time. For this reason, the Yidun AI team designed a mixed-language recognition method that introduces a language "routing" mechanism into a Mixture of Experts (MoE), called LR-MoE for short. In the expert module, LR-MoE routes frames of different languages to the corresponding "expert" sub-network, which reduces computational overhead while improving recognition of both multilingual and code-switched speech. A simplified sketch of such a language-routed layer appears below.
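As a rough illustration only (this is not the paper's exact architecture), the sketch below shows a frame-level language classifier routing each frame to a language-specific expert feed-forward network; the class name, dimensions, and the optional `forced_lang` argument are assumptions made for the example.

```python
# Illustrative LR-MoE-style layer: a frame-level language router plus
# per-language expert feed-forward networks (a sketch, not the paper's code).
import torch
import torch.nn as nn

class LanguageRoutedMoE(nn.Module):
    def __init__(self, dim: int, num_languages: int, hidden: int = 1024):
        super().__init__()
        self.router = nn.Linear(dim, num_languages)   # frame-level language classifier
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_languages))

    def forward(self, x, forced_lang=None):
        """x: (batch, time, dim). forced_lang: optional language id that bypasses
        the router when the language is configured manually."""
        lang_logits = self.router(x)                   # (B, T, num_languages)
        if forced_lang is None:
            lang_ids = lang_logits.argmax(-1)          # automatic frame-level routing
        else:
            lang_ids = torch.full(x.shape[:2], forced_lang,
                                  dtype=torch.long, device=x.device)
        out = torch.zeros_like(x)
        for lang, expert in enumerate(self.experts):
            mask = lang_ids == lang                    # frames assigned to this expert
            if mask.any():
                out[mask] = expert(x[mask])
        # The router logits could also carry an auxiliary language-identification
        # loss during training (an assumption of this sketch).
        return out, lang_logits
```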
In actual business, users often have the following requirements when using multilingual speech recognition systems:
1. Manually configuring the language, to recognize speech in a specific language, for example on a content platform serving a particular country or region;
2. Automatic speech recognition of any language when the language is unknown, for example on a multilingual content platform.
Combining these business requirements with the proposed method, the Yidun AI team designed an LR-MoE-based multilingual speech recognition architecture that supports multilingual, multi-requirement intelligent speech content detection through a built-in frame-level language classifier and flexible configuration.
This architecture supports multilingual monolingual and code-switching speech recognition at the same time, reduces confusion between languages, and improves recognition by more than 10% in real multilingual business. It also lets users either configure the language explicitly or rely on adaptive recognition, enabling intelligent voice content risk control for enterprises going overseas; a brief usage sketch covering both modes follows.
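A short usage sketch for the two requirements above, reusing the illustrative LanguageRoutedMoE layer from the earlier example (all dimensions and language ids are made up):

```python
import torch

layer = LanguageRoutedMoE(dim=256, num_languages=4)
frames = torch.randn(2, 100, 256)             # dummy encoder output: (batch, time, dim)

out_auto, lang_logits = layer(frames)         # requirement 2: language unknown, automatic routing
out_fixed, _ = layer(frames, forced_lang=0)   # requirement 1: language configured manually (e.g., id 0)
```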
03.
A regular guest at top academic conferences: NetEase Yidun AI Lab
NetEase Yidun, NetEase Group's one-stop digital content risk control brand, provides professional and reliable security services for digital business customers, covering three major areas: content security, business security, and mobile security, and ensuring customers' business compliance, soundness, and safe operation in an all-round way.
NetEase Yidun has long recognized that technological innovation can bring exponential increases in the value of its products and services, and it established NetEase Yidun AI Lab; the two papers accepted this time come from this team. As the technical team at the forefront of artificial intelligence research within NetEase Yidun, the AI Lab is committed to building comprehensive, rigorous, secure, and reliable AI capabilities centered on refinement, lightweight design, and agility, and to continuously improving the level of digital content risk control services. The team has previously won a number of AI algorithm competitions and important awards:
The highest Grade A certificate in the flag recognition track of the first China Artificial Intelligence Competition, 2019
The highest Grade A certificate in the video deepfake detection track of the second China Artificial Intelligence Competition, 2020
Two highest Grade A certificates, in the video deepfake detection and audio deepfake detection tracks of the third China Artificial Intelligence Competition, 2021
"Innovation Star" and "Innovation Figure" awards of the China Artificial Intelligence Industry Development Alliance, 2021
Double-track champion of the Chinese long- and short-video live voice keyword (VKW) tracks in the "Long and Short Video Multilingual Multimodal Recognition Competition" at the 16th National Conference on Man-Machine Speech Communication (NCMMSC2021), 2021
First Prize of the Science and Technology Progress Award issued by the Zhejiang Provincial Government, 2021
Winner of Track 3, "Multimodal Subtitle Recognition System with Visual and Audio Integration", in the 2022 ICPR Multimodal Subtitle Recognition (MSR) Competition, the first multimodal subtitle recognition contest in China
In 2023, the paper "Improving CTC-based ASR Models with Gated Interplayer Collaboration" (improving CTC-based models for a stronger model structure) was accepted at ICASSP
NetEase Yidun AI Lab, now a regular guest at top academic conferences, will continue to conduct in-depth research in various AI directions, including voice AI, and keep using technology to open up more room for its services.
When Iron Man was released in 2008, Jarvis seemed out of reach; looking back now, Jarvis may even seem a little unimaginative. What is certain is that we are on the eve of a technological explosion: research into underlying technologies such as 5G, artificial intelligence, the Internet of Things, big data, and cloud computing will keep producing products and services that can be used in daily life over the next few years.
In the research and application of voice AI for digital content risk control, NetEase Yidun pursues not only speed; it also hopes its pace is steady and firm, effectively creating value for its customers.