Investigation and Example Analysis of the Combined Use of ASV and TTS Modules in RTVC

2026-02-07 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)06/01 Report--

This article works through an investigation of how the ASV and TTS modules in RTVC can be used together, with a short analysis of each candidate approach.

0. Description

The open question is how to overcome the unseen-speaker problem when the SV vector (SVV) output by ASV is applied to TTS.

Background Description:

Whether the task is M2VoC or a variant of cross-lingual TTS, ASV can be used to obtain a timbre vector.

This vector does not necessarily have to represent timbre. It only needs to cluster tightly for utterances from the same person.

The vector is then combined with the text and takes part in TTS training, so that the TTS model becomes familiar with it.

For a speaker who has never been seen, however, ASV needs to extract the vector more accurately, and TTS needs to have seen more speakers.

One workaround: ASV extracts the vector, the nearest known vector is found, and that one is used in its place.

At training time the extracted vector is the vector of the current sentence, but at inference one can pick 20 sentences at random and take the average of their vectors.
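The inference-time averaging described above can be sketched as follows. This is a minimal illustration, not the RTVC implementation: the function names are hypothetical, the vectors are plain Python lists standing in for ASV outputs, and the L2 normalization before and after averaging is an assumption that the source does not mention.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def average_speaker_vector(utterance_vectors, num_samples=20, seed=None):
    """Average the SV vectors of up to `num_samples` randomly chosen utterances.

    `utterance_vectors` is a list of equal-length vectors, one per sentence
    (hypothetical ASV outputs). Normalizing is an assumption, not from the source.
    """
    if not utterance_vectors:
        raise ValueError("need at least one utterance vector")
    rng = random.Random(seed)
    # Use fewer than num_samples sentences if the speaker has fewer available.
    k = min(num_samples, len(utterance_vectors))
    chosen = rng.sample(utterance_vectors, k)
    # Normalize each vector first so that no single utterance dominates the mean.
    normalized = [l2_normalize(v) for v in chosen]
    dim = len(normalized[0])
    mean = [sum(v[i] for v in normalized) / k for i in range(dim)]
    return l2_normalize(mean)
```

Averaging over several sentences is meant to reduce the sentence-to-sentence variation of the vector, which is one plausible reason the inference procedure differs from training.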

So the next step is to survey the literature and discuss.

1. Summarize the phenomena

Cases where the SVV leads to good results and cases where it leads to bad results are recorded, observed, and given a binary good/bad label.

2. Pre-survey thoughts

2.1. Increased data

Keep the current approach, add more VCTK-like data, and train carefully.

The main contributions would be:

Collecting, processing, and using public datasets

Constructing the final test set

2.2. Find the nearest SVV

Instead of using the extracted SVV itself, look up its nearest neighbor and use that one.
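A minimal sketch of the nearest-SVV lookup, under stated assumptions: the source does not say which distance is used, so cosine similarity is assumed here, and the speaker IDs, the `seen_svvs` dictionary, and the function names are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_seen_speaker(query_svv, seen_svvs):
    """Return (speaker_id, vector, similarity) for the seen speaker closest to the query.

    `seen_svvs` maps a speaker ID to the SVV that TTS saw during training.
    """
    if not seen_svvs:
        raise ValueError("need at least one seen speaker")
    best_id = max(
        seen_svvs,
        key=lambda sid: cosine_similarity(query_svv, seen_svvs[sid]),
    )
    best_vec = seen_svvs[best_id]
    return best_id, best_vec, cosine_similarity(query_svv, best_vec)


# Usage: substitute the unseen speaker's vector with the closest seen one.
seen = {"spk_a": [1.0, 0.0], "spk_b": [0.0, 1.0]}
speaker_id, replacement, similarity = nearest_seen_speaker([0.9, 0.1], seen)
```

The returned similarity may be worth logging, since a low value would suggest that no seen speaker is a good stand-in for the unseen one.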

2.3. Multiple ASVs

A single ASV extraction is not enough when reference audio is scarce, so use several ASVs.

These can be Chinese or English ASVs.

2.4. GST

The SVV is obtained with ASV, but instead of being fed to TTS directly, it is expressed through attention as a weighted sum of several global style tokens (GSTs), and that weighted sum is what TTS receives.
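The weighted-sum idea can be illustrated with a simplified, single-head attention sketch. The assumptions are significant: real GST layers use learned projections and multi-head attention, whereas here the SVV is used directly as the query, the tokens are fixed lists of the same dimension as the SVV, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def svv_to_style_embedding(svv, tokens):
    """Express an SVV as an attention-weighted sum of style tokens.

    Scaled dot-product attention with the SVV as the query and the tokens as
    both keys and values (a simplification: no learned projections).
    Returns (embedding, attention_weights).
    """
    if not tokens:
        raise ValueError("need at least one style token")
    scale = math.sqrt(len(svv))
    scores = [sum(q * k for q, k in zip(svv, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    embedding = [
        sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)
    ]
    return embedding, weights
```

Because the output is always a convex combination of the tokens, an unseen speaker's SVV is mapped into the region spanned by tokens that TTS has already been trained on, which is the appeal of this option for the unseen-speaker problem.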

2.5. ASV Fine-Tune

Let gradients backpropagate into the ASV during TTS training, so that the ASV is fine-tuned as well.

However, the TTS corpus has only on the order of 100 speakers, while the ASV corpus has on the order of 7,000, so this is not easy to train.

3. LibriSpeech TTS

Good cross-lingual work has been done before without involving this many speakers.

Still, try it first and see whether it works.
