This article covers how graph neural networks (GNNs) can be applied in TTS, walking through GraphTTS, GraphSpeech, and a simplified GCN idea.
1. GNN concept
1.1. The concept of a graph neural network
A graph G = {V, E} can be directed or undirected, weighted or unweighted, homogeneous or heterogeneous (edges/nodes with different structures or meanings).
Why use a GNN? Because the data carries non-Euclidean relational information.
Euclidean data: for example, a CNN classifying a cat image operates on a regular grid that simple distances can describe (no explicit edges are needed).
A graph neural network learns a state embedding for each node that incorporates information from its neighbor nodes. Neighbors are represented by edges; once edges are added, the data is upgraded to a graph.
1.2. Specific structure of GNN
Introduce an iterative transition function F over the hidden states, H^{t+1} = F(H^t, X); whether the iteration settles into a stable state depends on F (convergence is the key). H represents the state of the whole graph.
A graph neural network is divided into a propagation step and an output step.
The loss can supervise node values, edge values, or node and edge values aggregated over the whole graph.
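To make the propagation/output split concrete, here is a minimal PyTorch sketch (not from any of the papers discussed; all names and dimensions are illustrative): the propagation step sums neighbor states along edges, and the output step produces a per-node value plus a mean-pooled graph-level readout that a loss could supervise.

```python
import torch
import torch.nn as nn

class TinyGNNLayer(nn.Module):
    """One propagation step: aggregate neighbor states along edges."""
    def __init__(self, dim):
        super().__init__()
        self.lin_self = nn.Linear(dim, dim)
        self.lin_neigh = nn.Linear(dim, dim)

    def forward(self, h, edge_index):
        src, dst = edge_index            # (2, E) tensor of (source, target) pairs
        agg = torch.zeros_like(h)
        agg.index_add_(0, dst, h[src])   # sum each source state into its target node
        return torch.relu(self.lin_self(h) + self.lin_neigh(agg))

class TinyGNN(nn.Module):
    def __init__(self, dim, n_steps=3):
        super().__init__()
        self.layers = nn.ModuleList([TinyGNNLayer(dim) for _ in range(n_steps)])
        self.node_head = nn.Linear(dim, 1)

    def forward(self, h, edge_index):
        for layer in self.layers:        # propagation steps
            h = layer(h, edge_index)
        node_out = self.node_head(h)     # output step: a value per node
        graph_out = h.mean(dim=0)        # graph-level readout (whole-graph value)
        return node_out, graph_out

# toy usage: 4 nodes, 3 undirected edges listed in both directions
h = torch.randn(4, 16)
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3],
                           [1, 0, 2, 1, 3, 2]])
node_out, graph_out = TinyGNN(16)(h, edge_index)
```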
2. GraphTTS-1
2.1. The goal of GraphTTS
Modeling prosody
Similar to injecting the complex linguistic features used in NLP.
The graph structure is consistent with how expert knowledge analyzes text, so a GNN is a natural fit.
It directly replaces the original encoder structure.
2.2. GraphTTS structure
Define nodes and edges over the text: English letters are nodes, and virtual nodes are added for words and for periods (sentence boundaries). Edge types: forward edges, backward edges, parent-word edges, and parent-sentence edges.
The difference from inserting # boundary tags is that word-boundary information is expressed through the graph structure itself.
From the code's point of view, the RNN in the encoder is replaced by a GCN, with a propagation step and an output step, as sketched below.
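A hedged sketch of this encoder swap, assuming PyTorch Geometric; the layer stack, dimensions, and the shared embedding table for real and virtual nodes are illustrative, not the paper's exact recipe:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

class GraphTextEncoder(nn.Module):
    """GraphTTS-style text encoder: char/word/sentence nodes -> GCN stack."""
    def __init__(self, n_symbols, dim=256, n_layers=3):
        super().__init__()
        # one embedding table for characters and the virtual word/sentence nodes
        self.embed = nn.Embedding(n_symbols, dim)
        self.convs = nn.ModuleList([GCNConv(dim, dim) for _ in range(n_layers)])

    def forward(self, node_ids, edge_index):
        # edge_index carries forward, backward, parent-word and parent-sentence edges
        h = self.embed(node_ids)
        for conv in self.convs:              # propagation steps
            h = torch.relu(conv(h, edge_index))
        return h                             # output step: encoded node states
```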
3. GraphTTS-2
3.1. GAE
Keep the Tacotron encoder, and design a separate GAE (graph auxiliary encoder) module to model the relationship between syntax and prosody.
The input of GAE is boundary information plus text; its output serves as the attention memory (it can be concatenated with the encoder output to form an information residual, as below).
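A minimal sketch of this "information residual" reading, with hypothetical shapes: the GAE output is concatenated with the Tacotron encoder output to form the memory the decoder attends over.

```python
import torch

# hypothetical shapes: 100 text positions, 256-dim streams
encoder_out = torch.randn(100, 256)   # kept Tacotron encoder (pronunciation)
gae_out = torch.randn(100, 256)       # GAE module (syntax/prosody)

# splice the two streams; the decoder attends over this memory
memory = torch.cat([encoder_out, gae_out], dim=-1)   # (100, 512)
```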
4. Experimental results of the two structures
Using the graph structure, the MOS holds up.
GGNN works better than GCN.
With the graph encoder alone, attention is error-prone, so GAE is the best across the board.
In fact, in the GAE model the natural structure of the GAE module and its input help capture prosodic information, which is expressed together with the encoder's pronunciation information. This is not really a feature-decoupling idea but a post-net residual idea, and the structure could be strengthened along those lines.
5. Doubts
Where are the style sequence and style embedding concatenated into the encoder features?
6. GraphSpeech
6.1. Core work
Relation Encoder: models the syntactic relation between any two words. The dependency tree is converted into a dependency graph (each one-way edge becomes bidirectional, with different weights for the two directions). The shortest path between two nodes in the graph represents the relation between the two words (distance is an intuitive measure of their gap). A word's distance to itself is defined by adding a self edge, and at the char level each character inherits the distance of the word it belongs to. In the end, for any pair of words we obtain a dependency-relation sequence (N × (N−1) sequences, plus the self relations): each R_ij and R_ii is passed through a shared Bi-GRU to compute C_ij and C_ii.
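A small sketch of the relation extraction under these definitions, assuming networkx and a hypothetical three-word parse; edge labels stand in for the per-direction weights, and the resulting R_ij sequences would then feed the shared Bi-GRU:

```python
import networkx as nx

# hypothetical parse of "the cat sat": "sat" (2) heads "the" (0) and "cat" (1)
arcs = [(2, 0), (2, 1)]
n_words = 3

g = nx.DiGraph()
g.add_nodes_from(range(n_words))
for head, dep in arcs:
    g.add_edge(head, dep, label="down")   # original one-way tree edge...
    g.add_edge(dep, head, label="up")     # ...made bidirectional, second direction distinct
for i in range(n_words):
    g.add_edge(i, i, label="self")        # self edge defines R_ii

def relation_sequence(i, j):
    """R_ij: edge labels along the shortest path between words i and j."""
    if i == j:
        return ["self"]
    path = nx.shortest_path(g, i, j)
    return [g[u][v]["label"] for u, v in zip(path, path[1:])]

print(relation_sequence(0, 1))   # ['up', 'down']: up to the head, down to "cat"
```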
Graph Encoder: improves the Transformer with syntax-aware attention, where C_ij modifies the dot-product (or additive) attention score; this is equivalent to a more precise positional encoding (see the sketch below).
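A hedged sketch of such syntax-aware scoring in PyTorch; this is one plausible way to fold C_ij into the dot score, in the spirit of relative positional encodings, not GraphSpeech's exact formulation:

```python
import math
import torch

def syntax_aware_scores(q, k, c):
    """q, k: (N, d) queries/keys; c: (N, N, d) relation encodings C_ij."""
    d = q.size(-1)
    content = q @ k.t()                          # standard dot-product score
    relation = torch.einsum("id,ijd->ij", q, c)  # bias from the syntax relation
    return (content + relation) / math.sqrt(d)

N, d = 5, 64
scores = syntax_aware_scores(torch.randn(N, d), torch.randn(N, d),
                             torch.randn(N, N, d))
attn = scores.softmax(dim=-1)   # (N, N) syntax-aware attention weights
```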
7. Simplification of the idea
7.1. GCN in TTS
According to Yixuan's idea, a GCN is used directly to exploit word dependencies: phoneme + bert_out + dependency -> linguistic feature (but this approach is harder than GraphSpeech and fails to train).
A GCN that uses only parent-word information is hard to tune, so this method requires simplifying the GCN's structure and weights.
Determine the total number of edge classes, then decide which edges share weights (edges in the same class). Because the syntactic dependencies of text are very regular and uniform, this can be used to simplify the edge weights of the graph neural network.
The words' part of speech should also be reflected in the node features, with a certain degree of dimension sharing.
This could be called TTS-Simplify-GCN; the attention it needs, by analogy with GraphSpeech's, does not have to be that powerful (see the sketch below).
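One concrete way to realize "edges of the same class share weights" is a relational GCN; a sketch assuming PyG's RGCNConv, with the edge classes and sizes purely illustrative:

```python
import torch
from torch_geometric.nn import RGCNConv

NUM_EDGE_CLASSES = 4   # e.g. forward, backward, parent-word, self (illustrative)

# edges of the same class share one weight matrix, simplifying the GCN
conv = RGCNConv(in_channels=256, out_channels=256,
                num_relations=NUM_EDGE_CLASSES)

x = torch.randn(10, 256)                      # node features (phoneme + POS dims)
edge_index = torch.randint(0, 10, (2, 30))    # toy edges
edge_type = torch.randint(0, NUM_EDGE_CLASSES, (30,))

out = conv(x, edge_index, edge_type)          # (10, 256) updated node features
```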
8. Implementation details
8.1. GNN libraries
PyG (PyTorch Geometric)
DGL (Deep Graph Library)
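For orientation, a minimal PyG usage sketch (toy graph, illustrative sizes):

```python
import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

# toy 3-node graph with bidirectional edges
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]])
data = Data(x=torch.randn(3, 8), edge_index=edge_index)

out = GCNConv(8, 16)(data.x, data.edge_index)   # (3, 16) node embeddings
```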