
How to Apply Graph Neural Networks in TTS

2025-03-28 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 06/01 Report --

This article is about how to apply graph neural networks in TTS. The editor thinks it is very practical, so it is shared here; I hope you get something out of it after reading. Let's take a look together.

1. GNN concepts

1.1. The concept of a graph neural network

G = {V, E}. Graphs can be directed or undirected, weighted or unweighted, homogeneous or heterogeneous (edges/nodes with different structures or meanings).

Why use GNNs? Because the data carries non-Euclidean structural information.

Euclidean data: for example, a CNN recognizing a cat picture works on a regular grid that simple distances fully describe (no edges are needed).

Graph neural network: learn a state embedding for each node that incorporates information from its neighbor nodes. Neighbors are expressed through edges; once edges are added, the data is upgraded to a graph.
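
A minimal sketch of this idea, assuming plain PyTorch and a toy graph (all names and shapes here are illustrative, not taken from the papers discussed):

```python
import torch

# Toy graph: 4 nodes, directed edges as (source, target) pairs.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
num_nodes, feat_dim = 4, 8

h = torch.randn(num_nodes, feat_dim)         # initial node states
W = torch.nn.Linear(2 * feat_dim, feat_dim)  # mixes self state with neighbors

# One propagation round: every node averages the states of its in-neighbors.
agg = torch.zeros_like(h)
deg = torch.zeros(num_nodes, 1)
for src, dst in edges:
    agg[dst] += h[src]
    deg[dst] += 1
agg = agg / deg.clamp(min=1)

# The new state embedding carries both the node's own and its neighbors' info.
h_next = torch.relu(W(torch.cat([h, agg], dim=-1)))
```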

1.2. Specific structure of GNN

An iterative transition function F is introduced, with H representing the state of the whole graph: H_{t+1} = F(H_t, X). Whether the iteration eventually makes the graph state stable or unstable depends on F (a stable fixed point requires F to be a contraction); the information flow along the edges is the key.

The graph neural network computation is divided into a propagation step and an output step.

The loss can supervise node-level values, edge-level values, or graph-level values; during propagation, node and edge information is exchanged with the whole graph (see the sketch below).
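
Putting the two steps together, a minimal sketch of the propagate-then-output scheme (plain PyTorch; F, G, and the message rule are illustrative assumptions, not any specific paper's code):

```python
import torch

class TinyGNN(torch.nn.Module):
    """Classic two-phase GNN: iterate a transition function F over the
    node states H, then apply an output function G."""
    def __init__(self, dim, steps=3):
        super().__init__()
        self.F = torch.nn.GRUCell(dim, dim)  # transition (propagation step)
        self.G = torch.nn.Linear(dim, 1)     # readout (output step)
        self.steps = steps

    def forward(self, h, edges):
        for _ in range(self.steps):          # iterate toward a stable state
            msg = torch.zeros_like(h)
            for s, d in edges:
                msg[d] = msg[d] + h[s]       # messages flow along edges
            h = self.F(msg, h)               # H_{t+1} = F(messages, H_t)
        node_out = self.G(h)                 # node-level values for the loss
        graph_out = node_out.mean(0)         # graph-level value for the loss
        return node_out, graph_out
```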

2. GraphTTS-1

2.1. The goal of GraphTTS

Modeling prosody

Similar in spirit to the complex linguistic features introduced by NLP

The graph structure matches how expert knowledge analyzes text, so a GNN is the more natural fit.

It directly replaces the original encoder structure.

2.2. GraphTTS structure

Define nodes and edges over the text: English characters are nodes, plus virtual nodes for words and for sentences (period nodes). Edge types: forward (order) edges, backward edges, parent word edges, and parent sentence edges.
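
As a toy illustration of that definition (plain Python; the ids and edge labels are hypothetical, not the paper's code), the graph for the single word "cat." could be built like this:

```python
# Character nodes plus virtual word / sentence (period) nodes.
chars = ['c', 'a', 't']
char_ids = [0, 1, 2]
word_id, sent_id = 3, 4  # virtual nodes appended after the characters

edges = []
for i in range(len(char_ids) - 1):
    edges.append((char_ids[i], char_ids[i + 1], 'forward'))   # order edge
    edges.append((char_ids[i + 1], char_ids[i], 'backward'))  # reverse edge
for c in char_ids:
    edges.append((word_id, c, 'parent_word'))                 # word edge
edges.append((sent_id, word_id, 'parent_sentence'))           # sentence edge
```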

The difference from # boundary tags is that word-boundary information is expressed through the graph structure itself.

From the code point of view, the RNN in the encoder is replaced by a GCN, again with a propagation step and an output step.
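
A minimal sketch of such an encoder, assuming PyG's GCNConv (the class name GraphTextEncoder and all sizes are hypothetical):

```python
import torch
from torch_geometric.nn import GCNConv

class GraphTextEncoder(torch.nn.Module):
    """Text encoder with GCN layers in place of the RNN."""
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.conv1 = GCNConv(dim, dim)        # propagation step 1
        self.conv2 = GCNConv(dim, dim)        # propagation step 2
        self.out = torch.nn.Linear(dim, dim)  # output step

    def forward(self, tokens, edge_index):
        h = self.embed(tokens)                    # [num_nodes, dim]
        h = torch.relu(self.conv1(h, edge_index))
        h = torch.relu(self.conv2(h, edge_index))
        return self.out(h)                        # encoder memory
```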

3. GraphTTS-2

3.1. GAE

Keep the Tacotron encoder, and let a separate GAE module model the information relationship between syntax and prosody.

The input of GAE is boundary information plus text, and its output serves as the memory for attention (it can be concatenated with the encoder output to form an information residual, as sketched below).
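
A minimal sketch of that splice (toy shapes, hypothetical projection layer):

```python
import torch

T = 20                                # text length (toy value)
encoder_out = torch.randn(1, T, 256)  # Tacotron encoder output
gae_out = torch.randn(1, T, 128)      # GAE output (prosody-oriented)

proj = torch.nn.Linear(256 + 128, 256)

# Concatenate along the feature axis and project back: the GAE branch acts
# as an information residual on top of the encoder output.
memory = proj(torch.cat([encoder_out, gae_out], dim=-1))  # attention memory
```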

4. Experimental results of the two structures

With the graph, the MOS results are decent.

GGNN works better than GCN.

When the graph is used directly, attention is prone to errors, so GAE comes out best in every respect.

In fact, in the GAE model, the structure and inputs of the GAE module naturally lend themselves to capturing prosodic information, which is expressed together with the pronunciation information from the encoder. This is not really a feature-decoupling idea but a post-net residual idea, and the structure can be strengthened along those lines.

5. Doubts

Where are the style sequence and style embedding spliced into the encoder features?

6. GraphSpeech

6.1 Core work

Relation Encoder: models the grammatical relationship between any two words. The syntax dependency tree is turned into a syntax dependency graph (one-way edges become bidirectional, with different weights per direction). The shortest path between two nodes in the graph represents the relationship between the two words (distance is an intuitive measure of the gap). A word's distance to itself is defined by adding a self edge, and at the char level a character takes the distance of the word it belongs to. This yields a dependency relation sequence R_ij (and R_ii) for any two words, i.e. N * (N - 1) sequences; each sequence is passed through the same Bi-GRU to compute the relation encoding C_ij (and C_ii).
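
A toy sketch of the relation-sequence idea, assuming networkx and PyTorch (node ids stand in for the dependency labels a real implementation would embed):

```python
import networkx as nx
import torch

# Toy dependency edges (head, dependent); nx.Graph makes them bidirectional.
G = nx.Graph([(1, 0), (1, 2), (1, 3)])
for n in list(G.nodes):
    G.add_edge(n, n)  # self edges, so a word's distance to itself is defined

rel_embed = torch.nn.Embedding(10, 16)  # embeds the ids along a path
bigru = torch.nn.GRU(16, 16, bidirectional=True, batch_first=True)

def relation_encoding(i, j):
    """C_ij: encode the shortest path R_ij with the shared Bi-GRU."""
    path = nx.shortest_path(G, i, j)                  # R_ij
    seq = rel_embed(torch.tensor(path)).unsqueeze(0)  # [1, len(path), 16]
    _, h_n = bigru(seq)                               # [2, 1, 16]
    return torch.cat([h_n[0], h_n[1]], dim=-1)        # [1, 32]

C_02 = relation_encoding(0, 2)  # path 0 -> 1 -> 2 through the head word
```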

Graph Encoder: modifies the Transformer to make attention syntax-aware; C_ij adjusts the dot-product score (or additive score), which amounts to a more accurate positional encoding.
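
A minimal sketch of such a syntax-aware score (toy shapes): the relation encoding C_ij shifts the key before the dot product, much like a relative positional encoding:

```python
import torch

T, d = 6, 32
q = torch.randn(T, d)     # queries
k = torch.randn(T, d)     # keys
v = torch.randn(T, d)     # values
C = torch.randn(T, T, d)  # relation encodings C_ij from the Relation Encoder

# Syntax-aware dot score: key j is shifted by C_ij before the dot product,
# so attention reflects syntactic distance rather than surface order alone.
scores = torch.einsum('id,ijd->ij', q, k.unsqueeze(0) + C) / d ** 0.5
attn = torch.softmax(scores, dim=-1)
out = attn @ v
```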

7. Simplification of the idea

7.1 GCN in TTS

Yixuan's idea: use a GCN directly on word dependencies, i.e. phoneme + bert_out + dependency -> linguistic feature (but this approach is harder than GraphSpeech and fails to train).

A GCN that carries only word parent-node information is hard to tune, so this method needs the GCN structure and weights to be simplified.

First determine the total set of edge classes, then determine which edges share weights (count as the same edge) within each class. Because the grammatical dependencies of text are very regular and uniform, this can be exploited to simplify the edge weights of the graph neural network.

The part of speech of each word should also be reflected in its node, with a certain degree of dimension sharing.

This could be called TTS-Simplify-GCN; the attention analogous to the one in TTS does not need to be that powerful.
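
One way to realize "edges of the same class share weights" is an R-GCN-style layer, which keeps one weight matrix per edge class; a minimal sketch with PyG's RGCNConv (toy graph, hypothetical sizes):

```python
import torch
from torch_geometric.nn import RGCNConv

num_nodes, dim, num_edge_classes = 5, 64, 4

x = torch.randn(num_nodes, dim)
# edge_index: [2, num_edges]; edge_type: one class id per edge.
edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 2, 3, 4]])
edge_type = torch.tensor([0, 1, 0, 2])  # same class -> same weight matrix

conv = RGCNConv(dim, dim, num_relations=num_edge_classes)
h = conv(x, edge_index, edge_type)
```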

8. Implementation details

8.1. GNN libraries

PyG

DGL
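
For reference, a minimal PyG example (toy graph) showing the basic Data / GCNConv workflow:

```python
import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

# A tiny 3-node graph; edge_index holds directed (source, target) pairs.
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]])
x = torch.randn(3, 16)  # node features
data = Data(x=x, edge_index=edge_index)

conv = GCNConv(16, 16)
h = conv(data.x, data.edge_index)  # one propagation step
print(h.shape)                     # torch.Size([3, 16])
```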

The above is how to apply graph neural networks in TTS. The editor believes some of these knowledge points may come up in everyday work. I hope you learned something from this article. For more details, please follow the industry information channel.
