
How to mine further information from PPI network

2025-04-17 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

Many newcomers to network analysis are unsure how to mine further information from a PPI network. This article summarizes the basic network concepts and the community-detection approaches needed to do so.

After retrieving protein interaction information from a database, we can build a protein-protein interaction (PPI) network. Such a network is very complex, with a large number of nodes and edges, and taken as a whole it rarely yields biologically valuable information. We therefore need algorithms that dig deeper into its structure.

As the throughput of data in various databases keeps growing, network-based analysis methods are becoming more and more popular; examples include protein interaction networks, gene co-expression networks, transcription factor regulatory networks, and pathway networks. To better understand the data mining algorithms that follow, we should first have a basic understanding of the properties of a network.

From the perspective of data structures, what we call a network is a graph. "Network" is the more intuitive description: a set of points and the connections between them. In algorithms, a network is usually described precisely with an adjacency matrix, as shown below.

Depending on whether the edges have a direction, a network is either a directed graph or an undirected graph. In an undirected graph, two nodes joined by an edge act on each other; in a gene co-expression network, for example, an edge means the two genes are co-expressed. In a directed graph the edges have a direction; in a transcription factor regulatory network, for example, a transcription factor regulates a gene, so the edge points from the transcription factor to the gene.

PPI networks are usually treated as undirected graphs because protein interactions are mutual.

Besides direction, a network can be classified by whether its edges carry values, as weighted or unweighted. Take the gene co-expression network as an example. In the unweighted graph, an edge is a qualitative statement: the two genes tend to be co-expressed, and a plain line joins them. In the weighted graph, an edge is a quantitative statement: its value is the co-expression coefficient between the two genes, and in a visualization edges with different values are drawn with different thickness.

The adjacency matrix can describe any of these networks. As shown in the figure above, it is a two-dimensional square matrix whose rows and columns both represent the nodes of the graph. In an unweighted graph, 0 means there is no edge between two nodes and 1 means there is one; in a weighted graph, each cell holds the value of the corresponding edge.
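The adjacency matrix described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name adjacency_matrix and the four-node toy network are assumptions made for the example and are not taken from the article's figure.

```python
def adjacency_matrix(nodes, edges, directed=False):
    """Build an adjacency matrix (list of lists) from an edge list.

    Each edge is (source, target) or (source, target, weight).
    An edge without a weight is stored as 1, as in an unweighted graph.
    """
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for edge in edges:
        source, target = edge[0], edge[1]
        weight = edge[2] if len(edge) == 3 else 1
        matrix[index[source]][index[target]] = weight
        if not directed:
            # An undirected edge is symmetric: mirror it across the diagonal.
            matrix[index[target]][index[source]] = weight
    return matrix


# Hypothetical toy network used only for illustration.
nodes = ["A", "B", "C", "D"]
unweighted = adjacency_matrix(nodes, [("A", "B"), ("B", "C"), ("C", "D")])
weighted = adjacency_matrix(nodes, [("A", "B", 0.9), ("B", "C", 0.4)])
for row in unweighted:
    print(row)
```

For an undirected network such as a PPI network the matrix is symmetric; passing directed=True fills only the cell for the source row and target column.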

To work with networks, you need to understand the following basic concepts.

1. Degree

A network is composed of nodes and edges. The number of edges connected to a node is called the degree of that node. In a directed graph, the degree is split by edge direction into in-degree (edges pointing to the node) and out-degree (edges leaving the node), as shown below.

The number marked on each node in the figure is the degree of that node.
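Counting degrees from an edge list is straightforward. The sketch below assumes the network is given as a list of node pairs; the function names and the example edges are hypothetical and do not reproduce the article's figure.

```python
from collections import Counter


def degrees(edges):
    """Degree of each node in an undirected graph given as (u, v) pairs."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)


def in_out_degrees(edges):
    """In-degree and out-degree in a directed graph given as (source, target) pairs."""
    in_degree, out_degree = Counter(), Counter()
    for source, target in edges:
        out_degree[source] += 1  # the edge leaves the source
        in_degree[target] += 1   # the edge enters the target
    return dict(in_degree), dict(out_degree)


# Hypothetical edge list used only for illustration.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
print(degrees(edges))
in_degree, out_degree = in_out_degrees(edges)
print(in_degree, out_degree)
```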

2. Shortest path

A shortest path is a path of minimum length between two nodes. In a network there can be many paths from one node to another; in an unweighted network the one with the fewest edges is the shortest path, and its number of edges is the distance between the two nodes, as shown below.

In the figure above, the shortest path from A to B has length 5.
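In an unweighted network, the shortest path length can be found with a breadth-first search. The following is a minimal sketch under that assumption; the adjacency dictionary is a hypothetical example and not the network from the figure.

```python
from collections import deque


def shortest_path_length(adjacency, start, goal):
    """Number of edges on a shortest path, or None if goal is unreachable."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return distance[node]
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                # The first visit in breadth-first order gives the shortest distance.
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return None


# Hypothetical undirected network stored as neighbor lists.
adjacency = {
    "A": ["C", "D"],
    "B": ["E"],
    "C": ["A", "E"],
    "D": ["A", "E"],
    "E": ["B", "C", "D"],
    "Z": [],
}
print(shortest_path_length(adjacency, "A", "B"))
```

For weighted networks, where edges have different lengths, a different algorithm such as Dijkstra's would be needed.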

3. Closeness centrality

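This statistic measures the importance of a node and is defined in terms of shortest paths: the closeness centrality of a node is the reciprocal of its average shortest-path distance to all other nodes, so a node that is close to every other node scores high.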
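The original formula image is not preserved here. A commonly used form of closeness centrality, given as an assumption about what the article showed, is the following, where n is the number of nodes and d(v, u) is the shortest-path distance between nodes v and u in a connected network:

```latex
C(v) = \frac{n - 1}{\sum_{u \neq v} d(v, u)}
```

With this normalization the value lies between 0 and 1, and it equals 1 for a node that is directly connected to every other node.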

4. Betweenness centrality

Like closeness centrality, betweenness centrality characterizes the importance of a node. It counts how many shortest paths between pairs of other nodes pass through the node; a node that lies on many shortest paths scores high.

In the figure above, A can still reach E if either B or C is deleted, but not if D is deleted, so D is more important.
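The example above can be checked numerically. The sketch below computes unnormalized betweenness centrality, the sum over node pairs (s, t) of the fraction of shortest s-t paths that pass through a node. The edge set A-B, A-C, B-D, C-D, D-E is an assumption inferred from the description of the figure, and the function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def bfs_counts(adjacency, source):
    """Return (distance, number of shortest paths) from source to each reachable node."""
    distance = {source: 0}
    paths = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                paths[neighbor] = 0
                queue.append(neighbor)
            if distance[neighbor] == distance[node] + 1:
                # Every shortest path to node extends to a shortest path to neighbor.
                paths[neighbor] += paths[node]
    return distance, paths


def betweenness(adjacency):
    """Unnormalized betweenness centrality for an undirected, unweighted graph."""
    info = {node: bfs_counts(adjacency, node) for node in adjacency}
    score = {node: 0.0 for node in adjacency}
    for s, t in combinations(adjacency, 2):
        dist_s, paths_s = info[s]
        if t not in dist_s:
            continue  # s and t are not connected
        dist_t, paths_t = info[t]
        for v in adjacency:
            if v in (s, t) or v not in dist_s or v not in dist_t:
                continue
            # v lies on a shortest s-t path only if the two legs add up exactly.
            if dist_s[v] + dist_t[v] == dist_s[t]:
                score[v] += paths_s[v] * paths_t[v] / paths_s[t]
    return score


# Assumed edges for the figure: A-B, A-C, B-D, C-D, D-E.
adjacency = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
scores = betweenness(adjacency)
print(scores)
```

Under this assumed edge set, D receives the highest score, which agrees with the observation that removing D disconnects A from E.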

5. Density

The density is the ratio of the actual number of edges to the theoretical maximum number of edges in the network. For an undirected network with n nodes, the maximum is reached when every pair of nodes is connected, giving n(n-1)/2 edges, as shown below.

Density measures how densely connected a network is.
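As a small worked example, density can be computed directly from the node and edge counts. The sketch assumes an undirected network without self-loops or duplicate edges; the function name is hypothetical.

```python
def density(node_count, edge_count):
    """Density of an undirected simple graph: edges / (n * (n - 1) / 2)."""
    if node_count < 2:
        return 0.0  # convention chosen here: no pair of nodes, so no possible edge
    max_edges = node_count * (node_count - 1) / 2
    return edge_count / max_edges


print(density(4, 6))  # 4 nodes with all 6 possible edges
print(density(5, 5))  # 5 nodes with 5 of the 10 possible edges
```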

6. Clustering Coefficient

The clustering coefficient, also known as transitivity, is similar to density and has two definitions. The first is the local clustering coefficient, which is defined for a single node: its value is the density of the subnetwork formed by the node's direct neighbors, as shown below.

In the first network in the figure above, all nodes form a clique, that is, a fully connected graph in which every pair of nodes is joined by an edge. The local clustering coefficient measures how close the subnetwork of neighboring nodes is to a fully connected graph. It ranges from 0 to 1; the closer it is to 1, the closer the neighborhood is to a fully connected graph.

On this basis, a whole network also has an average clustering coefficient: compute the local clustering coefficient of every node and take the mean.
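The local and average clustering coefficients can be sketched as follows. The network is a hypothetical example stored as neighbor sets, and a node with fewer than two neighbors is given the value 0, which is a convention assumed here rather than something stated in the article.

```python
from itertools import combinations


def local_clustering(adjacency, node):
    """Density of the subnetwork formed by the direct neighbors of node."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than two neighbors
    # Count the edges that actually exist among the neighbors.
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
    return links / (k * (k - 1) / 2)


def average_clustering(adjacency):
    """Mean of the local clustering coefficients of all nodes."""
    return sum(local_clustering(adjacency, n) for n in adjacency) / len(adjacency)


# Hypothetical undirected network: a triangle A-B-C plus a pendant node D.
adjacency = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
print(local_clustering(adjacency, "A"))
print(average_clustering(adjacency))
```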

The second definition, the global clustering coefficient, applies to the whole network. It is defined in terms of triplets, that is, groups of three connected nodes, as shown below.

As shown in the figure above, if the three nodes form a closed triangle (all three edges present), the triplet is called closed; if one of the edges is missing, it is called open.

There are two ways to define the global clustering coefficient.
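The original formulas are not preserved here. The two forms most often given, stated as an assumption about what the article showed, count either the closed three-node groups (closed triplets) among all connected three-node groups, or the triangles in the network:

```latex
C = \frac{\text{number of closed triplets}}{\text{number of closed triplets} + \text{number of open triplets}}
  = \frac{3 \times \text{number of triangles}}{\text{number of all triplets}}
```

The two forms agree because every triangle contains three closed triplets, one centered on each of its nodes.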

Studies have found that many real-world networks are scale-free. In a scale-free network most nodes have a very low degree and only a few nodes have a very high degree, as shown below.

The network in the figure above is a scale-free network. Only the yellow nodes have a high degree; the blue nodes have a very low degree, and most of the nodes in the network are blue. Plotting the node degree distribution of such a network gives the following trend.

The abscissa is the degree and the ordinate is the number of nodes. Nodes with very low degree are the majority, and nodes with high degree are few. This is only a qualitative description. To describe the distribution precisely, the concept of a power-law distribution is used: the number of nodes falls off as a power of the degree.

X represents the degree and Y the corresponding number of nodes. Interestingly, taking the logarithm of both X and Y turns the power law into a linear equation.
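The original expression and its derivation are not preserved here. A standard way to write the power law and its logarithmic form is sketched below; the constant c and the exponent \gamma are symbols assumed for this illustration.

```latex
Y = c \, X^{-\gamma}
\quad\Longrightarrow\quad
\log Y = \log c - \gamma \log X
```

On log-log axes this is a straight line with slope -\gamma and intercept \log c.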

The distribution after taking the logarithm is as follows

After the logarithmic transformation, the coefficients can be determined by linear fitting. This is also the principle behind choosing the best power in WGCNA: compare the R² of the linear fit under different power values and choose a value with a good fit.
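The linear fit on log-transformed values can be sketched with ordinary least squares. This is only an illustration of the principle and not the procedure WGCNA itself uses; the function name fit_power_law and the synthetic degree counts are assumptions made for the example.

```python
import math


def fit_power_law(degree_counts):
    """Fit log10(count) = intercept + slope * log10(degree) by least squares.

    degree_counts maps a degree to the number of nodes with that degree.
    Returns (slope, intercept, r_squared).
    """
    points = [(math.log10(d), math.log10(c))
              for d, c in degree_counts.items() if d > 0 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 of a simple linear fit equals the squared correlation coefficient.
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 1.0
    return slope, intercept, r_squared


# Synthetic data following count = 1000 * degree^-2 exactly.
degree_counts = {d: 1000 * d ** -2 for d in (1, 2, 4, 8, 16)}
slope, intercept, r_squared = fit_power_law(degree_counts)
print(slope, intercept, r_squared)
```

For data that follow a power law closely, the slope estimates the negative exponent and R² is close to 1; a low R² suggests the network is not well described as scale-free.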

A complex network contains some regions of high density, which are called communities or modules, as shown below.

Within a community the connections are dense, while connections between communities are sparse. A community is considered a biologically meaningful set of nodes. In a PPI network, a module usually has one of the following two biological meanings.

Protein complex

A protein complex is composed of multiple proteins that assemble and then perform a biological function together.

Functional module

A functional module is a set of proteins, such as the proteins in the same pathway, whose members interact more closely with one another.

So after building the network, we need to identify its communities. A variety of algorithms are available; in PPI networks, the following are commonly used.

MCODE

MCL

Newman-Girvan fast greedy algorithm

