What are the methods and principles of constructing evolutionary trees in the R language?


Updated 2025-04-07. From: SLTechnology News&Howtos


This article explains the methods and principles of constructing evolutionary (phylogenetic) trees: how the data are prepared, and how the main tree-building methods work.

Methods and Principles of Constructing Phylogenetic Trees

Construction of Phylogenetic Tree

(1) Data preparation

Phylogenetic analysis describes the evolutionary relationships between species or genes by constructing phylogenetic trees. The trees can be built from the nucleotide sequences of homologous DNA or from the amino acid sequences of homologous proteins.

(2) Sequence alignment

The original sequences must be aligned and corrected to ensure that homologous positions are compared and that the resulting phylogenetic relationships are reliable. Software for automatic sequence alignment includes ClustalW, MAFFT, MUSCLE, etc.

(3) Selecting conserved regions for tree construction

Selecting conserved regions is an important step in phylogenetic analysis. Either the conserved sites or the full-length sequences can be used, but when the sequences differ greatly it is recommended to keep only the conserved regions for tree construction. Common software for extracting conserved regions includes Gblocks, MEME, etc.
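As a rough illustration of what keeping only conserved regions means, the sketch below filters the columns of a toy alignment. It is not how Gblocks or MEME work internally. The function names, the gap symbol "-", and the min_identity threshold are all assumptions made for this example.

```python
from collections import Counter

def conserved_columns(alignment, min_identity=0.8):
    """Return indices of gap-free columns whose most common residue
    reaches min_identity (a hypothetical threshold, not a tool default)."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    keep = []
    for col in range(length):
        residues = [seq[col] for seq in alignment]
        if "-" in residues:
            continue  # drop every column that contains a gap
        top_count = Counter(residues).most_common(1)[0][1]
        if top_count / len(residues) >= min_identity:
            keep.append(col)
    return keep

def trim_alignment(alignment, min_identity=0.8):
    """Keep only the conserved columns of each aligned sequence."""
    keep = conserved_columns(alignment, min_identity)
    return ["".join(seq[c] for c in keep) for seq in alignment]

# Toy alignment: column 3 contains a gap and is removed.
aln = ["ATG-CA", "ATGTCA", "ATGACT", "ACGTCA"]
trimmed = trim_alignment(aln, min_identity=0.75)
print(trimmed)
```

Raising min_identity to 1.0 would keep only the fully invariant columns. In practice the threshold is a trade-off between removing noise and discarding informative sites.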

Selection of methods for constructing evolutionary trees

ML: Maximum Likelihood
NJ: Neighbor-Joining
MP: Maximum Parsimony
ME: Minimum Evolution
Bayesian inference
UPGMA (not commonly used)

Prerequisite of UPGMA: the amount of divergence per generation is assumed to be constant throughout evolution, that is, the rate of base or amino acid substitution is the same in all lineages.

UPGMA calculation principle and process:

(1) From the computed distance coefficients, arrange the pairwise distances between all t taxa being compared into a t×t square matrix, that is, build the distance matrix M.

(2) For a given distance matrix, find the minimum distance Dpq.

(3) Define the branching depth Lpq=Dpq/2 between group p and group q.

(4) If p and q are the last two groups, the clustering process is complete; otherwise, merge p and q into a new group r.

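(5) Define and calculate the distance from the new group r to every other group i (i ≠ p, q). When p and q contain the same number of taxa this is Dir=(Dpi+Dqi)/2; in general, UPGMA weights each term by group size, Dir=(np·Dpi+nq·Dqi)/(np+nq), where np and nq are the numbers of taxa in p and q.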

(6) Return to step (1): remove p and q from the matrix and add the new group r, which reduces the order of the matrix by one. Repeat until only the final group remains.
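The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. The function name upgma, the Newick-style output, and the toy matrix are assumptions made for the example. The update in step (5) weights by group size, which reduces to (Dpi+Dqi)/2 when p and q are the same size.

```python
def upgma(names, dist):
    """Cluster taxa with UPGMA; return (newick_string, merge_depths).

    names: list of taxon labels; dist: symmetric matrix (list of lists).
    """
    # Each active group: id -> (newick label, number of taxa, node depth)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    # Step 1: distance matrix stored as {(i, j): Dij} with i < j
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    depths = []
    while len(clusters) > 1:
        # Step 2: find the pair (p, q) with the minimum distance Dpq
        (p, q), dpq = min(d.items(), key=lambda kv: kv[1])
        # Step 3: branching depth Lpq = Dpq / 2
        depth = dpq / 2
        depths.append(depth)
        (lp, np_, hp), (lq, nq, hq) = clusters[p], clusters[q]
        label = f"({lp}:{depth - hp:.3f},{lq}:{depth - hq:.3f})"
        # Step 5: distance from the new group r to every other group i
        new_d = {}
        for i in clusters:
            if i in (p, q):
                continue
            dpi = d[(min(p, i), max(p, i))]
            dqi = d[(min(q, i), max(q, i))]
            new_d[(i, next_id)] = (np_ * dpi + nq * dqi) / (np_ + nq)
        # Steps 4 and 6: merge p and q into r and shrink the matrix
        d = {k: v for k, v in d.items() if p not in k and q not in k}
        d.update(new_d)
        del clusters[p], clusters[q]
        clusters[next_id] = (label, np_ + nq, depth)
        next_id += 1
    root_label = next(iter(clusters.values()))[0]
    return root_label + ";", depths

# Toy clock-like distances: A and B are closest, D is the most distant.
names = ["A", "B", "C", "D"]
dist = [[0, 2, 4, 6],
        [2, 0, 4, 6],
        [4, 4, 0, 6],
        [6, 6, 6, 0]]
tree, depths = upgma(names, dist)
print(tree)
print(depths)
```

Because the toy matrix is clock-like, the branch lengths in the output add up to the same root-to-tip distance for every taxon. With distances that violate the constant-rate assumption the method would still return a tree, but it may be the wrong one.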

The UPGMA method is intuitive, simple, fast, and widely used. Its disadvantage is that it introduces systematic errors into the tree when rates of molecular evolution differ substantially among lineages, because this violates the constant-rate assumption.

NJ (Neighbor Joining Method)

NJ is a method for inferring additive trees. Conceptually it is similar to UPGMA, but it differs in four ways:

a. The NJ method does not require the distances to be ultrametric, but it does require the data to satisfy, or come very close to satisfying, the additivity condition; that is, the method requires corrected distances.

b. During clustering, the neighbor-joining method connects the nodes between taxa, not the taxa themselves.

c. In the NJ method, the original distance data are used to estimate the distance matrix between all terminal nodes of the phylogenetic tree, and the corrected distances are used to determine the order in which nodes are joined.

d. When reconstructing the phylogenetic tree, the NJ method drops the constant-rate assumption made by UPGMA and allows the rate of divergence to differ between lineages.
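The two distance properties contrasted in point a can be checked directly. The sketch below assumes the standard definitions: the three-point condition for ultrametric distances and the four-point condition for additive distances. The function names and both toy matrices are invented for the illustration.

```python
from itertools import combinations

def is_ultrametric(d, tol=1e-9):
    """Three-point condition: in every triple of taxa, the two largest
    of the three pairwise distances are equal."""
    for i, j, k in combinations(range(len(d)), 3):
        a, b, c = sorted([d[i][j], d[i][k], d[j][k]])
        if abs(c - b) > tol:
            return False
    return True

def is_additive(d, tol=1e-9):
    """Four-point condition: in every quartet, the two largest of the
    three sums of paired distances are equal."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted([d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Clock-like distances (equal rates): ultrametric and additive.
clock = [[0, 2, 4, 6],
         [2, 0, 4, 6],
         [4, 4, 0, 6],
         [6, 6, 6, 0]]

# Path lengths on a tree with branches A=1, B=4, C=2, D=3 and an
# internal branch of 1 (unequal rates): additive but not ultrametric.
unequal = [[0, 5, 4, 5],
           [5, 0, 7, 8],
           [4, 7, 0, 5],
           [5, 8, 5, 0]]

print(is_ultrametric(clock), is_additive(clock))
print(is_ultrametric(unequal), is_additive(unequal))
```

The second matrix is the kind of data where UPGMA's assumption fails but NJ's requirement still holds.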

(1) For each terminal node i in the given distance matrix, calculate its net divergence Ri from all other taxa (t is the number of taxa in the matrix):

Ri = sum of Dik over all k ≠ i

(2) Build the rate-corrected distance matrix M, whose elements are given by:

Mij = Dij - (Ri + Rj)/(t - 2)

(3) Define a new node u with three branches, connected to node i, node j, and the rest of the tree, where i and j are the pair with the smallest value Mij in the rate-corrected matrix. The branch lengths from u to nodes i and j are defined as:

Siu = Dij/2 + (Ri - Rj)/(2(t - 2))
Sju = Dij - Siu

(4) Define the distance from u to every other node k of the tree (k ≠ i, j):

Dku = (Dik + Djk - Dij)/2

(5) Remove the distances involving i and j from the distance matrix and add those of u, which reduces the order of the matrix by one.

(6) If the matrix still has more than two nodes, repeat steps (1) to (5). When only two nodes i and j remain, every branch length on the tree has been determined except the branch between them, whose length is Sij = Dij.
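The NJ procedure can be sketched as follows, using the standard neighbor-joining formulas for net divergence, rate-corrected distance, branch lengths, and updated distances. The function name, the internal node labels U1, U2, ..., and the toy matrix are assumptions made for this example. Ties in the corrected matrix are broken by taking the first pair found.

```python
def neighbor_joining(names, dist):
    """Return the NJ tree as a list of (node, node, branch_length) edges."""
    nodes = list(names)
    d = {a: {b: dist[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    edges = []
    count = 0
    while len(nodes) > 2:
        t = len(nodes)
        # Step 1: net divergence Ri = sum of distances from i to the others
        r = {i: sum(d[i][k] for k in nodes if k != i) for i in nodes}
        # Step 2: rate-corrected distance Mij = Dij - (Ri + Rj) / (t - 2);
        # keep the pair with the smallest value
        best = None
        for a in range(t):
            for b in range(a + 1, t):
                i, j = nodes[a], nodes[b]
                m = d[i][j] - (r[i] + r[j]) / (t - 2)
                if best is None or m < best[0]:
                    best = (m, i, j)
        _, i, j = best
        # Step 3: new node u and its branch lengths to i and j
        count += 1
        u = f"U{count}"
        s_iu = d[i][j] / 2 + (r[i] - r[j]) / (2 * (t - 2))
        s_ju = d[i][j] - s_iu
        edges += [(i, u, s_iu), (j, u, s_ju)]
        # Step 4: distance from u to every remaining node k
        d[u] = {}
        for k in nodes:
            if k not in (i, j):
                d[u][k] = d[k][u] = (d[i][k] + d[j][k] - d[i][j]) / 2
        # Step 5: remove i and j, add u (the matrix order drops by one)
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    # Step 6: the final branch joins the two remaining nodes
    i, j = nodes
    edges.append((i, j, d[i][j]))
    return edges

# Additive but non-clock-like distances (A=1, B=4, C=2, D=3, internal=1).
names = ["A", "B", "C", "D"]
dist = [[0, 5, 4, 5],
        [5, 0, 7, 8],
        [4, 7, 0, 5],
        [5, 8, 5, 0]]
edges = neighbor_joining(names, dist)
for edge in edges:
    print(edge)
```

Because the toy distances are exactly additive, the recovered branch lengths match the tree they were generated from, even though the terminal branches have very different lengths.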

Third: Maximum Parsimony Method

Rationale: The method is based on Ockham's razor, the philosophical principle that the best theory to explain a process is the one that requires the fewest assumptions.

Methods: All possible topologies are evaluated, and the topology that requires the minimum number of substitutions is selected as the optimal tree.

Features: The method is suitable for analyzing sequence characters such as insertions and deletions. When the sequences contain many reversions or parallel mutations and the number of sites examined is relatively small, the maximum parsimony method may infer an unreasonable or erroneous phylogenetic tree.
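For four taxa there are only three possible unrooted topologies, so the search for the minimum number of substitutions can be shown in full. The sketch below counts substitutions with Fitch's algorithm, which is one standard way to do the counting and is not named in the text above. The sequences, topology labels, and function names are invented for the example.

```python
def fitch_site(node, seqs, site):
    """Return (possible states, substitutions) for one site below `node`."""
    if isinstance(node, str):                  # leaf: its observed state
        return {seqs[node][site]}, 0
    (ls, lc), (rs, rc) = (fitch_site(child, seqs, site) for child in node)
    common = ls & rs
    if common:                                 # children agree: no change
        return common, lc + rc
    return ls | rs, lc + rc + 1                # children differ: one change

def parsimony_score(tree, seqs):
    """Minimum number of substitutions needed by `tree` over all sites."""
    length = len(next(iter(seqs.values())))
    return sum(fitch_site(tree, seqs, s)[1] for s in range(length))

# Hypothetical aligned sequences, one string per taxon.
seqs = {"A": "AAGT", "B": "AAGC", "C": "CTGT", "D": "CTGC"}

# The three topologies for four taxa, written as nested pairs.
topologies = {
    "AB|CD": (("A", "B"), ("C", "D")),
    "AC|BD": (("A", "C"), ("B", "D")),
    "AD|BC": (("A", "D"), ("B", "C")),
}

scores = {name: parsimony_score(t, seqs) for name, t in topologies.items()}
best = min(scores, key=scores.get)
print(scores)
print(best)
```

Sites 1 and 2 support grouping A with B, site 3 is invariant, and site 4 supports grouping A with C. The conflicting fourth site shows how parallel changes can pull the score toward a different topology when few sites are available.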

Fourth: Maximum Likelihood Method

Rationale: This method was first used by the geneticist and statistician Sir Ronald Fisher between 1912 and 1922. The basic idea is that when n sample observations are drawn at random from the model population, the most reasonable parameter estimate is the one that maximizes the probability of drawing those n observations from the model. This differs from least squares estimation, which chooses the parameter estimate that makes the model fit the sample data best.

Methods: Choose a specific substitution model for the given set of sequence data, find for each topology the parameter values that maximize its likelihood, and then select the topology with the highest likelihood as the optimal tree. This is why the analysis takes longer.

Features: The maximum likelihood method has a sound basis in statistical theory and is a relatively mature statistical method. With a reasonable model, it can produce a good evolutionary tree. However, for sequences with low similarity, the NJ method often suffers from long-branch attraction (LBA), which can seriously distort the inferred phylogenetic tree.
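To make the likelihood comparison between topologies concrete, the sketch below scores three four-taxon trees under the Jukes-Cantor (JC69) substitution model using Felsenstein's pruning algorithm. Both are choices made for this illustration and are not named in the text above. As a strong simplification, every branch is given the same fixed length t instead of being optimized. The sequences and names are hypothetical.

```python
import math

BASES = "ACGT"

def jc69_prob(same, t):
    """JC69 probability of ending in the same (or one specific other)
    base after a branch of length t substitutions per site."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e

def partial(node, seqs, site, t):
    """Likelihood of the data below `node`, for each base at `node`."""
    if isinstance(node, str):                  # leaf: observed base only
        return [1.0 if b == seqs[node][site] else 0.0 for b in BASES]
    result = [1.0] * 4
    for child in node:
        below = partial(child, seqs, site, t)
        for x in range(4):
            result[x] *= sum(jc69_prob(x == y, t) * below[y]
                             for y in range(4))
    return result

def log_likelihood(tree, seqs, t=0.1):
    """Log-likelihood of `tree`, with equal base frequencies at the root."""
    length = len(next(iter(seqs.values())))
    total = 0.0
    for site in range(length):
        root = partial(tree, seqs, site, t)
        total += math.log(sum(0.25 * v for v in root))
    return total

seqs = {"A": "AAGT", "B": "AAGC", "C": "CTGT", "D": "CTGC"}
topologies = {
    "AB|CD": (("A", "B"), ("C", "D")),
    "AC|BD": (("A", "C"), ("B", "D")),
    "AD|BC": (("A", "D"), ("B", "C")),
}

lnl = {name: log_likelihood(t, seqs) for name, t in topologies.items()}
best = max(lnl, key=lnl.get)
for name, value in lnl.items():
    print(name, round(value, 3))
print(best)
```

A real maximum likelihood analysis would also optimize the branch lengths and model parameters for each topology before comparing them, which is the main reason the method is slower than distance-based approaches.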

