2025-01-19 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
Today we will look at how to perform Metastats-style analysis in R. The topic may be unfamiliar to many readers, so the key points are summarized below; I hope you find them useful.
Community-level tests of between-group differences such as ANOSIM, Adonis, and MRPP can quickly evaluate whether a grouping is meaningful. Often, however, we want to know more about how microbial communities differ among groups, that is, which species differ significantly. The most straightforward approach is to test every species for significance across groups. Because this means running many tests on one data set, p-value correction is needed to obtain reliable results.
Two tools commonly used to find differentially abundant species between groups are Metastats and LEfSe. Setting the tools themselves aside, in terms of algorithm Metastats is essentially a nonparametric multiple test combined with p-value correction, while LEfSe combines such a test with LDA (linear discriminant analysis). Because Metastats uses a nonparametric t-test, it can only compare two groups, whereas LEfSe can handle more than two groups thanks to the Kruskal-Wallis rank sum test. Once we understand these principles, we can choose an appropriate method directly in R instead of being tied to the two tools themselves.
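As a minimal sketch of this distinction (using base R functions and made-up abundance values, not data from any real study): `wilcox.test` performs the two-group nonparametric comparison that matches the Metastats setting, while `kruskal.test` extends to three or more groups as in LEfSe.

```r
# Toy abundances of one species across nine samples (illustrative values only)
abund <- c(12, 15, 11, 30, 28, 33, 21, 19, 24)
group2 <- factor(c("A", "A", "A", "B", "B", "B", "B", "B", "B"))  # two groups
group3 <- factor(rep(c("A", "B", "C"), each = 3))                 # three groups

# Two groups: Wilcoxon (Mann-Whitney) rank sum test, the two-group setting of Metastats
w <- wilcox.test(abund ~ group2)

# Two or more groups: Kruskal-Wallis rank sum test, as used by LEfSe
k <- kruskal.test(abund ~ group3)

w$p.value
k$p.value
```

In a real analysis the loop over species (shown later in this article) would simply call one of these two tests per species, depending on the number of groups.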
P-value correction
Hypothesis testing is a probabilistic judgment: we reject the null hypothesis because a low-probability event has occurred. If we make many such judgments at the same time, however, the chance of error accumulates. For example, when testing pairwise correlations among k variables, we must run k(k-1)/2 correlation tests, each with its own significance test. At a significance level of 0.05 (confidence level 0.95), the probability that a single test is correct is 0.95, but the probability that all of n independent tests are correct is 0.95^n. To keep the overall probability of correctness above 0.95, we must either tighten the significance level or, more commonly, correct the p-values. A common method is Bonferroni correction, whose principle is that if n independent hypothesis tests are performed on the same data set, the significance level of each test should be 1/n of that used for a single test. For example, testing the correlation of two variables uses a significance level of 0.05; testing 5 variables in one data set requires 10 tests, so the per-test significance level should be 0.005. Equivalently, the Bonferroni-corrected p-value is 10 times the original p-value.
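The arithmetic above can be checked directly in R (a sketch with made-up p-values): with 10 tests, Bonferroni multiplies each p-value by 10, capped at 1.

```r
# Ten hypothetical raw p-values from 10 correlation tests (illustrative values only)
p <- c(0.001, 0.004, 0.019, 0.095, 0.201, 0.278, 0.298, 0.344, 0.459, 0.900)

# Bonferroni: q = min(n * p, 1); with n = 10, p = 0.004 becomes q = 0.04
q <- p.adjust(p, method = "bonferroni")
q[2]   # 0.04
q[10]  # capped at 1, since 0.900 * 10 > 1
```

Note the cap at 1: a corrected p-value is still a probability, so values above 1 are truncated.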
In R, p-values can be corrected with the p.adjust() function, which is used as follows:
p.adjust(p, method = p.adjust.methods, n = length(p))
Here p is a numeric vector of significance-test p-values, n is the number of independent tests (usually length(p)), and method is the correction method. The available methods are "bonferroni", "holm", "hochberg", "hommel", "BH", "fdr", "BY", and "none". Of these, "bonferroni" is the most conservative, meaning the corrected p-values become largest, so it is not often used in practice; the other options are various less conservative correction methods.
The corrected p-value is often called the q-value; a p-value corrected by the Benjamini-Hochberg (BH) method is also called the false discovery rate (FDR).
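A small sketch (again with made-up p-values) makes the BH q-value concrete: for sorted p-values, q_(i) is the smallest value of p_(j) * n / j over all j >= i, which can be reproduced by hand and compared against p.adjust():

```r
# Hypothetical raw p-values (illustrative values only)
p <- c(0.001, 0.004, 0.019, 0.095, 0.201)
q <- p.adjust(p, method = "BH")

# Manual BH on sorted p-values: q_(i) = min over j >= i of ( p_(j) * n / j )
n <- length(p)
manual <- rev(cummin(rev(sort(p) * n / seq_len(n))))[order(order(p))]
all.equal(q, manual)  # TRUE
```

The running minimum from the right (cummin over the reversed vector) enforces that q-values are monotone in the raw p-values.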
Next, I'll use the same data as an example to look for species with significant differences between groups:

# Read the extracted OTU table and environmental factor information
data = read.csv("otu_table.csv", header = TRUE, row.names = 1)
envir = read.table("environment.txt", header = TRUE)
rownames(envir) = envir[, 1]
env = envir[, -1]

# Screen high-abundance species and transpose the species table
means = apply(data, 1, mean)
otu = data[names(means[means > 10]), ]
otu = t(otu)

# Cluster samples into three groups based on geographical distance
kms = kmeans(env, centers = 3, nstart = 22)
Position = factor(kms$cluster)
newotu = data.frame(Group = Position, otu)

# Perform multiple Kruskal-Wallis rank sum tests and p-value correction
pvalue = t(otu)[, 1:2]
colnames(pvalue) = c("p-value", "q-value")
for (i in 2:ncol(newotu)) {
  t = kruskal.test(newotu[, i] ~ newotu[, 1])
  pvalue[i - 1, 1] = t$p.value
}
pvalue[, 2] = p.adjust(pvalue[, 1], method = "BH", n = nrow(pvalue))
pvalue = pvalue[order(pvalue[, 1]), ]
Next, we can screen and visualize species with significant differences:
# Screen species with q-value less than 0.05
top = pvalue[pvalue[, 2] < 0.05, ]
© 2024 shulou.com SLNews company. All rights reserved.