How is the chi-square test applied in association analysis?

2025-03-27 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how the chi-square test is applied in case/control association analysis, working through the calculation on a small example.

The essence of case/control association analysis is to look for SNP loci whose genotype distribution differs between the two groups. These loci are candidate association signals. The commonly used analysis methods are as follows.

Chi-square test

Fisher's exact test

Logistic regression

The chi-square test is a widely used non-parametric hypothesis test that is suited to the analysis of categorical variables. The data take the form of a contingency table in which the rows and the columns correspond to two categorical variables.

For case/control association analysis, we have two categorical variables. The first is the sample group, with two levels: case and control. The second is the allele or the genotype. For alleles there are two kinds, the major and the minor allele. For genotypes there are three kinds: AA, Aa and aa. In an actual analysis, genetic models are also considered in order to group the genotypes further. The commonly used genetic models are as follows.

Dominant model: a single copy of the mutation is enough to cause disease, so heterozygous and homozygous mutants are grouped together. Genotypes are divided into two categories: the first is AA and Aa, the second is aa.

Recessive model: only the homozygous mutation causes disease. Genotypes are again divided into two categories: the first is the homozygous mutant AA, the second is everything else, Aa and aa.

Additive model: the number of mutant alleles affects the phenotypic value of the trait, and the effects accumulate additively. The effect of the homozygous mutant is 2 times that of the heterozygous mutant. Genotypes are divided into three categories: AA, Aa and aa.

Multiplicative model: the number of mutant alleles affects the phenotypic value of the trait, and the effects combine multiplicatively. The effect of the homozygous mutant is the square of the heterozygous effect (4 times when the heterozygous effect is 2 times). Genotypes are divided into three categories: AA, Aa and aa.

In terms of how the genotypes are grouped, these models fall into three cases: the dominant model (two categories), the recessive model (two categories), and the additive model, the multiplicative model and the plain genotype classification, all of which keep the three genotypes separate.
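The groupings described above can be sketched in R. This is a minimal illustration, not the article's own code: it reuses the genotype counts from the worked example later in the article, treats A as the mutant allele, and the names `observed`, `dominant`, `recessive` and `alleles` are chosen here for illustration.

```r
# Observed genotype counts from the article's worked example
# (assumption: A is the mutant allele, as in the model descriptions)
observed <- matrix(c(30, 15, 55,
                     28, 12, 60),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(group = c("Case", "Control"),
                                   genotype = c("AA", "Aa", "aa")))

# Dominant model: carriers (AA or Aa) versus non-carriers (aa)
dominant <- cbind("AA+Aa" = observed[, "AA"] + observed[, "Aa"],
                  "aa"    = observed[, "aa"])

# Recessive model: homozygous mutant (AA) versus everything else (Aa or aa)
recessive <- cbind("AA"    = observed[, "AA"],
                   "Aa+aa" = observed[, "Aa"] + observed[, "aa"])

# Allele-level table: every sample contributes two alleles
alleles <- cbind("A" = 2 * observed[, "AA"] + observed[, "Aa"],
                 "a" = 2 * observed[, "aa"] + observed[, "Aa"])

print(dominant)
print(recessive)
print(alleles)
```

Each collapsed table is 2 x 2, so a chi-square test on it has 1 degree of freedom, while the three-genotype table keeps 2.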

For the chi-square test, the chi-square statistic is calculated from the counts in the table. The formula is as follows.

$$\chi^2 = \sum \frac{(A - T)^2}{T}$$

A represents the actual (observed) count in a cell and T represents the theoretical (expected) count; the sum runs over all cells of the table. From the formula, we can see that the chi-square statistic measures the difference between the actual values and the theoretical values. Look at a specific example.

Genotype   AA   Aa   aa
Case       30   15   55
Control    28   12   60

The table above shows the genotype counts actually observed in the two groups. The corresponding proportions are as follows.

Genotype   AA    Aa    aa
Case       30%   15%   55%
Control    28%   12%   60%

The numbers show that the distribution differs between the two groups, but they do not tell us whether the difference is due to sampling error or is real. Assuming that there is no difference between the two groups, merge the samples and recompute the proportions, which gives 29%, 13.5% and 57.5% for AA, Aa and aa respectively. These three values are the theoretical proportions, and the theoretical counts are calculated from them.

Genotype   AA          Aa            aa
Case       100 x 29%   100 x 13.5%   100 x 57.5%
Control    100 x 29%   100 x 13.5%   100 x 57.5%
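The theoretical counts can be reproduced in a few lines of R. This is a minimal sketch of the pooling step described above; the variable names `observed`, `pooled` and `expected` are illustrative and do not come from the original article.

```r
# Observed genotype counts from the example
observed <- matrix(c(30, 15, 55,
                     28, 12, 60),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(group = c("Case", "Control"),
                                   genotype = c("AA", "Aa", "aa")))

# Pooled genotype proportions under the assumption of no group difference
pooled <- colSums(observed) / sum(observed)

# Theoretical count per cell = group size x pooled proportion
expected <- outer(rowSums(observed), pooled)

print(pooled)
print(expected)
```

With 100 samples in each group, both rows of `expected` come out as 29, 13.5 and 57.5.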

The chi-square value is then calculated with the formula, and the final result is 0.61969. In R, the same test can be run with the chisq.test function.
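The article's original R code is not reproduced in this text. The following is a minimal sketch that should give the same numbers, assuming the test was run with base R's `chisq.test`; the variable names `observed`, `result` and `manual` are illustrative.

```r
# Observed genotype counts from the example
observed <- matrix(c(30, 15, 55,
                     28, 12, 60),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(group = c("Case", "Control"),
                                   genotype = c("AA", "Aa", "aa")))

# Pearson chi-square test on the 2 x 3 table
# (the continuity correction in chisq.test only applies to 2 x 2 tables)
result <- chisq.test(observed)

# The same statistic computed directly from the formula sum((A - T)^2 / T)
manual <- sum((observed - result$expected)^2 / result$expected)

print(result)
print(manual)
```

The printed test result should show X-squared, df and p-value, matching the 0.61969 quoted above, and `manual` should agree with `result$statistic`.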

In the output of the chi-square test, in addition to the chi-square value X-squared, there are two further values, df and p-value. df is the degrees of freedom, computed as (number of rows - 1) x (number of columns - 1). The data above form a 2 x 3 table, so the degrees of freedom are 2. Why do the degrees of freedom matter?

We should start with the definition of the chi-square distribution. For N independent variables that each follow the standard normal distribution, the sum of their squares follows the chi-square distribution. The degrees of freedom refer to this N, and the chi-square distribution is different for different degrees of freedom.
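This definition can be checked with a small simulation. The sketch below is illustrative only: the seed, the number of draws and the variable names are arbitrary choices made here. It sums the squares of 2 independent standard normal variables and compares the result with the chi-square distribution with 2 degrees of freedom.

```r
set.seed(42)            # arbitrary seed, for reproducibility only
n_sim  <- 100000        # number of simulated sums
n_vars <- 2             # N, the degrees of freedom

# Each row holds n_vars independent standard normal draws
z <- matrix(rnorm(n_sim * n_vars), ncol = n_vars)
sum_sq <- rowSums(z^2)

# A chi-square variable with n_vars degrees of freedom has mean n_vars
sim_mean <- mean(sum_sq)

# Share of simulated sums above the example's chi-square value,
# next to the theoretical upper-tail probability
sim_tail  <- mean(sum_sq > 0.6196902)
theo_tail <- pchisq(0.6196902, df = n_vars, lower.tail = FALSE)

print(c(simulated_mean = sim_mean, expected_mean = n_vars))
print(c(simulated_tail = sim_tail, theoretical_tail = theo_tail))
```

Changing `n_vars` shifts the whole distribution, which is why the same chi-square value leads to different p-values under different degrees of freedom.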

The density of the chi-square distribution differs greatly between degrees of freedom, so the degrees of freedom must be fixed before a chi-square value can be used to make a judgment. Given the degrees of freedom and the chi-square value, the corresponding p-value is looked up in the chi-square distribution table. The corresponding code in R is as follows

1 - pchisq(0.6196902, df = 2)

[1] 0.7335606

pchisq is the cumulative distribution function of the chi-square distribution. Here it returns the probability that a chi-square value is less than 0.6196902, so subtracting it from 1 gives the probability of a value greater than 0.6196902, which is the p-value. The chi-square distribution table lists, for each degree of freedom, the critical values that are exceeded with a given probability.

The smaller the chi-square value, the greater the corresponding upper-tail probability. With 2 degrees of freedom, the chi-square critical value at the 0.05 significance level is 5.99. The chi-square value of the example is below this critical value, which means its p-value is greater than 0.05. The null hypothesis therefore cannot be rejected, and there is no significant difference between the two groups.
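The table lookup and the decision rule can also be done directly in R. This is a minimal sketch using base R's `qchisq` and `pchisq`; the variable names are illustrative.

```r
chi_sq   <- 0.6196902   # chi-square value from the example
df_value <- 2           # degrees of freedom of the 2 x 3 table
alpha    <- 0.05        # significance level

# Critical value: the point exceeded with probability alpha
critical <- qchisq(1 - alpha, df = df_value)

# p-value: probability of a chi-square value larger than the observed one
p_value <- pchisq(chi_sq, df = df_value, lower.tail = FALSE)

# Reject the null hypothesis only if the statistic exceeds the critical value
reject_null <- chi_sq > critical

# The upper-tail probability shrinks as the chi-square value grows
tail_probs <- pchisq(c(0.5, 2, 6, 10), df = df_value, lower.tail = FALSE)

print(critical)
print(p_value)
print(reject_null)
print(tail_probs)
```

Comparing the statistic with the critical value and comparing the p-value with alpha always lead to the same decision; `lower.tail = FALSE` is equivalent to the `1 - pchisq(...)` form used above.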

Although the chi-square test is widely used, it has some restrictions. As a rule of thumb, the sample size should be greater than 40 and the smallest count should not be less than 5. The count here refers to the theoretical (expected) count, not the observed one.

For 2 x 2 data that do not meet these requirements, Fisher's exact test is recommended instead.
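The choice between the two tests can be sketched as follows. The 2 x 2 table here is hypothetical data made up for illustration, and `chi_square_ok` is a hypothetical helper that simply encodes the rule of thumb stated above; neither comes from the original article.

```r
# Hypothetical 2 x 2 table with small counts (illustrative data only)
small <- matrix(c(3, 1,
                  1, 3),
                nrow = 2, byrow = TRUE,
                dimnames = list(group = c("Case", "Control"),
                                allele = c("A", "a")))

# Hypothetical helper: TRUE when the sample size is greater than 40
# and every theoretical (expected) count is at least 5
chi_square_ok <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum(tab) > 40 && min(expected) >= 5
}

# Fall back to Fisher's exact test when the requirements are not met
if (chi_square_ok(small)) {
  result <- chisq.test(small)
} else {
  result <- fisher.test(small)
}

print(chi_square_ok(small))
print(result$p.value)
```

For this table the total is only 8 and every expected count is 2, so the helper returns FALSE and `fisher.test` is used.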

