2025-03-27 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article explains the principle of the chi-square test and a Python implementation of it. The method introduced here is simple, fast, and practical.
The principle of the chi-square test:
The chi-square test measures how far the actual observed values of a statistical sample deviate from the theoretical values inferred under a hypothesis. This deviation determines the size of the chi-square value: the larger the chi-square value, the greater the deviation, and the smaller the value, the smaller the deviation. If the observed and theoretical values are exactly equal, the chi-square value is 0, meaning the observations agree completely with the theoretical values, which are computed on the assumption that the variables are unrelated. In other words, the greater the deviation, the stronger the evidence that the variables are related.
Note: the chi-square test applies to categorical variables.
Test method (four-grid table for independent samples):
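Suppose there are two categorical variables X and Y, whose possible values are {x1, x2} and {y1, y2}, respectively. Their sample frequencies can be arranged in a 2 x 2 contingency table (a four-grid table), with one row for each value of X and one column for each value of Y.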
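As a minimal sketch of how such a frequency table can be tallied, the snippet below counts paired observations with `collections.Counter`. The helper name `contingency_table` and the sample data are illustrative assumptions, not something taken from the article.

```python
from collections import Counter

def contingency_table(xs, ys, x_values, y_values):
    """Count how often each (x, y) pair occurs and lay the counts out as rows."""
    counts = Counter(zip(xs, ys))
    # One row per value of X, one column per value of Y.
    return [[counts[(x, y)] for y in y_values] for x in x_values]

# Hypothetical paired observations of two categorical variables.
xs = ["x1", "x1", "x2", "x2", "x2", "x1"]
ys = ["y1", "y2", "y1", "y1", "y2", "y1"]

table = contingency_table(xs, ys, ["x1", "x2"], ["y1", "y2"])
print(table)
```

A pair that never occurs simply counts as 0, because `Counter` returns 0 for missing keys.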
To assess the claim H1: "there is a relationship between X and Y", an independence test can be used to examine whether the two variables are related and to state how reliable that judgment is. The specific method is to compute, from the data in the table, the value of the test statistic
$$\chi^2 = \sum \frac{(A - T)^2}{T}$$
where A is an actual (observed) value, that is, one of the four counts in the observed four-grid table, and T is the corresponding theoretical value, that is, one of the four counts in the four-grid table of theoretical values.
χ² measures the difference between the actual values and the theoretical values (this is the core idea of the chi-square test) and carries two pieces of information:
The absolute size of the deviation between the actual value and the theoretical value (squaring magnifies the difference)
The size of that deviation relative to the theoretical value
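The statistic described above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the article's own code: the function name `chi_square` is hypothetical, the theoretical value of each cell is taken as row total * column total / grand total (the usual independence assumption), and every row and column total is assumed to be positive.

```python
def chi_square(observed):
    """Return the sum of (A - T)**2 / T over all cells of a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, a in enumerate(row):
            # Theoretical count if the row and column variables were independent.
            t = row_totals[i] * col_totals[j] / n
            stat += (a - t) ** 2 / t
    return stat

print(chi_square([[10, 20], [20, 40]]))  # rows are proportional to each other
print(chi_square([[10, 20], [20, 10]]))  # rows differ markedly
```

When the rows are exactly proportional, every actual value equals its theoretical value and the statistic is 0; the second table gives a clearly positive value.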
Critical value of chi-square distribution
Now that we have the χ² value, how do we judge whether it is large enough to matter? In other words, how do we know whether the hypothesis that the variables are unrelated is tenable? The answer is to look the value up in a table of critical values of the chi-square distribution.
This requires the concept of degrees of freedom: V = (number of rows - 1) * (number of columns - 1). For a four-grid table, V = 1.
For V = 1, the critical values are read from the V = 1 row of the chi-square table; for example, the critical value at a probability of 0.05 is 3.841.
If a categorical feature has more than two categories, the degrees of freedom change, and the corresponding critical values and probabilities of the chi-square distribution change with them.
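Instead of a printed table, the V = 1 case can be worked out with the standard library alone. The sketch below relies on the identity P(χ² > x) = erfc(sqrt(x / 2)), which holds only for one degree of freedom, and inverts it by bisection. The function names are hypothetical, and the search assumes the critical value lies below `hi`.

```python
import math

def degrees_of_freedom(n_rows, n_cols):
    """V = (number of rows - 1) * (number of columns - 1)."""
    return (n_rows - 1) * (n_cols - 1)

def p_value_df1(x):
    """P(chi-square with 1 degree of freedom > x)."""
    return math.erfc(math.sqrt(x / 2.0))

def critical_value_df1(alpha, hi=100.0):
    """Value x at which p_value_df1(x) drops to alpha, found by bisection."""
    lo = 0.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if p_value_df1(mid) > alpha:
            lo = mid  # tail probability still too large, move right
        else:
            hi = mid
    return hi

print(degrees_of_freedom(2, 2))
for alpha in (0.10, 0.05, 0.01):
    print(alpha, round(critical_value_df1(alpha), 3))
```

The loop should reproduce the familiar tabulated values of about 2.706, 3.841 and 6.635. For more than one degree of freedom this shortcut no longer applies, and a statistics library or a printed table is needed.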
Example 1: categorical feature correlation test
You can think of this example as testing whether the target variable (whether a news item belongs to entertainment) is related to an independent variable (whether the title contains Wu Yifan). The same test can assist in screening categorical features, or simply show whether two features are related to each other.
For example, suppose we have a collection of news headlines and need to determine whether a word in the title (such as Wu Yifan) is related to the category of the news (such as entertainment). Simple counting is enough to produce a four-grid table of headline counts.
The first thing this four-grid table tells us is that, in the sample, whether the title contains Wu Yifan does make a difference to whether the news belongs to entertainment: news whose title contains Wu Yifan has a higher proportion of entertainment items. However, we cannot yet rule out that this difference is caused by sampling error. So we first assume that whether the title contains Wu Yifan is independent of whether the news belongs to entertainment. Under that assumption, the probability that a randomly selected news title belongs to the entertainment category is (19 + 34) / (19 + 34 + 24 + 10) = 60.9%.
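The calculation for this example can be sketched as follows. The four counts come from the text above, but how they pair up into rows is an assumption made here: (19, 24) is taken to be the headlines without the name and (34, 10) the headlines with it, which is consistent with the statement that headlines containing the name are more often entertainment. The theoretical counts are each group's total multiplied by the pooled entertainment rate.

```python
# Counts from the article: 19 + 34 entertainment headlines, 24 + 10 others.
# ASSUMPTION: (19, 24) are headlines WITHOUT the name, (34, 10) those WITH it.
observed = {
    "without name": {"entertainment": 19, "other": 24},
    "with name": {"entertainment": 34, "other": 10},
}

total = sum(sum(row.values()) for row in observed.values())
entertainment_rate = sum(row["entertainment"] for row in observed.values()) / total

chi2 = 0.0
for group, row in observed.items():
    group_total = sum(row.values())
    # Theoretical counts if title wording and category were independent.
    expected = {
        "entertainment": group_total * entertainment_rate,
        "other": group_total * (1 - entertainment_rate),
    }
    for category, actual in row.items():
        chi2 += (actual - expected[category]) ** 2 / expected[category]
    print(group, {k: round(v, 2) for k, v in expected.items()})

print(round(entertainment_rate, 3), round(chi2, 2))
```

Under this assumed layout the statistic comes out at roughly 10, well above the 0.05 critical value of 3.841 for one degree of freedom, so the independence hypothesis would be rejected.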
The chi-square value can be computed in Python as follows (df is a pandas DataFrame with one row per group):
    def Chi2(df, total_col, bad_col):
        # :param df: DataFrame holding, for each group, the total sample count and the bad sample count
        # :param total_col: name of the column with the total number of samples
        # :param bad_col: name of the column with the number of bad samples
        # :return: chi-square value
        df2 = df.copy()
        # overall bad sample rate across all groups in df
        badRate = sum(df2[bad_col]) * 1.0 / sum(df2[total_col])
        # when the samples are all good or all bad, the chi-square value is 0
        if badRate in [0, 1]:
            return 0
        df2['good'] = df2.apply(lambda x: x[total_col] - x[bad_col], axis=1)
        goodRate = sum(df2['good']) * 1.0 / sum(df2[total_col])
        # expected number of bad (good) samples = total number of samples * overall bad (good) rate
        df2['badExpected'] = df[total_col].apply(lambda x: x * badRate)
        df2['goodExpected'] = df[total_col].apply(lambda x: x * goodRate)
        badCombined = zip(df2['badExpected'], df2[bad_col])
        goodCombined = zip(df2['goodExpected'], df2['good'])
        # each term is (expected - actual) ** 2 / expected
        badChi = [(i[0] - i[1]) ** 2 / i[0] for i in badCombined]
        goodChi = [(i[0] - i[1]) ** 2 / i[0] for i in goodCombined]
        chi2 = sum(badChi) + sum(goodChi)
        return chi2